Re: Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-02-03 Thread 大啊
+1 for this work!
I agree with the clarifications quoted below:
SQL is a somewhat different case. There are functions that aren't _that_ useful 
in general, kind of niche, but nevertheless exist in other SQL systems, most 
notably Hive. It's useful to try to expand SQL support to cover those to ease 
migration and interoperability. But it may not make enough sense to maintain 
those functions in Scala, and Python, and R, because they're niche.

At 2021-01-29 04:40:07, "Sean Owen"  wrote:

I think I can articulate the general idea here, though I expect it is not 
deployed consistently.


Yes there's a general desire to make APIs consistent across languages. Python 
and Scala should track pretty closely, even if R isn't really that consistent.


SQL is a somewhat different case. There are functions that aren't _that_ useful 
in general, kind of niche, but nevertheless exist in other SQL systems, most 
notably Hive. It's useful to try to expand SQL support to cover those to ease 
migration and interoperability. But it may not make enough sense to maintain 
those functions in Scala, and Python, and R, because they're niche.


I think that was what you saw with regexp_extract_all. As you can see there 
isn't perfect agreement on where to draw those lines. But I think the theory 
has been mostly consistent over time, if not the execution.


It isn't that regexp_extract_all (for example) is useless outside SQL, just, 
where do you draw the line? Supporting 10s of random SQL functions across 3 
other languages has a cost, which has to be weighed against benefit, which we 
can never measure well except anecdotally: one or two people say "I want this" 
in a sea of hundreds of thousands of users.


(I'm not sure about CalendarType - I just know that date/time types are hard 
even within, say, the JVM, let alone across languages)


For this specific case, I think there is a fine argument that 
regexp_extract_all should be added simply for consistency with regexp_extract. 
I can also see the argument that regexp_extract was a step too far, but, what's 
public is now a public API.


I'll also say that the cost of adding API functions grows as a project matures, 
and whereas it might have made sense to add this at an earlier time, it might 
not make sense now.


I come out neutral on this specific case, but would not override the opinion of 
other committers. But I hope that explains the logic that I think underpins 
what you're hearing.

On Thu, Jan 28, 2021 at 2:23 PM MrPowers  wrote:

Thank you all for your amazing work on this project.  Spark has a great
public interface and the source code is clean.  The core team has done a
great job building and maintaining this project.  My emails / GitHub
comments focus on the 1% that we might be able to improve.

Pull requests / suggestions for improvements can come across as negative,
but I'm nothing but happy & positive about this project.  The source code is
delightful to read and the internal abstractions are beautiful.

*API consistency*

The SQL, Scala, and Python APIs are generally consistent.  They all have a
reverse function for example.
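
For illustration, a minimal sketch of that consistency in Scala (the
DataFrame df and its word1 column are hypothetical, not from the thread):

    import org.apache.spark.sql.functions.{col, expr, reverse}

    // The same built-in is reachable from the typed Scala API and, via the
    // SQL expression parser, under the same name:
    val viaScala = df.select(reverse(col("word1")))
    val viaSql   = df.select(expr("reverse(word1)"))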

Some of the new PRs have arguments against consistent rollout of functions
across the APIs.  This seems like a break in the traditional Spark
development process when functions were implemented in all APIs (except for
functions that only make sense for certain APIs like createDataset and
toDS).

The default has shifted from consistent application of function across APIs
to "case by case determination".

*Examples*

* The regexp_extract_all function was recently added to the SQL API.  It was
then added to the Scala API, but then removed from the Scala API.

* There is an ongoing discussion about whether CalendarType will be added to
the Python API.

*Arguments against adding functions like regexp_extract_all to the Scala
API:*

* Some of these functions are SQL specific and don't make sense for the
other languages

* Scala users can access the SQL functions via expr

*Argument rebuttal*

I don't understand the "some of the functions are SQL specific" argument.
regexp_extract_all fills a gap in the API.  Users have been forced to use
UDF workarounds for this in the past.  Users from all APIs need this
solution. 
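
As a concrete sketch of the kind of UDF workaround meant here (the DataFrame
df, its word1 column, and the fixed pattern are hypothetical):

    import org.apache.spark.sql.functions.{col, udf}

    // Extract every match of a pattern. The built-in regexp_extract returns
    // only one match, so before regexp_extract_all existed users fell back
    // to a UDF like this, which also opts out of Catalyst optimization.
    val extractAllNumbers = udf { s: String =>
      if (s == null) null else "\\d+".r.findAllIn(s).toSeq
    }

    val withMatches = df.withColumn("nums", extractAllNumbers(col("word1")))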

Using expr isn't developer friendly.  Scala / Python users don't want to
manipulate SQL strings.  Nesting functions in SQL strings is complicated.
The quoting and escaping is all different.  Figuring out how to invoke
regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain -
would need to figure out SQL quoting, SQL escaping, and how to access column
names instead of a column object.

Any of the org.apache.spark.sql.functions can be invoked via expr.  The core
reason the Scala/Python APIs exist is so that developers don't need to
manipulate strings for expr.
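
For concreteness, a minimal sketch of the two styles (df and its columns are
hypothetical; the exact escaping below depends on the SQL parser's
string-literal settings):

    import org.apache.spark.sql.functions.{col, expr, regexp_replace}

    // Native column-object API: Scala handles the escaping once.
    val viaApi = df.withColumn("word2",
      regexp_replace(col("word1"), "//", "\\,"))

    // The same call routed through expr: the arguments now live inside a SQL
    // string, so SQL quoting and escaping stack on top of Scala's, and the
    // column must be referenced by name instead of as a Column object.
    val viaExpr = df.withColumn("word2",
      expr("regexp_replace(word1, '//', '\\\\,')"))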

regexp_extract_all should be added to the Scala API for the same reasons that
regexp_extract was added to the Scala API.

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-31 Thread 大啊
+1 for this work!  But I still don't know how to distinguish common and 
uncommon functions.
It seems that we should decide case by case. This work will cause some
confusion.

At 2021-01-29 04:23:08, "MrPowers"  wrote:
>Thank you all for your amazing work on this project.  Spark has a great
>public interface and the source code is clean.  The core team has done a
>great job building and maintaining this project.  My emails / GitHub
>comments focus on the 1% that we might be able to improve.
>
>Pull requests / suggestions for improvements can come across as negative,
>but I'm nothing but happy & positive about this project.  The source code is
>delightful to read and the internal abstractions are beautiful.
>
>*API consistency*
>
>The SQL, Scala, and Python APIs are generally consistent.  They all have a
>reverse function for example.
>
>Some of the new PRs have arguments against consistent rollout of functions
>across the APIs.  This seems like a break in the traditional Spark
>development process when functions were implemented in all APIs (except for
>functions that only make sense for certain APIs like createDataset and
>toDS).
>
>The default has shifted from consistent application of function across APIs
>to "case by case determination".
>
>*Examples*
>
>* The regexp_extract_all function was recently added to the SQL API.  It was
>then added to the Scala API, but then removed from the Scala API.
>
>* There is an ongoing discussion about whether CalendarType will be added to
>the Python API.
>
>*Arguments against adding functions like regexp_extract_all to the Scala
>API:*
>
>* Some of these functions are SQL specific and don't make sense for the
>other languages
>
>* Scala users can access the SQL functions via expr
>
>*Argument rebuttal*
>
>I don't understand the "some of the functions are SQL specific" argument.
>regexp_extract_all fills a gap in the API.  Users have been forced to use
>UDF workarounds for this in the past.  Users from all APIs need this
>solution.  
>
>Using expr isn't developer friendly.  Scala / Python users don't want to
>manipulate SQL strings.  Nesting functions in SQL strings is complicated. 
>The quoting and escaping is all different.  Figuring out how to invoke
>regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain -
>would need to figure out SQL quoting, SQL escaping, and how to access column
>names instead of a column object.
>
>Any of the org.apache.spark.sql.functions can be invoked via expr.  The core
>reason the Scala/Python APIs exist is so that developers don't need to
>manipulate strings for expr.
>
>regexp_extract_all should be added to the Scala API for the same reasons
>that regexp_extract was added to the Scala API.  
>
>*Next steps*
>
>* I'd like to better understand why we've broken from the traditional Spark
>development process of "consistently implementing functions across all APIs"
>to "selectively implementing functions in certain APIs"
>
>* Hopefully shift the burden of proof to those in favor of inconsistent
>application.  Consistent application should be the default.  
>
>Thank you all for your excellent work on this project.
>
>- Matthew Powers (GitHub: MrPowers)
>
>
>
>--
>Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>-
>To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Kent Yao

Hi, MrPowers,

I'm also interested in this idea. I started
https://github.com/yaooqinn/spark-func-extras a few months ago.

On 2021/01/30 15:45:30, Matthew Powers  wrote:
> Maciej - I like the idea of a separate library to provide easy access to
> functions that the maintainers don't want to merge into Spark core.

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Matthew Powers
Maciej - I like the idea of a separate library to provide easy access to
functions that the maintainers don't want to merge into Spark core.

I've seen this model work well in other open source communities.  The Rails
Active Support library provides the Ruby community with core functionality
like beginning_of_month.  The Ruby community has a good, well-supported
function, but it's not in the Ruby codebase so it's not a maintenance
burden - best of both worlds.

I'll start a proof-of-concept repo.  If the repo gets popular, I'll be
happy to donate it to a GitHub organization like Awesome Spark
 or the ASF.

On Sat, Jan 30, 2021 at 9:35 AM Maciej  wrote:

> Just thinking out loud ‒ if there is community need for providing language
> bindings for less popular SQL functions, could these live outside main
> project or even outside the ASF?  As long as expressions are already
> implemented, bindings are trivial after all.
>
> It could also allow usage of a more scalable hierarchy (let's say with
> modules / packages per function family).
>
> On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
>
> FYI exposing methods with Column signature only is already documented on
> the top of functions.scala, and I believe that has been the current dev
> direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used expressions.
> It's best if it considers language specific context. Many of the expressions
> are for SQL compliance. Many data science Python libraries don't support
> such features, as an example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers, 
> wrote:
>
>> Thanks for the thoughtful responses.  I now understand why adding all the
>> functions across all the APIs isn't the default.
>>
>> To Nick's point, relying on heuristics to gauge user interest, in
>> addition to personal experience, is a good idea.  The regexp_extract_all
>> SO thread has 16,000 views,
>> so I say we set the threshold to 10k, haha, just kidding!  Like Sean
>> mentioned, we don't want to add niche functions.  Now we just need a way to
>> figure out what's niche!
>>
>> To Reynold's point on overloading Scala functions, I think we should start
>> trying to limit the number of overloaded functions.  Some functions have
>> the columnName and column object function signatures.  e.g.
>> approx_count_distinct(columnName: String, rsd: Double) and
>> approx_count_distinct(e: Column, rsd: Double).  We can just expose the
>> approx_count_distinct(e: Column, rsd: Double) variety going forward (not
>> suggesting any backwards incompatible changes, just saying we don't need
>> the columnName-type functions for new stuff).
>>
>> Other functions have one signature with the second object as a Scala
>> object and another signature with the second object as a column object,
>> e.g. date_add(start: Column, days: Column) and date_add(start: Column,
>> days: Int).  We can just expose the date_add(start: Column, days: Column)
>> variety because it's general purpose.  Let me know if you think that avoiding
>> Scala function overloading will help Reynold.
>>
>> Let's brainstorm Nick's idea of creating a framework that'd test Scala /
>> Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
>> great way to reduce the maintenance burden.  Reynold's regexp_extract code
>> from 5 years ago is largely still intact - getting the job done right the
>> first time is another great way to avoid maintenance!
>>
>> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:
>>
>>> There's another thing that's not mentioned … it's primarily a problem
>>> for Scala. Due to static typing, we need a very large number of function
>>> overloads for the Scala version of each function, whereas in SQL/Python
>>> they are just one. There's a limit on how many functions we can add, and it
>>> also makes it difficult to browse through the docs when there are a lot of
>>> functions.
>>>
>>>
>>>
>>> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>>>
 Just my two cents on R side.

 On 1/28/21 10:00 PM, Nicholas Chammas wrote:

 On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

 +1 to this, but I will add that Jira and Stack Overflow activity can
 sometimes give good signals about API gaps that are frustrating users. If
 there is an SO question with 30K views about how to do something that
 should have been easier, then that's an important signal about the API.

 For this specific case, I think there is a fine argument
 that regexp_extract_all should be added simply for consistency
 with regexp_extract.

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-30 Thread Maciej
Just thinking out loud ‒ if there is community need for providing
language bindings for less popular SQL functions, could these live
outside main project or even outside the ASF?  As long as expressions
are already implemented, bindings are trivial after all.

It could also allow usage of a more scalable hierarchy (let's say with
modules / packages per function family).
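
A minimal sketch of what such an external binding could look like, assuming
Spark 3.1's internal RegExpExtractAll expression (Catalyst internals, so this
is illustrative rather than a stable recipe):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.catalyst.expressions.{Literal, RegExpExtractAll}

    // A thin typed entry point over an expression that already ships with
    // Spark: all evaluation logic stays in Catalyst, the external library
    // only adds the language binding.
    def regexp_extract_all(e: Column, pattern: String, groupIdx: Int = 1): Column =
      new Column(RegExpExtractAll(e.expr, Literal(pattern), Literal(groupIdx)))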

On 1/29/21 5:01 AM, Hyukjin Kwon wrote:
> FYI exposing methods with Column signature only is already documented
> on the top of functions.scala, and I believe that has been the current
> dev direction if I am not mistaken.
>
> Another point is that we should rather expose commonly used
> expressions. It's best if it considers language specific context. Many
> of the expressions are for SQL compliance. Many data science Python
> libraries don't support such features, as an example.
>
>
>
> On Fri, 29 Jan 2021, 12:04 Matthew Powers,
> wrote:
>
> Thanks for the thoughtful responses.  I now understand why adding
> all the functions across all the APIs isn't the default.
>
> To Nick's point, relying on heuristics to gauge user interest, in
> addition to personal experience, is a good idea.  The
> regexp_extract_all SO thread has 16,000 views,
> so I say we set the threshold to 10k, haha, just kidding!  Like
> Sean mentioned, we don't want to add niche functions.  Now we just
> need a way to figure out what's niche!
>
> To Reynold's point on overloading Scala functions, I think we
> should start trying to limit the number of overloaded functions. 
> Some functions have the columnName and column object function
> signatures.  e.g. approx_count_distinct(columnName: String, rsd:
> Double) and approx_count_distinct(e: Column, rsd: Double).  We can
> just expose the approx_count_distinct(e: Column, rsd: Double)
> variety going forward (not suggesting any backwards incompatible
> changes, just saying we don't need the columnName-type functions
> for new stuff).
>
> Other functions have one signature with the second object as a
> Scala object and another signature with the second object as a
> column object, e.g. date_add(start: Column, days: Column) and
> date_add(start: Column, days: Int).  We can just expose the
> date_add(start: Column, days: Column) variety because it's general
> purpose.  Let me know if you think that avoiding Scala function
> overloading will help Reynold.
>
> Let's brainstorm Nick's idea of creating a framework that'd test
> Scala / Python / SQL / R implementations in one-fell-swoop.  Seems
> like that'd be a great way to reduce the maintenance burden. 
> Reynold's regexp_extract code from 5 years ago is largely still
> intact - getting the job done right the first time is another
> great way to avoid maintenance!
>
> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin wrote:
>
> There's another thing that's not mentioned … it's primarily a
> problem for Scala. Due to static typing, we need a very large
> number of function overloads for the Scala version of each
> function, whereas in SQL/Python they are just one. There's a
> limit on how many functions we can add, and it also makes it
> difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej wrote:
>
> Just my two cents on R side.
>
> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote:
>>
>> It isn't that regexp_extract_all (for example) is
>> useless outside SQL, just, where do you draw the
>> line? Supporting 10s of random SQL functions across 3
>> other languages has a cost, which has to be weighed
>> against benefit, which we can never measure well
>> except anecdotally: one or two people say "I want
>> this" in a sea of hundreds of thousands of users.
>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow
>> activity can sometimes give good signals about API gaps
>> that are frustrating users. If there is an SO question
>> with 30K views about how to do something that should have
>> been easier, then that's an important signal about the API.
>>
>> For this specific case, I think there is a fine
>> argument that regexp_extract_all should be added
>> simply for consistency with regexp_extract. I can
>> also see the argument that regexp_extract was a step too far, but, what's
>> public is now a public API.

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Hyukjin Kwon
FYI exposing methods with Column signature only is already documented on
the top of functions.scala, and I believe that has been the current dev
direction if I am not mistaken.

Another point is that we should rather expose commonly used expressions.
It's best if it considers language specific context. Many of the expressions
are for SQL compliance. Many data science Python libraries don't support such
features, as an example.



On Fri, 29 Jan 2021, 12:04 Matthew Powers, 
wrote:

> Thanks for the thoughtful responses.  I now understand why adding all the
> functions across all the APIs isn't the default.
>
> To Nick's point, relying on heuristics to gauge user interest, in
> addition to personal experience, is a good idea.  The regexp_extract_all
> SO thread has 16,000 views,
> so I say we set the threshold to 10k, haha, just kidding!  Like Sean
> mentioned, we don't want to add niche functions.  Now we just need a way to
> figure out what's niche!
>
> To Reynold's point on overloading Scala functions, I think we should start
> trying to limit the number of overloaded functions.  Some functions have
> the columnName and column object function signatures.  e.g.
> approx_count_distinct(columnName: String, rsd: Double) and
> approx_count_distinct(e: Column, rsd: Double).  We can just expose the
> approx_count_distinct(e: Column, rsd: Double) variety going forward (not
> suggesting any backwards incompatible changes, just saying we don't need
> the columnName-type functions for new stuff).
>
> Other functions have one signature with the second object as a Scala
> object and another signature with the second object as a column object,
> e.g. date_add(start: Column, days: Column) and date_add(start: Column,
> days: Int).  We can just expose the date_add(start: Column, days: Column)
> variety because it's general purpose.  Let me know if you think that avoiding
> Scala function overloading will help Reynold.
>
> Let's brainstorm Nick's idea of creating a framework that'd test Scala /
> Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
> great way to reduce the maintenance burden.  Reynold's regexp_extract code
> from 5 years ago is largely still intact - getting the job done right the
> first time is another great way to avoid maintenance!
>
> On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:
>
>> There's another thing that's not mentioned … it's primarily a problem for
>> Scala. Due to static typing, we need a very large number of function
>> overloads for the Scala version of each function, whereas in SQL/Python
>> they are just one. There's a limit on how many functions we can add, and it
>> also makes it difficult to browse through the docs when there are a lot of
>> functions.
>>
>>
>>
>> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>>
>>> Just my two cents on R side.
>>>
>>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>>
>>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>>>
 It isn't that regexp_extract_all (for example) is useless outside SQL,
 just, where do you draw the line? Supporting 10s of random SQL functions
 across 3 other languages has a cost, which has to be weighed against
 benefit, which we can never measure well except anecdotally: one or two
 people say "I want this" in a sea of hundreds of thousands of users.

>>>
>>> +1 to this, but I will add that Jira and Stack Overflow activity can
>>> sometimes give good signals about API gaps that are frustrating users. If
>>> there is an SO question with 30K views about how to do something that
>>> should have been easier, then that's an important signal about the API.
>>>
>>> For this specific case, I think there is a fine argument
 that regexp_extract_all should be added simply for consistency
 with regexp_extract. I can also see the argument that regexp_extract was a
 step too far, but, what's public is now a public API.

>>>
>>> I think in this case a few references to where/how people are having to
>>> work around missing a direct function for regexp_extract_all could help
>>> guide the decision. But that itself means we are making these decisions on
>>> a case-by-case basis.
>>>
>>> From a user perspective, it's definitely conceptually simpler to have
>>> SQL functions be consistent and available across all APIs.
>>>
>>> Perhaps if we had a way to lower the maintenance burden of keeping
>>> functions in sync across SQL/Scala/Python/R, it would be easier for
>>> everyone to agree to just have all the functions be included across the
>>> board all the time.
>>>
>>> Python aligns quite well with Scala so that might be fine, but R is a
>>> bit tricky thing. Especially lack of proper namespaces makes it rather
>>> risky to have packages that export hundreds of functions. sparklyr handles
>>> this neatly with NSE, but I don't think we're going to go this way.
>>>
>>>
>>> Would, for example, some sort of automatic testing mechanism for SQL
>>> functions help here?

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Matthew Powers
Thanks for the thoughtful responses.  I now understand why adding all the
functions across all the APIs isn't the default.

To Nick's point, relying on heuristics to gauge user interest, in
addition to personal experience, is a good idea.  The regexp_extract_all SO
thread has 16,000 views,
so I say we set the threshold to 10k, haha, just kidding!  Like Sean
mentioned, we don't want to add niche functions.  Now we just need a way to
figure out what's niche!

To Reynold's point on overloading Scala functions, I think we should start
trying to limit the number of overloaded functions.  Some functions have
the columnName and column object function signatures.  e.g.
approx_count_distinct(columnName: String, rsd: Double) and
approx_count_distinct(e: Column, rsd: Double).  We can just expose the
approx_count_distinct(e: Column, rsd: Double) variety going forward (not
suggesting any backwards incompatible changes, just saying we don't need
the columnName-type functions for new stuff).

Other functions have one signature with the second object as a Scala object
and another signature with the second object as a column object, e.g.
date_add(start: Column, days: Column) and date_add(start: Column, days:
Int).  We can just expose the date_add(start: Column, days: Column) variety
because it's general purpose.  Let me know if you think that avoiding Scala
function overloading will help Reynold.
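
To make the overload cost concrete, the forwarding pattern looks roughly like
this (names invented here to stay outside functions.scala; the real overloads
live there):

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{approx_count_distinct, date_add, lit}

    // A columnName convenience overload just forwards to the Column variant,
    // doubling the Scala surface for a single function:
    def approxCountDistinctByName(columnName: String, rsd: Double): Column =
      approx_count_distinct(Column(columnName), rsd)

    // Likewise an Int variant of date_add only wraps the general
    // Column-based one:
    def dateAddDays(start: Column, days: Int): Column = date_add(start, lit(days))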

Let's brainstorm Nick's idea of creating a framework that'd test Scala /
Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a
great way to reduce the maintenance burden.  Reynold's regexp_extract code
from 5 years ago is largely still intact - getting the job done right the
first time is another great way to avoid maintenance!

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin  wrote:

> There's another thing that's not mentioned … it's primarily a problem for
> Scala. Due to static typing, we need a very large number of function
> overloads for the Scala version of each function, whereas in SQL/Python
> they are just one. There's a limit on how many functions we can add, and it
> also makes it difficult to browse through the docs when there are a lot of
> functions.
>
>
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:
>
>> Just my two cents on R side.
>>
>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>>
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>>
>>
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>>
>> For this specific case, I think there is a fine argument
>>> that regexp_extract_all should be added simply for consistency
>>> with regexp_extract. I can also see the argument that regexp_extract was a
>>> step too far, but, what's public is now a public API.
>>>
>>
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>>
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>>
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>>
>> Python aligns quite well with Scala so that might be fine, but R is a bit
>> tricky thing. Especially lack of proper namespaces makes it rather risky to
>> have packages that export hundreds of functions. sparklyr handles this
>> neatly with NSE, but I don't think we're going to go this way.
>>
>>
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test SQL, Scala, Python, and R functions,
>> without requiring maintainers to write tests for each language's version of
>> the functions. Would that address the maintenance burden?
>>
>> With R we don't really test most of the functions beyond the simple
>> "callability". Only the complex ones, which require some non-trivial
>> transformations of arguments, are fully tested.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>

Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Reynold Xin
There's another thing that's not mentioned … it's primarily a problem for 
Scala. Due to static typing, we need a very large number of function overloads 
for the Scala version of each function, whereas in SQL/Python they are just 
one. There's a limit on how many functions we can add, and it also makes it 
difficult to browse through the docs when there are a lot of functions.

On Thu, Jan 28, 2021 at 1:09 PM, Maciej  wrote:

> 
> Just my two cents on R side.
> 
> 
> 
> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> 
> 
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:
>> 
>> 
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>> 
>> 
>> 
>> 
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>> 
>> 
>> 
>>> For this specific case, I think there is a fine argument that
>>> regexp_extract_all should be added simply for consistency with
>>> regexp_extract. I can also see the argument that regexp_extract was a step
>>> too far, but, what's public is now a public API.
>>> 
>> 
>> 
>> 
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>> 
>> 
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>> 
>> 
>> 
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>> 
> 
> 
> 
> Python aligns quite well with Scala so that might be fine, but R is a bit
> tricky thing. Especially lack of proper namespaces makes it rather risky
> to have packages that export hundreds of functions. sparklyr handles this
> neatly with NSE, but I don't think we're going to go this way.
> 
> 
> 
>> 
>> 
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test SQL, Scala, Python, and R functions,
>> without requiring maintainers to write tests for each language's version
>> of the functions. Would that address the maintenance burden?
>> 
> 
> 
> 
> With R we don't really test most of the functions beyond the simple
> "callability". Only the complex ones, which require some non-trivial
> transformations of arguments, are fully tested.
> 
> 
> -- 
> Best regards,
> Maciej Szymkiewicz
> 
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>



Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Maciej
Just my two cents on R side.

On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen wrote:
>
> It isn't that regexp_extract_all (for example) is useless outside
> SQL, just, where do you draw the line? Supporting 10s of random
> SQL functions across 3 other languages has a cost, which has to be
> weighed against benefit, which we can never measure well except
> anecdotally: one or two people say "I want this" in a sea of
> hundreds of thousands of users.
>
>
> +1 to this, but I will add that Jira and Stack Overflow activity can
> sometimes give good signals about API gaps that are frustrating users.
> If there is an SO question with 30K views about how to do something
> that should have been easier, then that's an important signal about
> the API.
>
> For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument
> that regexp_extract was a step too far, but, what's public is now
> a public API.
>
>
> I think in this case a few references to where/how people are having
> to work around missing a direct function for regexp_extract_all could
> help guide the decision. But that itself means we are making these
> decisions on a case-by-case basis.
>
> From a user perspective, it's definitely conceptually simpler to have
> SQL functions be consistent and available across all APIs.
>
> Perhaps if we had a way to lower the maintenance burden of keeping
> functions in sync across SQL/Scala/Python/R, it would be easier for
> everyone to agree to just have all the functions be included across
> the board all the time.

Python aligns quite well with Scala so that might be fine, but R is a
bit tricky thing. Especially lack of proper namespaces makes it rather
risky to have packages that export hundreds of functions. sparklyr
handles this neatly with NSE, but I don't think we're going to go this way.

>
> Would, for example, some sort of automatic testing mechanism for SQL
> functions help here? Something that uses a common function testing
> specification to automatically test SQL, Scala, Python, and R
> functions, without requiring maintainers to write tests for each
> language's version of the functions. Would that address the
> maintenance burden?

With R we don't really test most of the functions beyond the simple
"callability". Only the complex ones, which require some non-trivial
transformations of arguments, are fully tested.

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC





Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Nicholas Chammas
On Thu, Jan 28, 2021 at 3:40 PM Sean Owen  wrote:

> It isn't that regexp_extract_all (for example) is useless outside SQL,
> just, where do you draw the line? Supporting 10s of random SQL functions
> across 3 other languages has a cost, which has to be weighed against
> benefit, which we can never measure well except anecdotally: one or two
> people say "I want this" in a sea of hundreds of thousands of users.
>

+1 to this, but I will add that Jira and Stack Overflow activity can
sometimes give good signals about API gaps that are frustrating users. If
there is an SO question with 30K views about how to do something that
should have been easier, then that's an important signal about the API.

For this specific case, I think there is a fine argument
> that regexp_extract_all should be added simply for consistency
> with regexp_extract. I can also see the argument that regexp_extract was a
> step too far, but, what's public is now a public API.
>

I think in this case a few references to where/how people are having to
work around missing a direct function for regexp_extract_all could help
guide the decision. But that itself means we are making these decisions on
a case-by-case basis.

From a user perspective, it's definitely conceptually simpler to have SQL
functions be consistent and available across all APIs.

Perhaps if we had a way to lower the maintenance burden of keeping
functions in sync across SQL/Scala/Python/R, it would be easier for
everyone to agree to just have all the functions be included across the
board all the time.

Would, for example, some sort of automatic testing mechanism for SQL
functions help here? Something that uses a common function testing
specification to automatically test SQL, Scala, Python, and R functions,
without requiring maintainers to write tests for each language's version of
the functions. Would that address the maintenance burden?
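
A minimal sketch of that idea, assuming Spark 3.1+ (for regexp_extract_all)
and treating the spec format itself as entirely hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("func-spec-check")
      .master("local[*]")
      .getOrCreate()

    // One shared table of SQL expressions with expected results; each
    // language binding would be checked against the same entries instead of
    // hand-written per-language tests.
    case class FuncSpec(sqlExpr: String, expected: Any)

    val specs = Seq(
      FuncSpec("reverse('Spark')", "krapS"),
      FuncSpec("regexp_extract_all('100-200, 300-400', '(\\\\d+)-(\\\\d+)', 1)",
               Seq("100", "300"))
    )

    specs.foreach { spec =>
      val actual = spark.sql(s"SELECT ${spec.sqlExpr}").head().get(0)
      assert(actual == spec.expected, s"${spec.sqlExpr} returned $actual")
    }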


Re: [Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread Sean Owen
I think I can articulate the general idea here, though I expect it is not
deployed consistently.

Yes there's a general desire to make APIs consistent across languages.
Python and Scala should track pretty closely, even if R isn't really that
consistent.

SQL is a somewhat different case. There are functions that aren't _that_
useful in general, kind of niche, but nevertheless exist in other SQL
systems, most notably Hive. It's useful to try to expand SQL support to
cover those to ease migration and interoperability. But it may not make
enough sense to maintain those functions in Scala, and Python, and R,
because they're niche.

I think that was what you saw with regexp_extract_all. As you can see there
isn't perfect agreement on where to draw those lines. But I think the
theory has been mostly consistent over time, if not the execution.

It isn't that regexp_extract_all (for example) is useless outside SQL,
just, where do you draw the line? Supporting 10s of random SQL functions
across 3 other languages has a cost, which has to be weighed against
benefit, which we can never measure well except anecdotally: one or two
people say "I want this" in a sea of hundreds of thousands of users.

(I'm not sure about CalendarType - I just know that date/time types are
hard even within, say, the JVM, let alone across languages)

For this specific case, I think there is a fine argument
that regexp_extract_all should be added simply for consistency
with regexp_extract. I can also see the argument that regexp_extract was a
step too far, but, what's public is now a public API.

I'll also say that the cost of adding API functions grows as a project
matures, and whereas it might have made sense to add this at an earlier
time, it might not make sense now.

I come out neutral on this specific case, but would not override the
opinion of other committers. But I hope that explains the logic that I
think underpins what you're hearing.




On Thu, Jan 28, 2021 at 2:23 PM MrPowers 
wrote:

> Thank you all for your amazing work on this project.  Spark has a great
> public interface and the source code is clean.  The core team has done a
> great job building and maintaining this project.  My emails / GitHub
> comments focus on the 1% that we might be able to improve.
>
> Pull requests / suggestions for improvements can come across as negative,
> but I'm nothing but happy & positive about this project.  The source code
> is
> delightful to read and the internal abstractions are beautiful.
>
> *API consistency*
>
> The SQL, Scala, and Python APIs are generally consistent.  They all have a
> reverse function for example.
>
> Some of the new PRs have arguments against consistent rollout of functions
> across the APIs.  This seems like a break in the traditional Spark
> development process when functions were implemented in all APIs (except for
> functions that only make sense for certain APIs like createDataset and
> toDS).
>
> The default has shifted from consistent application of function across APIs
> to "case by case determination".
>
> *Examples*
>
> * The regexp_extract_all function was recently added to the SQL API.  It
> was then added to the Scala API, but then removed from the Scala API.
>
> * There is an ongoing discussion about whether CalendarType will be added
> to the Python API.
>
> *Arguments against adding functions like regexp_extract_all to the Scala
> API:*
>
> * Some of these functions are SQL specific and don't make sense for the
> other languages
>
> * Scala users can access the SQL functions via expr
>
> *Argument rebuttal*
>
> I don't understand the "some of the functions are SQL specific" argument.
> regexp_extract_all fills a gap in the API.  Users have been forced to use
> UDF workarounds for this in the past.  Users from all APIs need this
> solution.
>
> Using expr isn't developer friendly.  Scala / Python users don't want to
> manipulate SQL strings.  Nesting functions in SQL strings is complicated.
> The quoting and escaping is all different.  Figuring out how to invoke
> regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain -
> would need to figure out SQL quoting, SQL escaping, and how to access
> column
> names instead of a column object.
>
> Any of the org.apache.spark.sql.functions can be invoked via expr.  The
> core
> reason the Scala/Python APIs exist is so that developers don't need to
> manipulate strings for expr.
>
> regexp_extract_all should be added to the Scala API for the same reasons
> that regexp_extract was added to the Scala API.
>
> *Next steps*
>
> * I'd like to better understand why we've broken from the traditional Spark
> development process of "consistently implementing functions across all
> APIs"
> to "selectively implementing functions in certain APIs"
>
> * Hopefully shift the burden of proof to those in favor of inconsistent
> application.  Consistent application should be the default.

[Spark SQL]: SQL, Python, Scala and R API Consistency

2021-01-28 Thread MrPowers
Thank you all for your amazing work on this project.  Spark has a great
public interface and the source code is clean.  The core team has done a
great job building and maintaining this project.  My emails / GitHub
comments focus on the 1% that we might be able to improve.

Pull requests / suggestions for improvements can come across as negative,
but I'm nothing but happy & positive about this project.  The source code is
delightful to read and the internal abstractions are beautiful.

*API consistency*

The SQL, Scala, and Python APIs are generally consistent.  They all have a
reverse function for example.

Some of the new PRs have arguments against consistent rollout of functions
across the APIs.  This seems like a break in the traditional Spark
development process when functions were implemented in all APIs (except for
functions that only make sense for certain APIs like createDataset and
toDS).

The default has shifted from consistent application of function across APIs
to "case by case determination".

*Examples*

* The regexp_extract_all function was recently added to the SQL API.  It was
then added to the Scala API, but then removed from the Scala API.

* There is an ongoing discussion about whether CalendarType will be added to
the Python API.

*Arguments against adding functions like regexp_extract_all to the Scala
API:*

* Some of these functions are SQL specific and don't make sense for the
other languages

* Scala users can access the SQL functions via expr

*Argument rebuttal*

I don't understand the "some of the functions are SQL specific" argument.
regexp_extract_all fills a gap in the API.  Users have been forced to use
UDF workarounds for this in the past.  Users from all APIs need this
solution.  

Using expr isn't developer friendly.  Scala / Python users don't want to
manipulate SQL strings.  Nesting functions in SQL strings is complicated. 
The quoting and escaping is all different.  Figuring out how to invoke
regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain -
would need to figure out SQL quoting, SQL escaping, and how to access column
names instead of a column object.

Any of the org.apache.spark.sql.functions can be invoked via expr.  The core
reason the Scala/Python APIs exist is so that developers don't need to
manipulate strings for expr.

regexp_extract_all should be added to the Scala API for the same reasons
that regexp_extract was added to the Scala API.  

*Next steps*

* I'd like to better understand why we've broken from the traditional Spark
development process of "consistently implementing functions across all APIs"
to "selectively implementing functions in certain APIs"

* Hopefully shift the burden of proof to those in favor of inconsistent
application.  Consistent application should be the default.  

Thank you all for your excellent work on this project.

- Matthew Powers (GitHub: MrPowers)



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org