Hi, MrPowers
I'm also interested in this idea.

On 2021/01/30 15:45:30, Matthew Powers <m...@gmail.com> wrote:
Maciej - I like the idea of a separate library to provide easy access to>
functions that the maintainers don't want to merge into Spark core.>

I've seen this model work well in other open source communities.  The Rails>
Active Support library provides the Ruby community with core functionality>
like beginning_of_month.  The Ruby community has a good, well-supported>
function, but it's not in the Ruby codebase so it's not a maintenance>
burden - best of both worlds.>

I'll start a proof-of-concept repo.  If the repo gets popular, I'll be>
happy to donate it to a GitHub organization like Awesome Spark>
<https://github.com/awesome-spark> or the ASF.>

On Sat, Jan 30, 2021 at 9:35 AM Maciej <ms...@gmail.com> wrote:>

Just thinking out loud ‒ if there is a community need for providing language>
bindings for less popular SQL functions, could these live outside the main>
project or even outside the ASF?  As long as expressions are already>
implemented, bindings are trivial after all.>

It could also allow usage of a more scalable hierarchy (let's say with>
modules / packages per function family).>
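To illustrate the point above that bindings are thin once the expression exists, here is a minimal sketch. It assumes a helper named `sql_function_call`, which is hypothetical and not part of Spark; it only renders the SQL text and does no quoting or escaping. In PySpark, a string like this could be handed to `pyspark.sql.functions.expr`, which is the usual workaround when a SQL function has no dedicated wrapper.

```python
def sql_function_call(name, *args):
    """Render a SQL function call from already-valid SQL fragments.

    Hypothetical helper for illustration only: callers must pass column
    names and literals as SQL text, because nothing is quoted or escaped.
    """
    if not name.isidentifier():
        raise ValueError(f"not a valid function name: {name!r}")
    return f"{name}({', '.join(str(arg) for arg in args)})"


# Column `text`, a quoted regex literal, and the capture-group index.
expression = sql_function_call("regexp_extract_all", "text", "'([a-z]+)'", 1)
print(expression)
```

With Spark available, something like `df.select(F.expr(expression))` would evaluate it, assuming `F` is `pyspark.sql.functions`, `df` has a `text` column, and the Spark version in use implements `regexp_extract_all`.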

On 1/29/21 5:01 AM, Hyukjin Kwon wrote:>

FYI exposing methods with Column signature only is already documented on>
the top of functions.scala, and I believe that has been the current dev>
direction if I am not mistaken.>

Another point is that we should rather expose commonly used expressions.>
It's best to take language-specific context into account. Many expressions>
exist for SQL compliance, and many data science Python libraries don't>
support such features, for example.>



On Fri, 29 Jan 2021, 12:04 Matthew Powers, <ma...@gmail.com>>
wrote:>

Thanks for the thoughtful responses.  I now understand why adding all the>
functions across all the APIs isn't the default.>

To Nick's point, relying on heuristics to gauge user interest, in>
addition to personal experience, is a good idea.  The regexp_extract_all>
SO thread has 16,000 views>
<https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,>
so I say we set the threshold to 10k, haha, just kidding!  Like Sean>
mentioned, we don't want to add niche functions.  Now we just need a way to>
figure out what's niche!>

To Reynold's point on overloading Scala functions, I think we should start>
trying to limit the number of overloaded functions.  Some functions have>
the columnName and column object function signatures.  e.g.>
approx_count_distinct(columnName: String, rsd: Double) and>
approx_count_distinct(e: Column, rsd: Double).  We can just expose the>
approx_count_distinct(e: Column, rsd: Double) variety going forward (not>
suggesting any backwards incompatible changes, just saying we don't need>
the columnName-type functions for new stuff).>

Other functions have one signature with the second object as a Scala>
object and another signature with the second object as a column object,>
e.g. date_add(start: Column, days: Column) and date_add(start: Column,>
days: Int).  We can just expose the date_add(start: Column, days: Column)>
variety because it's general purpose.  Let me know if you think that avoiding>
Scala function overloading will help Reynold.>
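A rough sketch of why the Column-only signature can stand in for the overloads, using pure-Python stand-ins rather than Spark itself. The `Column` class here and the `col` / `lit` helpers are simplified assumptions that only track SQL text; they borrow the names of Spark's helpers but are not the real implementations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Stand-in for Spark's Column: it only holds the SQL text it represents."""
    sql: str


def col(name):
    """Stand-in for referring to a column by name."""
    return Column(name)


def lit(value):
    """Stand-in for wrapping a plain value as a literal column."""
    return Column(repr(value))


def date_add(start: Column, days: Column) -> Column:
    """Single Column-only signature, like date_add(start: Column, days: Column)."""
    return Column(f"date_add({start.sql}, {days.sql})")


# The caller covers the other variants by wrapping names and literals.
fixed = date_add(col("start_date"), lit(7))
dynamic = date_add(col("start_date"), col("offset_days"))
print(fixed.sql)
print(dynamic.sql)
```

The trade-off is that the caller writes `lit(7)` or `col("start_date")` explicitly, in exchange for one signature per function instead of several.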

Let's brainstorm Nick's idea of creating a framework that'd test Scala />
Python / SQL / R implementations in one-fell-swoop.  Seems like that'd be a>
great way to reduce the maintenance burden.  Reynold's regexp_extract code>
from 5 years ago is largely still intact - getting the job done right the>
first time is another great way to avoid maintenance!>

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin <rx...@databricks.com> wrote:>

There's another thing that's not mentioned … it's primarily a problem>
for Scala. Due to static typing, we need a very large number of function>
overloads for the Scala version of each function, whereas in SQL/Python>
they are just one. There's a limit on how many functions we can add, and it>
also makes it difficult to browse through the docs when there are a lot of>
functions.>



On Thu, Jan 28, 2021 at 1:09 PM, Maciej <ms...@gmail.com> wrote:>

Just my two cents on the R side.>

On 1/28/21 10:00 PM, Nicholas Chammas wrote:>

On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <sr...@gmail.com> wrote:>

It isn't that regexp_extract_all (for example) is useless outside SQL,>
just, where do you draw the line? Supporting 10s of random SQL functions>
across 3 other languages has a cost, which has to be weighed against>
benefit, which we can never measure well except anecdotally: one or two>
people say "I want this" in a sea of hundreds of thousands of users.>


+1 to this, but I will add that Jira and Stack Overflow activity can>
sometimes give good signals about API gaps that are frustrating users. If>
there is an SO question with 30K views about how to do something that>
should have been easier, then that's an important signal about the API.>

For this specific case, I think there is a fine argument>
that regexp_extract_all should be added simply for consistency>
with regexp_extract. I can also see the argument that regexp_extract was a>
step too far, but, what's public is now a public API.>


I think in this case a few references to where/how people are having to>
work around missing a direct function for regexp_extract_all could help>
guide the decision. But that itself means we are making these decisions on>
a case-by-case basis.>

From a user perspective, it's definitely conceptually simpler to have>
SQL functions be consistent and available across all APIs.>

Perhaps if we had a way to lower the maintenance burden of keeping>
functions in sync across SQL/Scala/Python/R, it would be easier for>
everyone to agree to just have all the functions be included across the>
board all the time.>

Python aligns quite well with Scala so that might be fine, but R is a>
bit tricky. In particular, the lack of proper namespaces makes it rather>
risky to have packages that export hundreds of functions. sparklyr handles>
this neatly with NSE, but I don't think we're going to go this way.>


Would, for example, some sort of automatic testing mechanism for SQL>
functions help here? Something that uses a common function testing>
specification to automatically test SQL, Scala, Python, and R functions,>
without requiring maintainers to write tests for each language's version of>
the functions. Would that address the maintenance burden?>
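One way to picture such a common testing specification is sketched below. This is only an illustration under stated assumptions: `SPEC`, `BINDINGS`, and `run_spec` are hypothetical names, and the single binding is a local Python stand-in built on `re`. A real harness would have each entry call Spark through the Scala, Python, R, or SQL API.

```python
import re

# One shared specification: function name, arguments, expected result.
SPEC = [
    ("regexp_extract_all", ("100-200, 300-400", r"(\d+)-(\d+)", 1), ["100", "300"]),
    ("regexp_extract_all", ("no digits here", r"(\d+)-(\d+)", 1), []),
]


def py_regexp_extract_all(text, pattern, idx):
    """Stand-in binding: return capture group `idx` of every match."""
    return [match.group(idx) for match in re.finditer(pattern, text)]


# Hypothetical registry of per-language bindings; only a stand-in is wired up.
BINDINGS = {
    "python-stand-in": {"regexp_extract_all": py_regexp_extract_all},
}


def run_spec(spec, bindings):
    """Run every case against every binding and collect the failures."""
    failures = []
    for language, functions in bindings.items():
        for name, args, expected in spec:
            func = functions.get(name)
            if func is None:
                # A missing binding is reported as an API gap, not an error.
                failures.append((language, name, args, "missing binding"))
                continue
            actual = func(*args)
            if actual != expected:
                failures.append((language, name, args, actual))
    return failures


failures = run_spec(SPEC, BINDINGS)
print(failures)
```

Because the cases live in one place, adding a function means adding spec rows once, and a language that lacks the binding shows up as a reported gap instead of silently going untested.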

With R we don't really test most of the functions beyond simple>
"callability". Only the complex ones, which require some non-trivial>
transformations of arguments, are fully tested.>

-->
Best regards,>
Maciej Szymkiewicz>

Web: https://zero323.net>
Keybase: https://keybase.io/zero323>
Gigs: https://www.codementor.io/@zero323>
PGP: A30CEF0C31A501EC>







Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from various modern database management systems to Apache Spark.



