Hi, MrPowers
I'm also interested in this idea.
I started https://github.com/yaooqinn/spark-func-extras a few months ago.

Maciej - I like the idea of a separate library to provide easy access to
functions that the maintainers don't want to merge into Spark core.

I've seen this model work well in other open source communities. The Rails
Active Support library provides the Ruby community with core functionality
like beginning_of_month. The Ruby community has a good, well-supported
function, but it's not in the Ruby codebase so it's not a maintenance
burden - best of both worlds.

I'll start a proof-of-concept repo. If the repo gets popular, I'll be
happy to donate it to a GitHub organization like Awesome Spark
<https://github.com/awesome-spark> or the ASF.

On Sat, Jan 30, 2021 at 9:35 AM Maciej <ms...@gmail.com> wrote:

Just thinking out loud - if there is a community need for providing language
bindings for less popular SQL functions, could these live outside the main
project or even outside the ASF? As long as expressions are already
implemented, bindings are trivial after all.

It could also allow the use of a more scalable hierarchy (let's say with
modules / packages per function family).
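
As a rough illustration of why a binding is thin once the SQL expression already exists, here is a minimal Python sketch. It only renders the SQL text for a call to regexp_extract_all. The helper names (sql_literal, quoted_column, regexp_extract_all_sql) are hypothetical and not part of any Spark API, and the escaping assumes Spark's default SQL parser behaviour, where backslashes inside string literals are unescaped.

```python
def sql_literal(value):
    """Render a Python int or str as Spark SQL literal text (hypothetical helper)."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it to avoid surprises.
        raise TypeError("bool literals are not handled in this sketch")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Assumption: default parser settings, so backslashes and single
        # quotes must be escaped with a backslash inside the literal.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    raise TypeError("unsupported literal type: " + type(value).__name__)


def quoted_column(name):
    """Wrap a column name in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def regexp_extract_all_sql(column, pattern, idx=1):
    """Build the SQL text for regexp_extract_all(column, pattern, idx)."""
    args = [quoted_column(column), sql_literal(pattern), sql_literal(idx)]
    return "regexp_extract_all(" + ", ".join(args) + ")"


rendered = regexp_extract_all_sql("text", r"(\d+)-(\d+)", 1)
print(rendered)
```

With PySpark available, the rendered text could be handed to pyspark.sql.functions.expr to obtain a Column, provided the running Spark version implements the SQL function. A per-family module of such wrappers is one way the "modules / packages per function family" layout could look.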

On 1/29/21 5:01 AM, Hyukjin Kwon wrote:

FYI, exposing methods with a Column signature only is already documented at
the top of functions.scala, and I believe that has been the current dev
direction, if I am not mistaken.

Another point is that we should rather expose commonly used expressions.
It's best if that takes language-specific context into account. Many of the
expressions exist for SQL compliance. Many Python data science libraries,
for example, don't support such features.

On Fri, 29 Jan 2021, 12:04 Matthew Powers, <ma...@gmail.com> wrote:

Thanks for the thoughtful responses. I now understand why adding all the
functions across all the APIs isn't the default.

To Nick's point, relying on heuristics to gauge user interest, in
addition to personal experience, is a good idea. The regexp_extract_all
SO thread has 16,000 views
<https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>,
so I say we set the threshold to 10k, haha, just kidding! Like Sean
mentioned, we don't want to add niche functions. Now we just need a way to
figure out what's niche!

To Reynold's point on overloading Scala functions, I think we should start
trying to limit the number of overloaded functions. Some functions have
both a columnName and a Column object signature, e.g.
approx_count_distinct(columnName: String, rsd: Double) and
approx_count_distinct(e: Column, rsd: Double). We can just expose the
approx_count_distinct(e: Column, rsd: Double) variety going forward (not
suggesting any backwards incompatible changes, just saying we don't need
the columnName-type functions for new stuff).

Other functions have one signature with the second argument as a Scala
object and another signature with the second argument as a Column object,
e.g. date_add(start: Column, days: Column) and date_add(start: Column,
days: Int). We can just expose the date_add(start: Column, days: Column)
variety because it's general purpose. Let me know if you think that avoiding
Scala function overloading will help, Reynold.

Let's brainstorm Nick's idea of creating a framework that'd test Scala /
Python / SQL / R implementations in one fell swoop. Seems like that'd be a
great way to reduce the maintenance burden. Reynold's regexp_extract code
from 5 years ago is largely still intact - getting the job done right the
first time is another great way to avoid maintenance!

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin <rx...@databricks.com> wrote:

There's another thing that's not mentioned … it's primarily a problem
for Scala. Due to static typing, we need a very large number of function
overloads for the Scala version of each function, whereas in SQL/Python
they are just one. There's a limit on how many functions we can add, and it
also makes it difficult to browse through the docs when there are a lot of
functions.
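
To make the overload arithmetic concrete, here is a small Python sketch that enumerates the Scala-style signatures needed when each parameter accepts more than one static type. The helper overload_signatures is hypothetical, and it assumes the worst case where every combination of accepted types is exposed, which Spark does not necessarily do for every function.

```python
from itertools import product


def overload_signatures(name, params):
    """List one signature per combination of accepted parameter types.

    params is a list of (parameter_name, [accepted_type, ...]) pairs.
    """
    names = [param_name for param_name, _ in params]
    type_choices = [types for _, types in params]
    signatures = []
    for combo in product(*type_choices):
        rendered = ", ".join(n + ": " + t for n, t in zip(names, combo))
        signatures.append(name + "(" + rendered + ")")
    return signatures


# The date_add pair mentioned earlier in the thread.
date_add_sigs = overload_signatures(
    "date_add", [("start", ["Column"]), ("days", ["Column", "Int"])]
)
for sig in date_add_sigs:
    print(sig)

# A made-up three-argument function where every argument has two forms.
three_arg_sigs = overload_signatures(
    "some_func",
    [("a", ["Column", "String"]), ("b", ["Column", "Int"]), ("c", ["Column", "Double"])],
)
print(len(three_arg_sigs))
```

The count is the product of the number of accepted types per parameter, so it doubles with every additional two-form argument, while a SQL or Python binding needs only a single entry point.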

On Thu, Jan 28, 2021 at 1:09 PM, Maciej <ms...@gmail.com> wrote:

Just my two cents on the R side.

On 1/28/21 10:00 PM, Nicholas Chammas wrote:
> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <sr...@gmail.com> wrote:
>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>> just, where do you draw the line? Supporting 10s of random SQL functions
>> across 3 other languages has a cost, which has to be weighed against
>> benefit, which we can never measure well except anecdotally: one or two
>> people say "I want this" in a sea of hundreds of thousands of users.
>
> +1 to this, but I will add that Jira and Stack Overflow activity can
> sometimes give good signals about API gaps that are frustrating users. If
> there is an SO question with 30K views about how to do something that
> should have been easier, then that's an important signal about the API.
>
>> For this specific case, I think there is a fine argument
>> that regexp_extract_all should be added simply for consistency
>> with regexp_extract. I can also see the argument that regexp_extract was a
>> step too far, but, what's public is now a public API.
>
> I think in this case a few references to where/how people are having to
> work around missing a direct function for regexp_extract_all could help
> guide the decision. But that itself means we are making these decisions on
> a case-by-case basis.
>
> From a user perspective, it's definitely conceptually simpler to have
> SQL functions be consistent and available across all APIs.
>
> Perhaps if we had a way to lower the maintenance burden of keeping
> functions in sync across SQL/Scala/Python/R, it would be easier for
> everyone to agree to just have all the functions be included across the
> board all the time.

Python aligns quite well with Scala so that might be fine, but R is a
bit trickier. In particular, the lack of proper namespaces makes it rather
risky to have packages that export hundreds of functions. sparklyr handles
this neatly with NSE, but I don't think we're going to go this way.

> Would, for example, some sort of automatic testing mechanism for SQL
> functions help here? Something that uses a common function testing
> specification to automatically test SQL, Scala, Python, and R functions,
> without requiring maintainers to write tests for each language's version of
> the functions. Would that address the maintenance burden?

With R we don't really test most of the functions beyond simple
"callability". Only the complex ones, which require some non-trivial
transformations of arguments, are fully tested.
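
A common testing specification of the kind discussed above could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the spec layout, the run_spec helper, and the two "bindings", which are plain Python stand-ins rather than real Scala, Python, or R front ends talking to Spark.

```python
import re

# Language-neutral spec: each function name maps to input/expected cases.
SPEC = {
    "regexp_extract_all": [
        {"args": ["100-200, 300-400", r"(\d+)-(\d+)", 1], "expected": ["100", "300"]},
        {"args": ["no digits here", r"(\d+)", 1], "expected": []},
    ],
}


def extract_all_finditer(text, pattern, idx):
    """Stand-in binding A: collect group idx of every match via re.finditer."""
    return [match.group(idx) for match in re.finditer(pattern, text)]


def extract_all_search_loop(text, pattern, idx):
    """Stand-in binding B: same behaviour, written as an explicit search loop."""
    compiled = re.compile(pattern)
    results, position = [], 0
    while position <= len(text):
        match = compiled.search(text, position)
        if match is None:
            break
        results.append(match.group(idx))
        # Step past empty matches so the loop always advances.
        position = match.end() if match.end() > match.start() else match.end() + 1
    return results


def run_spec(spec, bindings):
    """Run every case against every binding and return failure messages."""
    failures = []
    for func_name, cases in spec.items():
        for binding_name, functions in bindings.items():
            implementation = functions.get(func_name)
            if implementation is None:
                failures.append(binding_name + ": " + func_name + " is missing")
                continue
            for case in cases:
                actual = implementation(*case["args"])
                if actual != case["expected"]:
                    failures.append(
                        binding_name + ": " + func_name + " returned " + repr(actual)
                    )
    return failures


complete = {
    "binding_a": {"regexp_extract_all": extract_all_finditer},
    "binding_b": {"regexp_extract_all": extract_all_search_loop},
}
print(run_spec(SPEC, complete))
print(run_spec(SPEC, {"binding_c": {}}))
```

The same runner reports both wrong results and missing functions, so a single spec entry would cover correctness and API parity for each language. How a real harness would dispatch into each language's front end is left open here.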

--
Best regards,
Maciej Szymkiewicz
Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC
Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.