Thanks for the thoughtful responses. I now understand why adding all the functions across all the APIs isn't the default.
To Nick's point, relying on heuristics to gauge user interest, in addition to personal experience, is a good idea. The regexp_extract_all SO thread has 16,000 views <https://stackoverflow.com/questions/47981699/extract-words-from-a-string-column-in-spark-dataframe/47989473>, so I say we set the threshold to 10k, haha, just kidding! Like Sean mentioned, we don't want to add niche functions. Now we just need a way to figure out what's niche!

To Reynold's point on overloading Scala functions, I think we should start trying to limit the number of overloaded functions. Some functions have both columnName and Column object signatures, e.g. approx_count_distinct(columnName: String, rsd: Double) and approx_count_distinct(e: Column, rsd: Double). We can expose just the approx_count_distinct(e: Column, rsd: Double) variety going forward (not suggesting any backwards-incompatible changes, just saying we don't need the columnName-type functions for new stuff). Other functions have one signature that takes a plain Scala value as the second argument and another that takes a Column, e.g. date_add(start: Column, days: Column) and date_add(start: Column, days: Int). We can expose just the date_add(start: Column, days: Column) variety because it's general purpose. Let me know if you think avoiding Scala function overloading will help, Reynold.

Let's brainstorm Nick's idea of creating a framework that would test the Scala / Python / SQL / R implementations in one fell swoop. Seems like that would be a great way to reduce the maintenance burden. Reynold's regexp_extract code from 5 years ago is largely still intact - getting the job done right the first time is another great way to avoid maintenance!

On Thu, Jan 28, 2021 at 6:38 PM Reynold Xin <r...@databricks.com> wrote:

> There's another thing that's not mentioned … it's primarily a problem for
> Scala.
> Due to static typing, we need a very large number of function overloads
> for the Scala version of each function, whereas in SQL/Python they are
> just one. There's a limit on how many functions we can add, and it also
> makes it difficult to browse through the docs when there are a lot of
> functions.
>
> On Thu, Jan 28, 2021 at 1:09 PM, Maciej <mszymkiew...@gmail.com> wrote:
>
>> Just my two cents on the R side.
>>
>> On 1/28/21 10:00 PM, Nicholas Chammas wrote:
>>
>> On Thu, Jan 28, 2021 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> It isn't that regexp_extract_all (for example) is useless outside SQL,
>>> just, where do you draw the line? Supporting 10s of random SQL functions
>>> across 3 other languages has a cost, which has to be weighed against
>>> benefit, which we can never measure well except anecdotally: one or two
>>> people say "I want this" in a sea of hundreds of thousands of users.
>>
>> +1 to this, but I will add that Jira and Stack Overflow activity can
>> sometimes give good signals about API gaps that are frustrating users. If
>> there is an SO question with 30K views about how to do something that
>> should have been easier, then that's an important signal about the API.
>>
>>> For this specific case, I think there is a fine argument that
>>> regexp_extract_all should be added simply for consistency with
>>> regexp_extract. I can also see the argument that regexp_extract was a
>>> step too far, but, what's public is now a public API.
>>
>> I think in this case a few references to where/how people are having to
>> work around missing a direct function for regexp_extract_all could help
>> guide the decision. But that itself means we are making these decisions on
>> a case-by-case basis.
>>
>> From a user perspective, it's definitely conceptually simpler to have SQL
>> functions be consistent and available across all APIs.
>> Perhaps if we had a way to lower the maintenance burden of keeping
>> functions in sync across SQL/Scala/Python/R, it would be easier for
>> everyone to agree to just have all the functions be included across the
>> board all the time.
>>
>> Python aligns quite well with Scala so that might be fine, but R is a bit
>> tricky. Especially the lack of proper namespaces makes it rather risky to
>> have packages that export hundreds of functions. sparklyr handles this
>> neatly with NSE, but I don't think we're going to go this way.
>>
>> Would, for example, some sort of automatic testing mechanism for SQL
>> functions help here? Something that uses a common function testing
>> specification to automatically test the SQL, Scala, Python, and R
>> functions, without requiring maintainers to write tests for each
>> language's version of the functions. Would that address the maintenance
>> burden?
>>
>> With R we don't really test most of the functions beyond simple
>> "callability". Only the complex ones, which require some non-trivial
>> transformations of arguments, are fully tested.
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
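P.S. For anyone skimming the thread, here is a rough plain-Python sketch of the gap being discussed: regexp_extract returns only the first match of a capture group, while regexp_extract_all (the subject of the SO thread above) returns every match. This is not Spark code - the function names mirror the SQL functions and the sample regex is made up purely for illustration; it only shows the semantics users are working around.

```python
import re

def regexp_extract(s: str, pattern: str, idx: int = 1) -> str:
    # First match only: roughly what regexp_extract exposes today.
    m = re.search(pattern, s)
    return m.group(idx) if m else ""

def regexp_extract_all(s: str, pattern: str, idx: int = 1) -> list:
    # Every match: the one-call behavior the SO askers are after.
    return [m.group(idx) for m in re.finditer(pattern, s)]

line = "id=100, id=101, id=102"
print(regexp_extract(line, r"id=(\d+)"))      # -> 100
print(regexp_extract_all(line, r"id=(\d+)"))  # -> ['100', '101', '102']
```

In Scala/Python today the same result takes a UDF or an expr("...") escape hatch into SQL, which is exactly the kind of workaround evidence Nick suggested collecting.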