During my recent experience developing functions, I found that identifying the locations (sql + connect functions.scala + functions.py, FunctionRegistry, plus whatever is required for R) and the standards for adding function signatures was not straightforward (should you use optional args or overloaded functions? Which col/lit helpers should be used when?). Are there docs describing all of the locations and standards for defining a function? If not, that'd be great to have too. A sketch of the two signature styles follows below.
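To make the signature question concrete, here is a minimal sketch of the two styles, implemented on top of the call_udf("...", Column...) resolution path mentioned later in this thread. The names SignatureStyles, myPercentile, and myPercentileDefault are hypothetical, for illustration only, and not part of the Spark API:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{call_udf, lit}

    // Hypothetical sketch; names are made up for illustration.
    object SignatureStyles {
      // Style 1: overloads, the style functions.scala mostly uses.
      // The one-arg variant delegates to the two-arg variant with a default.
      def myPercentile(e: Column): Column = myPercentile(e, lit(0.5))
      def myPercentile(e: Column, percentage: Column): Column =
        call_udf("percentile", e, percentage)

      // Style 2: a single signature with a Scala default argument.
      def myPercentileDefault(e: Column, percentage: Column = lit(0.5)): Column =
        call_udf("percentile", e, percentage)
    }

One commonly cited reason the functions API tends to prefer overloads over default arguments is Java interop: Scala default arguments are awkward to call from Java, while overloads work the same in both languages.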
Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028

On Wed, May 24, 2023 at 12:44 AM Enrico Minack <i...@enrico.minack.dev> wrote:

> +1
>
> Functions available in SQL (more generally, available in one API) should be
> available in all APIs. I am very much in favor of this.
>
> Enrico
>
> On 24.05.23 at 09:41, Hyukjin Kwon wrote:
>
> Hi all,
>
> I would like to discuss adding all SQL functions into the Scala, Python and
> R APIs.
> We have around 175 SQL functions that do not exist in Scala, Python and R.
> For example, we don't have pyspark.sql.functions.percentile, but you can
> invoke it as a SQL function, e.g., SELECT percentile(...).
>
> The reason why we do not have all functions in the first place is that we
> wanted to add only commonly used functions, see also
> https://github.com/apache/spark/pull/21318 (which I agreed with at the time).
>
> However, this has been raised multiple times over the years, from the OSS
> community, the dev mailing list, JIRAs, Stack Overflow, etc.
> It seems confusing which functions are available and which are not.
>
> Yes, we have a workaround. We can call all expressions via expr("...") or
> call_udf("...", Column...).
> But it still seems that this is not very user-friendly, because users expect
> these functions to be available under the functions namespace.
>
> Therefore, I would like to propose adding all expressions to all languages,
> so that Spark is simpler and less confusing, e.g., about which API is in
> functions or not.
>
> Any thoughts?
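For reference, a minimal, self-contained sketch of the expr("...")/call_udf("...") workaround described in the quoted proposal, assuming a local SparkSession and using percentile (a SQL-only function at the time of this thread) as the example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{call_udf, expr, lit}

    object PercentileWorkaround {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(1, 2, 3, 4, 5).toDF("value")

        // Option 1: embed the call in a SQL expression string.
        df.select(expr("percentile(value, 0.5)")).show()

        // Option 2: resolve the function by name and pass Column arguments.
        df.select(call_udf("percentile", $"value", lit(0.5))).show()

        spark.stop()
      }
    }

Both routes resolve the name against the FunctionRegistry at analysis time, which is why they reach functions that have no counterpart under the functions namespace.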