Thank you all for your amazing work on this project. Spark has a great public interface and the source code is clean. The core team has done a great job building and maintaining this project. My emails / GitHub comments focus on the 1% that we might be able to improve.
Pull requests / suggestions for improvements can come across as negative, but I'm nothing but happy & positive about this project. The source code is delightful to read and the internal abstractions are beautiful.

*API consistency*

The SQL, Scala, and Python APIs are generally consistent: they all have a reverse function, for example. Some recent PRs have argued against rolling functions out consistently across the APIs. This seems like a break from the traditional Spark development process, in which functions were implemented in all APIs (except for functions that only make sense in certain APIs, like createDataset and toDS). The default has shifted from consistent application of functions across APIs to "case by case determination".

*Examples*

* The regexp_extract_all function was recently added to the SQL API. It was then added to the Scala API, but later removed from the Scala API <https://github.com/apache/spark/pull/31346> .
* There is an ongoing discussion about whether CalendarType will be added to the Python API <https://github.com/apache/spark/pull/29935> .

*Arguments against adding functions like regexp_extract_all to the Scala API*

* Some of these functions are SQL specific and don't make sense for the other languages.
* Scala users can access the SQL functions via expr.

*Argument rebuttal*

I don't understand the "some of these functions are SQL specific" argument. regexp_extract_all fills a gap in the API: users have been forced to use UDF workarounds for this in the past, and users of all the APIs need this solution.

Using expr isn't developer friendly. Scala / Python users don't want to manipulate SQL strings. Nesting functions in SQL strings is complicated because the quoting and escaping rules are all different. Figuring out how to invoke regexp_replace(col("word1"), "//", "\\,") via expr would be a real pain: you'd need to work out SQL quoting, SQL escaping, and how to reference column names instead of Column objects. Any of the org.apache.spark.sql.functions can be invoked via expr.
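To illustrate the gap regexp_extract_all fills, here is a minimal pure-Python sketch of the logic users have had to wrap in UDF workarounds. This is not Spark code; it only assumes Python's re module semantics, and the function name is chosen to mirror the SQL function, with group 0 meaning the whole match as in regexp_extract.

```python
import re

# Sketch of the regexp_extract_all behaviour that users have emulated
# with UDF workarounds: return every match of a pattern in a string,
# optionally selecting a capture group (idx=0 is the whole match).
def regexp_extract_all(s, pattern, idx=0):
    return [m.group(idx) for m in re.finditer(pattern, s)]

print(regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)", 1))
```

Wrapping something like this in a UDF loses Catalyst optimization and adds serialization overhead, which is exactly why a built-in function is worth exposing in every API.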
The core reason the Scala/Python APIs exist is so that developers don't need to manipulate strings for expr. regexp_extract_all should be added to the Scala API for the same reasons that regexp_extract was added to the Scala API.

*Next steps*

* I'd like to better understand why we've broken from the traditional Spark development process of "consistently implementing functions across all APIs" to "selectively implementing functions in certain APIs".
* Hopefully shift the burden of proof to those in favor of inconsistent application. Consistent application should be the default.

Thank you all for your excellent work on this project.

- Matthew Powers (GitHub: MrPowers)

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org