I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python, and I can't think of a case where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL.
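
For concreteness, here's a minimal PySpark snippet showing the two call styles in question (the session, DataFrame and column name are just illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["foo"])

    # Both forms are accepted today and resolve to the same aggregate:
    df.select(F.max("foo")).show()         # column name as a string
    df.select(F.max(F.col("foo"))).show()  # explicit Column object

Both produce the same plan; the debate is only about whether the first form should be supported consistently or dropped.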
What we do have are inconsistencies and errors in the support of string vs Column arguments, as fixed in the PR. I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side (see the sketch after the quoted message below). Really, I think the question is: do we need to add methods like "def abs(String)" and more in Scala? That would remain inconsistent even if the PySpark side is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR #23882.
> The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads, one
> taking a string that names the column to be operated on, the other taking
> a proper Column object. This allows for two patterns of calling these
> functions, which is a source of inconsistency and generates confusion for
> new users, since it is hard to predict which functions will take a column
> name and which won't.
>
> The PySpark API partially solves this problem by internally converting the
> argument to a Column object before passing it through to the underlying
> JVM implementation. This allows for consistent use of name literals across
> the API, except for a few violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because, for a subset of the SQL functions, PySpark
> uses a functional mechanism (`_create_function`) to call the underlying JVM
> equivalent directly by name, skipping the conversion step. In most cases
> the column-name pattern still works because the Scala API has its own
> support for string arguments, but the functions listed above are exceptions
> there as well.
>
> My proposal was to solve this problem by adding the string support where it
> was missing in the PySpark API. Since this is a purely additive change, it
> doesn't break existing code. Additionally, I find the API sugar to be a
> positive feature, since code like `max("foo")` is more concise and readable
> than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent
> with Python's preference for readability over type protection.
>
> However, upon submission of the PR, a discussion was started about whether
> it wouldn't be better to deprecate string support entirely instead - in
> particular with major release 3.0 in mind. The reasoning, as I understood
> it, was that this approach is more explicit and type safe, which is
> preferred in Java/Scala, and that it reduces the API surface area - and the
> Python API should be consistent with the others as well.
>
> Upon request by @HyukjinKwon I'm submitting this matter for discussion by
> this mailing list.
>
> # Summary
>
> There is an inconsistency in the Scala/Python SQL API, where sometimes you
> can use a column name string as a proxy for a column, and sometimes you
> have to use a proper Column object. There are two approaches to solving it:
> remove the string support entirely, or add it where it is missing. Which
> approach is best?
>
> Hope this is clear.
>
> -- André.
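
For reference, here's a rough sketch of what the additive option amounts to on the PySpark side: convert a string argument to a Column up front, so the JVM side only ever sees a Column and no extra String overloads are needed in Scala. The helper names (as_column, abs_) are hypothetical, not the PR's actual code:

    from pyspark.sql import SparkSession, Column, functions as F

    def as_column(c):
        # Accept either a column name or a Column; always return a Column.
        return c if isinstance(c, Column) else F.col(c)

    def abs_(c):
        # Hypothetical wrapper illustrating the conversion step that the
        # _create_function-generated functions currently skip.
        return F.abs(as_column(c))

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(-1,), (2,)], ["col"])
    df.select(abs_("col")).show()  # accepts a name even where F.abs("col") may not

The appeal is that this kind of fix stays entirely in Python; the open question above is whether Scala should also gain the matching def abs(String)-style overloads.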