# Context

This comes from [SPARK-26979], which became PR #23879 and then PR #23882. The following reflects all the findings made so far.
# Description

Currently, some SQL functions in the Scala API have two overloads: one taking a string that names the column to be operated on, and one taking a proper Column object. This allows two patterns for calling these functions, which is a source of inconsistency and confuses new users, since it is hard to predict which functions accept a column name and which do not.

The PySpark API partially solves this problem by internally converting the argument to a Column object before passing it through to the underlying JVM implementation. This allows name literals to be used consistently across the API, except for a few violations:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

These violations happen because, for a subset of the SQL functions, PySpark uses a functional mechanism (`_create_function`) to call the underlying JVM equivalent directly by name, skipping the conversion step. In most cases the column name pattern still works anyway, because the Scala API has its own support for string arguments, but the functions listed above are exceptions there as well. The first sketch at the end of this section demonstrates the resulting inconsistency, and the second sketches the mechanism behind it.

My proposal was to solve this problem by adding string support where it is missing in the PySpark API (also sketched below). Since this is a purely additive change, it doesn't break existing code. Additionally, I find the API sugar to be a positive feature: code like `max("foo")` is more concise and readable than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent with Python's preference for readability over type safety.

However, upon submission of the PR, a discussion started about whether it would be better to deprecate string support entirely instead, in particular with the major 3.0 release in mind. The reasoning, as I understood it, was that the Column-only approach is more explicit and type safe, which is preferred in Java/Scala, and that it reduces the API surface area; the Python API should stay consistent with the other language APIs as well.

At @HyukjinKwon's request, I'm submitting this matter for discussion on this mailing list.
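To make the inconsistency concrete, here is a minimal sketch, assuming Spark 2.4 behavior and a throwaway local session; the data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower
from pyspark.sql.functions import max as max_  # avoid shadowing the builtin

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "n"])

df.select(max_("n")).show()           # OK: max() accepts a column name string
df.select(lower(col("name"))).show()  # OK: lower() with an explicit Column
df.select(lower("name")).show()       # fails: the string is forwarded to the
                                      # JVM, where lower() has no String overload
```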
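And here is a hedged sketch of the mechanism and of the proposed fix. The first wrapper is simplified from PySpark's internal `_create_function`; the second wrapper's name is hypothetical, but it shows the additive change I have in mind, reusing the existing `_to_java_column()` helper, which already accepts either a Column or a column name string:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_function(name, doc=""):
    # Current mechanism (simplified): forward the argument to the JVM
    # function `name`. A Column is unwrapped to its backing Java object;
    # anything else, e.g. a string, is passed through untouched, so string
    # support depends entirely on the Scala side having a String overload.
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

def _create_converting_function(name, doc=""):
    # Proposed variant (hypothetical name): normalize the argument with
    # _to_java_column() before crossing into the JVM, so a column name
    # string works even when the Scala side has no String overload.
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

# With the second wrapper, lower = _create_converting_function("lower")
# would make lower("name") behave the same as lower(col("name")).
```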