# Context

This comes from [SPARK-26979], which became PR #23879 and then PR
#23882. What follows summarizes the findings made so far.

# Description

Currently, in the Scala API, some SQL functions have two overloads:
one takes a string naming the column to operate on, the other takes a
proper Column object. This allows two calling patterns for the same
function, which is a source of inconsistency and confuses new users,
since it is hard to predict which functions accept a column name and
which do not.

The PySpark API partially solves this problem by internally converting
the argument to a Column object before passing it through to the
underlying JVM implementation. This allows column-name strings to be
used consistently across the API, with a few exceptions (see the
illustration after the list):

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()
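
To make the asymmetry concrete, here is a minimal PySpark sketch. This
reflects the behavior on the affected releases (e.g. the 2.x line); the
exact exception raised may vary by version:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ada",)], ["name"])

# Most functions accept either form interchangeably:
df.select(F.max("name"))           # column-name string
df.select(F.max(F.col("name")))    # explicit Column

# The functions listed above do not; on the affected versions the string
# form fails once the call reaches the JVM, since the Scala side has no
# string overload for them:
df.select(F.upper(F.col("name")))  # works
df.select(F.upper("name"))         # raises an error via Py4J
```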

These exceptions exist because, for a subset of the SQL functions,
PySpark uses a functional mechanism (`_create_function`) to call the
underlying JVM equivalent directly by name, skipping the conversion
step. In most cases the column-name pattern still works, because the
Scala API has its own support for string arguments, but the functions
listed above are exceptions there as well.
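
For reference, the mechanism looks roughly like the following
simplified sketch (not the verbatim source; details vary between
versions):

```python
from pyspark import SparkContext
from pyspark.sql.column import Column

def _create_function(name, doc=""):
    """Return a wrapper that forwards a call to the JVM function `name`."""
    def _(col):
        sc = SparkContext._active_spark_context
        # A Column is unwrapped to its JVM counterpart; anything else,
        # including a plain string, is passed through as-is, so the call
        # only succeeds if the Scala side happens to have a matching
        # string overload.
        jc = getattr(sc._jvm.functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

lower = _create_function("lower", "Converts a string expression to lower case.")
```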

My proposal was to solve this problem by adding string support where
it was missing in the PySpark API. Since this is a purely additive
change, it doesn't break existing code. Additionally, I find the API
sugar a positive feature: code like `max("foo")` is more concise and
readable than `max(col("foo"))`. It adheres to the DRY philosophy and
is consistent with Python's preference for readability over type
safety.
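
The shape of the fix is correspondingly small. A sketch, assuming
PySpark's existing `_to_java_column` helper (which accepts either a
string or a Column) is used inside `_create_function`:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        # _to_java_column resolves a column-name string to a Column before
        # crossing into the JVM, so upper("foo") behaves like
        # upper(col("foo")) for every function built this way.
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _
```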

However, upon submission of the PR, a discussion started about whether
it would be better to deprecate string support entirely instead, in
particular with the major 3.0 release in mind. The reasoning, as I
understood it, was that the Column-only style is more explicit and
type safe, which is preferred in Java/Scala, and that it reduces the
API surface area; the Python API should also stay consistent with the
other language APIs.

Upon request by @HyukjinKwon I'm submitting this matter for discussion
by this mailing list.

# Summary

There is an inconsistency in the Scala/Python SQL API: sometimes a
column-name string can be used as a proxy, and sometimes a proper
Column object is required. There are two approaches to resolving it:
remove string support entirely, or add it where it is missing. Which
approach is best?

Hope this is clear.

-- André.
