I just commented on the PR -- I personally don't think it's worth removing support for, say, max("foo") over max(col("foo")) or max($"foo") in Scala. We can make breaking changes in Spark 3, but this seems like it would unnecessarily break a lot of code. The string arg is more concise in Python, and I can't think of a case where it's particularly ambiguous or confusing; on the contrary, it's more natural coming from SQL.
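
For concreteness, here's a minimal PySpark snippet showing the two call styles in question (the session, DataFrame and column name are just illustrative):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["foo"])

    # Both forms are accepted today and resolve to the same aggregate:
    df.select(F.max("foo")).show()         # column name as a string
    df.select(F.max(F.col("foo"))).show()  # explicit Column object

Both produce the same plan; the debate is only about whether the first form should be supported consistently or dropped.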
What we do have are inconsistencies and errors in the support of string vs Column arguments, as fixed in the PR. I was surprised to see that df.select(abs("col")) throws an error while df.select(sqrt("col")) doesn't. I think that's easy to fix on the Python side (see the sketch after the quoted message below). Really, I think the question is: do we need to add methods like "def abs(String)" and more in Scala? That would remain inconsistent even if the PySpark side is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello <asmello...@gmail.com> wrote:
>
> # Context
>
> This comes from [SPARK-26979], which became PR #23879 and then PR #23882.
> The following reflects all the findings made so far.
>
> # Description
>
> Currently, in the Scala API, some SQL functions have two overloads, one
> taking a string that names the column to be operated on, the other taking
> a proper Column object. This allows for two patterns of calling these
> functions, which is a source of inconsistency and generates confusion for
> new users, since it is hard to predict which functions will take a column
> name and which won't.
>
> The PySpark API partially solves this problem by internally converting the
> argument to a Column object before passing it through to the underlying
> JVM implementation. This allows for consistent use of name literals across
> the API, except for a few violations:
>
> - lower()
> - upper()
> - abs()
> - bitwiseNOT()
> - ltrim()
> - rtrim()
> - trim()
> - ascii()
> - base64()
> - unbase64()
>
> These violations happen because, for a subset of the SQL functions, PySpark
> uses a functional mechanism (`_create_function`) to call the underlying JVM
> equivalent directly by name, skipping the conversion step. In most cases
> the column-name pattern still works because the Scala API has its own
> support for string arguments, but the functions listed above are exceptions
> there as well.
>
> My proposal was to solve this problem by adding the string support where it
> was missing in the PySpark API. Since this is a purely additive change, it
> doesn't break existing code. Additionally, I find the API sugar to be a
> positive feature, since code like `max("foo")` is more concise and readable
> than `max(col("foo"))`. It adheres to the DRY philosophy and is consistent
> with Python's preference for readability over type protection.
>
> However, upon submission of the PR, a discussion was started about whether
> it wouldn't be better to deprecate string support entirely instead - in
> particular with major release 3.0 in mind. The reasoning, as I understood
> it, was that this approach is more explicit and type safe, which is
> preferred in Java/Scala, and that it reduces the API surface area - and the
> Python API should be consistent with the others as well.
>
> Upon request by @HyukjinKwon I'm submitting this matter for discussion by
> this mailing list.
>
> # Summary
>
> There is an inconsistency in the Scala/Python SQL API, where sometimes you
> can use a column name string as a proxy for a column, and sometimes you
> have to use a proper Column object. There are two approaches to solving it:
> remove the string support entirely, or add it where it is missing. Which
> approach is best?
>
> Hope this is clear.
>
> -- André.
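
For reference, here's a rough sketch of what the additive option amounts to on the PySpark side: convert a string argument to a Column up front, so the JVM side only ever sees a Column and no extra String overloads are needed in Scala. The helper names (as_column, abs_) are hypothetical, not the PR's actual code:

    from pyspark.sql import SparkSession, Column, functions as F

    def as_column(c):
        # Accept either a column name or a Column; always return a Column.
        return c if isinstance(c, Column) else F.col(c)

    def abs_(c):
        # Hypothetical wrapper illustrating the conversion step that the
        # _create_function-generated functions currently skip.
        return F.abs(as_column(c))

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(-1,), (2,)], ["col"])
    df.select(abs_("col")).show()  # accepts a name even where F.abs("col") may not

The appeal is that this kind of fix stays entirely in Python; the open question above is whether Scala should also gain the matching def abs(String)-style overloads.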