Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19443
> I think the argument about consistency here is valid, though, I agree
with @jaceklaskowski that changes should go one way or the other, i.e. allow
string column names or remove this option completely.
Yes, but to be more specific, I tend to agree with @cloud-fan's suggestion:
deprecate the current string support first. IMHO, there would have been no
such argument if we had not started supporting string parameters.
> I don't really think that is ambiguous, as functions in SQL should either
accept column object or column name as string, with the exception of lit; well,
and col which accepts only string. Ambiguous functions like concat should
always check for type and if it's a string, force change it to Column. This
enforces usage of string only as the column name, and lit if it is an actual
string literal.
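For reference, the coercion described in the quote above could be sketched
roughly as follows. This is a toy model in plain Python, not Spark's
implementation: `Column`, `col`, `lit`, `to_column` and `concat` here are
hypothetical stand-ins that only borrow the names of the real functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column expression (not pyspark's Column)."""
    kind: str      # "ref" for a column reference, "lit" for a literal value
    value: object


def col(name: str) -> Column:
    # A string passed to col() is always treated as a column name.
    return Column("ref", name)


def lit(value) -> Column:
    # lit() is the only way to express an actual literal, including strings.
    return Column("lit", value)


def to_column(arg) -> Column:
    # Column objects pass through; bare strings are read as column names.
    if isinstance(arg, Column):
        return arg
    if isinstance(arg, str):
        return col(arg)
    raise TypeError(f"expected Column or str, got {type(arg).__name__}")


def concat(*args) -> list:
    # Toy concat: only normalizes its arguments, to show the type check.
    return [to_column(a) for a in args]
```

Under these assumptions, `concat("first", lit(" "), col("last"))` treats
"first" as a column name and " " as a literal, which is the rule the quote
proposes.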
I think ambiguity in parameters is not the main concern; the concern is that
this adds complexity and grows the API (raising maintenance cost, etc.),
particularly on the Scala side, as discussed before. In general, I usually go
-0 or -1 if the workaround is easy and the existing usage does not look too
awkward. It is true that the current mix of column and string types looks a
bit odd, but this could be solved by deprecating the string support, as I said
above.
> Also, in case of Python the argument is also a little bit different. We
need to take into account that many objects like dict or Pandas' DataFrame made
addressing columns by string name more Pythonic way of dealing with columns.
Thus, Python (and to some extent SQL and R) users expect to be able to use
columns by their string names as using a special object for column is a bit
more Java (and, thus, Scala) way of looking at things. Bear in mind, that a lot
of users of these interfaces are not necessarily technical and strict Column
usage argument is a bit alien to them. Thus, I would argue that even if Column
argument would be enforced in Java and Scala API, other APIs should keep the
by-column-name call possible, as it is now done in Python, i.e. by mapping the
string names into Column.
I think in this case we should start from consistent API support in general.
I do like the Pythonic (to be clear, Python-specific) features we currently
support, e.g., the udf decorator, context manager support, etc., but this case
does not sound compelling enough to fix in Python alone, and this is why it's
-0, not -1.