Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19443 > I think the argument about consistency here is valid, though, I agree with @jaceklaskowski that changes should go one way or the other, i.e. allow string column names or remove this option completely. Yea but to be more specific, I tend to agree with @cloud-fan's suggestion, rather deprecating the current string support first. IMHO, It should have been no such arguments if we didn't start to support string parameters. > I don't really think that is ambiguous, as functions in SQL should either accept column object or column name as string, with the exception of lit; well, and col which accepts only string. Ambiguous functions like concat should always check for type and if it's a string, force change it to Column. This enforces usage of string only as the column name, and lit if it is an actual string literal. I think ambiguity in parameters is not the main concern but I think it's adding more complexity and growing APIs (causing maintenance cost and etc.), particularly in Scala side as discussed before. In general, I usually go -0 or -1 if the workaround is easy and the existing usage does not look quite awkward. It is true that the current mixed (column and string types) looks a bit odd but this could be solved by deprecating it as I said above. > Also, in case of Python the argument is also a little bit different. We need to take into account that many objects like dict or Pandas' DataFrame made addressing columns by string name more Pythonic way of dealing with columns. Thus, Python (and to some extent SQL and R) users expect to be able to use columns by their string names as using a special object for column is a bit more Java (and, thus, Scala) way of looking at things. Bear in mind, that a lot of users of these interfaces are not necessarily technical and strict Column usage argument is a bit alien to them. Thus, I would argue that even if Column argument would be enforced in Java and Scala API, other APIs should keep the by-column-name call possibile, as it is now done in Python, i.e. by mapping the string names into Column. I think we should start this from consistent API support first in this case in general. I do like Pythonic way (to be more clear, Python only specific) we currently support, e.g., udf decorator, context manager support and etc. but, I mean, this case sounds not compelling enough for fixing this Python specifically alone, and this is why it's -0 not -1.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org