[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

jsnowacki Fri, 06 Oct 2017 06:30:15 -0700

Github user jsnowacki commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    @HyukjinKwon Thanks for pointing that out. I think the consistency 
argument is valid, though I agree with @jaceklaskowski that the change 
should go one way or the other: either allow string column names everywhere or 
remove that option completely. I don't really think this is ambiguous: SQL 
functions should accept either a `Column` object or a column name as a string, 
with the exception of `lit` (and `col`, which accepts only a string). Ambiguous 
functions like `concat` should always check the argument type and, if it is a 
string, convert it to a `Column`. This enforces that a bare string is only ever 
a column name, and that `lit` is used for an actual string literal. 
    
    Also, for Python the argument is a little different. We need to take into 
account that objects like `dict` or pandas' `DataFrame` have made addressing 
columns by string name the more Pythonic way of dealing with columns. Thus, 
Python (and to some extent SQL and R) users expect to be able to refer to 
columns by their string names, whereas using a special column object is more of 
a Java (and thus Scala) way of looking at things. Bear in mind that many users 
of these interfaces are not necessarily technical, and the strict-`Column` 
argument is a bit alien to them. So I would argue that even if a `Column` 
argument were enforced in the Java and Scala APIs, the other APIs should keep 
by-column-name calls possible, as is now done in Python, i.e. by mapping the 
string names to `Column`.
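
    One way the Python side could keep name-based calls on top of a strict 
`Column`-only function is a thin wrapper. This is only a sketch under my own 
assumptions: `Column` is a simplified stand-in, and `accepts_column_names` and 
`strict_upper` are hypothetical names, not part of the Spark API.

```python
import functools


class Column:
    """Simplified stand-in for a Spark Column (illustration only)."""

    def __init__(self, expr):
        self.expr = expr


def accepts_column_names(func):
    # Hypothetical decorator: map string arguments to Column before calling
    # a function that itself only accepts Column objects.
    @functools.wraps(func)
    def wrapper(*args):
        converted = [Column(a) if isinstance(a, str) else a for a in args]
        return func(*converted)
    return wrapper


def strict_upper(c):
    # Stand-in for a strict, Column-only function (as in a Java/Scala-style API).
    if not isinstance(c, Column):
        raise TypeError("strict_upper() expects a Column")
    return Column("upper(%s)" % c.expr)


# Python-facing version that also accepts a column name.
upper = accepts_column_names(strict_upper)
```

    Here both `upper("name")` and `upper(Column("name"))` work, while 
`strict_upper("name")` still raises `TypeError`.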


