[GitHub] spark issue #19443: [SPARK-22212][SQL][PySpark] Some SQL functions in Python...

HyukjinKwon Fri, 06 Oct 2017 07:08:03 -0700

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19443
  
    > I think the argument about consistency here is valid, though, I agree 
with @jaceklaskowski that changes should go one way or the other, i.e. allow 
string column names or remove this option completely.
    
    Yeah, but to be more specific, I tend to agree with @cloud-fan's suggestion 
of deprecating the current string support first. IMHO, there would have been no 
such argument if we had never started supporting string parameters.
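
    For illustration, deprecating string support first could look roughly like 
the sketch below. It is a minimal sketch under assumptions: `Column` is a 
stand-in class, not `pyspark.sql.Column`, and `upper` is a hypothetical 
function written for this example, not Spark's implementation.

```python
import warnings


class Column:
    """Hypothetical stand-in for a column expression (not pyspark.sql.Column)."""

    def __init__(self, expr):
        self.expr = expr


def upper(arg):
    """Hypothetical function: still accepts a column name, but warns about it."""
    if isinstance(arg, str):
        # Keep the old behaviour working while signalling that it will go away.
        warnings.warn(
            "passing a column name as a string is deprecated; pass a Column instead",
            DeprecationWarning,
            stacklevel=2,
        )
        arg = Column(arg)
    if not isinstance(arg, Column):
        raise TypeError("expected Column or str, got %s" % type(arg).__name__)
    return Column("upper(%s)" % arg.expr)
```

    Existing callers keep working during the deprecation period, and the string 
branch can later be removed without touching the Column path.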
    
    > I don't really think that is ambiguous, as functions in SQL should either 
accept column object or column name as string, with the exception of lit; well, 
and col which accepts only string. Ambiguous functions like concat should 
always check for type and if it's a string, force change it to Column. This 
enforces usage of string only as the column name, and lit if it is an actual 
string literal.
    
    I think ambiguity in parameters is not the main concern. Rather, it adds 
complexity and grows the API surface (with the maintenance cost that follows), 
particularly on the Scala side, as discussed before. In general, I usually go 
-0 or -1 when the workaround is easy and the existing usage does not look too 
awkward. It is true that the current mix of column and string types looks a bit 
odd, but this could be solved by deprecating the string support, as I said above.
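
    For illustration, the type-check-and-convert approach quoted above, and the 
easy workaround, could be sketched as follows. This is a minimal sketch under 
assumptions: `Column`, `col`, `lit`, `concat` and the helper `_to_column` are 
simplified stand-ins defined here for the example, not PySpark's actual code.

```python
class Column:
    """Hypothetical stand-in for a column expression (not pyspark.sql.Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __repr__(self):
        return "Column<%s>" % self.expr


def col(name):
    """Refer to a column by its name."""
    return Column(name)


def lit(value):
    """Wrap a literal value; repr() quotes strings so they differ from names."""
    return Column(repr(value))


def _to_column(arg):
    """Treat a plain string as a column name; pass Column objects through."""
    if isinstance(arg, Column):
        return arg
    if isinstance(arg, str):
        return col(arg)
    raise TypeError("expected Column or str, got %s" % type(arg).__name__)


def concat(*args):
    """Hypothetical concat accepting Column objects or column names."""
    cols = [_to_column(a) for a in args]
    return Column("concat(%s)" % ", ".join(c.expr for c in cols))


# A bare string is always read as a column name; a string literal needs lit().
mixed = concat("first_name", lit(" "), col("last_name"))

# The workaround when strings are not accepted: wrap the names with col().
explicit = concat(col("first_name"), lit(" "), col("last_name"))
```

    Both calls build the same expression, which is why the workaround is cheap 
for callers. The cost is on the library side: every function that accepts both 
types needs this extra conversion step.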
    
    > Also, in case of Python the argument is also a little bit different. We 
need to take into account that many objects like dict or Pandas' DataFrame made 
addressing columns by string name more Pythonic way of dealing with columns. 
Thus, Python (and to some extent SQL and R) users expect to be able to use 
columns by their string names as using a special object for column is a bit 
more Java (and, thus, Scala) way of looking at things. Bear in mind, that a lot 
of users of these interfaces are not necessarily technical and strict Column 
usage argument is a bit alien to them. Thus, I would argue that even if Column 
argument would be enforced in Java and Scala API, other APIs should keep the 
by-column-name call possible, as it is now done in Python, i.e. by mapping the 
string names into Column.
    
    I think that, in general, we should start from consistent API support first 
in this case. I do like the Pythonic (to be clearer, Python-specific) features 
we currently support, e.g., the udf decorator, context manager support, etc., 
but this case does not sound compelling enough to fix for Python alone, and 
this is why I am -0, not -1.




