Reynold Xin created SPARK-8568:
----------------------------------

             Summary: Prevent accidental use of "and" and "or" to build invalid 
expressions in Python
                 Key: SPARK-8568
                 URL: https://issues.apache.org/jira/browse/SPARK-8568
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Reynold Xin
            Assignee: Davies Liu


In Spark DataFrames (and in Pandas as well), the correct way to construct a 
conjunctive expression is to use the bitwise AND operator "&", with each 
comparison parenthesized, i.e.: "(x > 5) & (y > 6)". 

However, many users assume that they should use the Python "and" keyword, 
i.e. "x > 5 and y > 6". Python's boolean evaluation logic reduces 
"x > 5 and y > 6" to just "y > 6": "x > 5" is a Column object, which is 
truthy by default, so "and" returns its second operand. The first condition 
is silently dropped, which is confusing and error-prone.
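
The pitfall can be reproduced without Spark. The sketch below uses a 
hypothetical stand-in class named Expr (an assumption for illustration, not 
the real pyspark Column) that builds expression strings the way a column 
expression builds an expression tree:

```python
class Expr:
    """Hypothetical stand-in for a column expression (not pyspark's Column)."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        # A comparison builds a new expression instead of returning a bool.
        return Expr("({} > {})".format(self.text, other))

    def __and__(self, other):
        # Bitwise & can be overloaded, so it can build a conjunction.
        return Expr("({} AND {})".format(self.text, other.text))

    def __repr__(self):
        return "Expr<{}>".format(self.text)


x, y = Expr("x"), Expr("y")

# `and` cannot be overloaded: it returns its second operand whenever the
# first one is truthy, and a plain object is truthy by default.
silently_wrong = x > 5 and y > 6
correct = (x > 5) & (y > 6)

print(silently_wrong)  # Expr<(y > 6)>
print(correct)         # Expr<((x > 5) AND (y > 6))>
```

The parentheses around each comparison matter because "&" binds more tightly 
than ">" in Python.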

We should override __bool__ (Python 3) and __nonzero__ (Python 2) on Column to 
raise an exception when users apply "and" or "or" to Column expressions.
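
A minimal sketch of what such a guard might look like, again on a 
hypothetical stand-in class (GuardedExpr is an assumption for illustration; 
the exception type and message are not taken from Spark):

```python
class GuardedExpr:
    """Hypothetical column-like class that refuses to be used as a bool."""

    def __init__(self, text):
        self.text = text

    def __gt__(self, other):
        return GuardedExpr("({} > {})".format(self.text, other))

    def __and__(self, other):
        return GuardedExpr("({} AND {})".format(self.text, other.text))

    def __or__(self, other):
        return GuardedExpr("({} OR {})".format(self.text, other.text))

    def __bool__(self):
        # `and` and `or` call this on their first operand, so raising here
        # turns the silent bug into an immediate, explicit error.
        raise ValueError(
            "Cannot convert expression to bool: use '&' for 'and' and "
            "'|' for 'or' when building boolean expressions."
        )

    # Python 2 looks up __nonzero__ instead of __bool__.
    __nonzero__ = __bool__


x, y = GuardedExpr("x"), GuardedExpr("y")

try:
    x > 5 and y > 6
    raised = False
except ValueError as err:
    raised = True
    message = str(err)

combined = (x > 5) | (y > 6)  # the bitwise operators still work
```

With this in place, "x > 5 and y > 6" fails as soon as the expression is 
built rather than producing a filter that ignores the first condition.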

Background: see this blog post 
http://www.nodalpoint.com/unexpected-behavior-of-spark-dataframe-filter-method/




