zero323 commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1129228724

   > Would you give an example in which case we may diverge from pandas? I
   
   Sure thing @xinrong-databricks. Sorry for being enigmatic before. So, a very simple case would be something like this:
   
   ```python
   >>> import pandas as pd
   >>> import pyspark.pandas as ps
   >>> ps.Series(["foo", ""]).all()
   True
   >>> pd.Series(["foo", ""]).all()
   False
   ```
   
   In Pandas we follow the standard Python truthiness convention, so anything non-empty (a non-empty string, a non-empty list) or non-zero is True.
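
   To make that convention concrete, pandas' `all()` / `any()` reduce over plain Python `bool()` semantics, which a quick check illustrates:

   ```python
   # Standard Python truthiness: empty containers, empty strings, and zero
   # are falsy; everything else (including any non-empty string) is truthy.
   print(bool(""))       # False
   print(bool("foo"))    # True
   print(bool(0))        # False
   print(bool(["foo"]))  # True
   ```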
   
   In Spark, in the case of strings, we expect `true` / `false` literals; otherwise the cast evaluates to `NULL`.
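
   As a rough pure-Python sketch of that cast behavior (assuming the usual trimmed, case-insensitive boolean literals Spark accepts; the exact accepted set is an assumption here, not pandas or Spark API):

   ```python
   # Hypothetical helper mimicking Spark's string-to-boolean cast:
   # recognized literals map to True/False, anything else becomes NULL (None).
   TRUE_STRINGS = {"t", "true", "y", "yes", "1"}
   FALSE_STRINGS = {"f", "false", "n", "no", "0"}

   def spark_string_to_boolean(s):
       v = s.strip().lower()
       if v in TRUE_STRINGS:
           return True
       if v in FALSE_STRINGS:
           return False
       return None  # the cast evaluates to NULL

   print(spark_string_to_boolean("true"))  # True
   print(spark_string_to_boolean("foo"))   # None
   print(spark_string_to_boolean(""))      # None
   ```

   Note how this lines up with the plan shown further down: `all()` is `min(coalesce(cast(... as boolean), true))`, so `"foo"` and `""` both cast to `NULL`, get coalesced to `true`, and the reduction returns `True`.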
   
   Furthermore, we have cases where `boolean` cast is not allowed. For example 
this is valid in Pandas
   
   ```python
   >>> pd.Series([["foo"], ["bar"]]).all()
   True
   ```
   but invalid in Pandas on Spark:
   
   ```python
   >>> ps.Series([["foo"], ["bar"]]).all()
   Traceback (most recent call last):
   ...
   AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type 
mismatch: cannot cast array<string> to boolean;
   'Aggregate [unresolvedalias(min(coalesce(cast(0#46 as boolean), true)), 
Some(org.apache.spark.sql.Column$$Lambda$1717/0x0000000100e64040@16600c7b))]
   +- Project [0#46]
      +- Project [__index_level_0__#45L, 0#46, monotonically_increasing_id() AS 
__natural_order__#49L]
         +- LogicalRDD [__index_level_0__#45L, 0#46], false
   ```
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

