zero323 commented on PR #36547:
URL: https://github.com/apache/spark/pull/36547#issuecomment-1129228724
> Would you give an example in which case we may diverge from pandas? I
Sure thing @xinrong-databricks. Sorry for being enigmatic before. A very
simple case would be something like this:
```python
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> ps.Series(["foo", ""]).all()
True
>>> pd.Series(["foo", ""]).all()
False
```
In pandas we follow the standard Python truthiness convention, so anything
non-empty (a non-empty string, a non-empty list) or non-zero is True.
In Spark, in the case of strings, we expect `true` / `false`; otherwise the
cast evaluates to `NULL`.
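To illustrate (without spinning up a session), here is a rough pure-Python model of Spark's `CAST(string AS BOOLEAN)` behavior; the function name and the exact literal sets are my sketch, not Spark code, so treat the details as an approximation:

```python
# Hypothetical model of Spark's StringType -> BooleanType cast (non-ANSI mode):
# recognised literals map to True/False, anything else becomes NULL (None here).
TRUE_LITERALS = {"t", "true", "y", "yes", "1"}
FALSE_LITERALS = {"f", "false", "n", "no", "0"}

def spark_string_to_boolean(s):
    """Approximate CAST(s AS BOOLEAN): trim, lowercase, match literals."""
    v = s.strip().lower()
    if v in TRUE_LITERALS:
        return True
    if v in FALSE_LITERALS:
        return False
    return None  # Spark yields NULL for unrecognised strings like "foo" or ""

print(spark_string_to_boolean("TRUE"))  # True
print(spark_string_to_boolean("foo"))   # None
print(spark_string_to_boolean(""))      # None
```

This is why `ps.Series(["foo", ""]).all()` above sees two `NULL`s rather than one truthy and one falsy value.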
Furthermore, we have cases where a `boolean` cast is not allowed at all. For
example, this is valid in pandas:
```python
>>> pd.Series([["foo"], ["bar"]]).all()
True
```
but invalid in pandas on Spark:
```python
>>> ps.Series([["foo"], ["bar"]]).all()
Traceback (most recent call last):
...
AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type
mismatch: cannot cast array<string> to boolean;
'Aggregate [unresolvedalias(min(coalesce(cast(0#46 as boolean), true)),
Some(org.apache.spark.sql.Column$$Lambda$1717/0x0000000100e64040@16600c7b))]
+- Project [0#46]
+- Project [__index_level_0__#45L, 0#46, monotonically_increasing_id() AS
__natural_order__#49L]
+- LogicalRDD [__index_level_0__#45L, 0#46], false
```
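For reference, the plan above shows why the NULLs don't poison the result: `all()` is planned as `min(coalesce(CAST(col AS BOOLEAN), true))`, so NULLs coalesce to `true` and `min` over booleans acts as a logical AND. A small sketch of that reduction (`ps_all` and `string_cast` are my hypothetical stand-ins, not actual APIs):

```python
# Sketch of the aggregation in the plan: min(coalesce(cast(v), true)).
def ps_all(values, cast):
    """NULL (None) coalesces to True; min over booleans behaves as AND."""
    return min(True if (c := cast(v)) is None else c for v in values)

# A Spark-like string cast: only "true"/"false" are recognised, rest is NULL.
def string_cast(s):
    s = s.strip().lower()
    return True if s == "true" else False if s == "false" else None

# "foo" and "" both cast to NULL, both coalesce to True, so all() is True:
print(ps_all(["foo", ""], string_cast))      # True
print(ps_all(["true", "false"], string_cast))  # False
```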
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]