arminnh opened a new pull request, #46836:
URL: https://github.com/apache/spark/pull/46836
### What changes were proposed in this pull request?
* Avoid an `AttributeError` (see examples below) when mixing a Spark DataFrame
with a Pandas or Pandas-on-Spark DataFrame in `assertDataFrameEqual`, by no
longer calling the non-existent methods `assertAlmostEqual` & `assertEqual` in
`PandasOnSparkTestUtils.assert_eq`.
* In `PandasOnSparkTestUtils.assert_eq`, apply the Pandas-on-Spark flow to
both params `left` and `right`, instead of only `left`, and clarify the error
to state that a Pandas or Pandas-on-Spark object is expected, since this is
not immediately obvious from the current error: `DataFrame, DataFrame,
Series, Series, IndexIndex`
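
For illustration only, the symmetric check described above could look roughly like the sketch below. It is not the code in this PR: `FakePandasFrame`, `FakeSparkFrame`, `SUPPORTED`, `EXPECTED` and `check_args` are hypothetical stand-ins so the snippet runs without PySpark, and `TypeError` stands in for `PySparkAssertionError`.

```python
class FakePandasFrame:
    """Hypothetical stand-in for a Pandas or Pandas-on-Spark DataFrame."""


class FakeSparkFrame:
    """Hypothetical stand-in for a plain Spark DataFrame."""


# Assumption: the real check accepts Pandas and Pandas-on-Spark
# DataFrame, Series and Index types; one stand-in class is enough here.
SUPPORTED = (FakePandasFrame,)
EXPECTED = "Pandas or Pandas-on-Spark DataFrame, Series, or Index"


def check_args(left, right):
    """Validate both arguments the same way and name the offending one."""
    for name, obj in (("left", left), ("right", right)):
        if not isinstance(obj, SUPPORTED):
            # TypeError stands in for PySparkAssertionError in this sketch.
            raise TypeError(
                f"Expected type {EXPECTED} for `{name}` "
                f"but got type {type(obj)}."
            )


# The unsupported object is reported whichever side it is passed on.
for pair in ((FakeSparkFrame(), FakePandasFrame()),
             (FakePandasFrame(), FakeSparkFrame())):
    try:
        check_args(*pair)
    except TypeError as exc:
        print(exc)
```

Looping over both `(name, obj)` pairs keeps the check and the message identical for either position, which is the consistency this change aims for.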
### Why are the changes needed?
* `assertDataFrameEqual` results in `AttributeError` when providing a Spark
DataFrame as the first argument and a Pandas DataFrame or a Pandas-on-Spark
DataFrame as the second argument.
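
A minimal, PySpark-free sketch of this failure mode, assuming the helper class is written as a mixin that expects `unittest.TestCase` methods to be available on `self`. `CompareUtils` and `WithTestCase` are hypothetical names used only for this illustration:

```python
import unittest


class CompareUtils:
    """Hypothetical helper mixin (not PySpark code)."""

    def assert_eq(self, left, right):
        # Relies on a method that only unittest.TestCase provides.
        self.assertAlmostEqual(left, right)


class WithTestCase(CompareUtils, unittest.TestCase):
    """Mixed into a TestCase, the helper can reach assertAlmostEqual."""


def standalone_failure():
    """Return the error raised when the helper is instantiated on its own."""
    try:
        CompareUtils().assert_eq(1.0, 1.0)
    except AttributeError as exc:
        return exc
    return None


WithTestCase().assert_eq(1.0, 1.0)  # passes: TestCase supplies the method
print(type(standalone_failure()).__name__)  # prints "AttributeError"
```

Instantiating the helper directly, as the `PandasOnSparkTestUtils()` call in the tracebacks below does, leaves no `TestCase` in the class hierarchy, so the attribute lookup fails before any comparison happens.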
### Does this PR introduce _any_ user-facing change?
* Better errors will be raised in the situation described above:
  * `PySparkAssertionError` with a message is raised instead of
    `AttributeError`.
  * The `PySparkAssertionError` for mixing Spark & Pandas-on-Spark DataFrames
    is now raised consistently in `PandasOnSparkTestUtils.assert_eq`,
    regardless of which one is left or right.
  * Clarified error message: `Expected type DataFrame, DataFrame, Series,
    Series, IndexIndex, for ...` -> `Expected type Pandas or Pandas-on-Spark
    DataFrame, Series, or Index for ...`
#### Setup:
```
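import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

# The pyspark shell already provides `spark`; create it when running as a script.
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(10,), (20,), (30,)], ["Numbers"])  # Spark DataFrame
df2 = pd.DataFrame(data=[10, 11, 13], columns=["Numbers"])  # Pandas DataFrame
df3 = ps.DataFrame(data=[10, 11, 13], columns=["Numbers"])  # Pandas-on-Spark DataFrame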
```
#### Before:
```
>>> assertDataFrameEqual(df1, df2, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 483, in assert_eq
    self.assertAlmostEqual(lobj, robj)
    ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PandasOnSparkTestUtils' object has no attribute 'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?
>>> assertDataFrameEqual(df2, df1, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 472, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 314, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index, for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
>>>
>>> assertDataFrameEqual(df1, df3, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 483, in assert_eq
    self.assertAlmostEqual(lobj, robj)
    ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PandasOnSparkTestUtils' object has no attribute 'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?
>>> assertDataFrameEqual(df3, df1, ignoreColumnType=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 438, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, DataFrame, Series, Series, IndexIndex, for `expected` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
```
#### After:
```
>>> assertDataFrameEqual(df1, df2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 476, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 300, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index, for `left` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
>>> assertDataFrameEqual(df2, df1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 465, in assert_eq
    _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
  File "...spark/python/pyspark/testing/pandasutils.py", line 311, in _assert_pandas_almost_equal
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index, for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
>>> assertDataFrameEqual(df1, df3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 426, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index for `left` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
>>> assertDataFrameEqual(df3, df1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...spark/python/pyspark/testing/utils.py", line 828, in assertDataFrameEqual
    return PandasOnSparkTestUtils().assert_eq(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...spark/python/pyspark/testing/pandasutils.py", line 438, in assert_eq
    raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: [INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index for `right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
```
### How was this patch tested?
* Manually tested new behavior in local SparkSession.
* Extended existing test case with Pandas-on-Spark DataFrame to confirm the
correct error is raised when the parameters are flipped.
* Added test case with Spark DataFrame & Pandas DataFrame.
### Was this patch authored or co-authored using generative AI tooling?
No