[PR] [WIP][PYTHON] Fix `assertDataFrameEqual` behavior with mixed DataFrame types [spark]

via GitHub Sun, 02 Jun 2024 07:23:14 -0700


arminnh opened a new pull request, #46836:
URL: https://github.com/apache/spark/pull/46836


   ### What changes were proposed in this pull request?
   * Avoid the `AttributeError` (see examples below) raised when a Spark 
DataFrame is mixed with a Pandas or Pandas-on-Spark DataFrame in 
`assertDataFrameEqual`, by no longer calling the non-existent methods 
`assertAlmostEqual` and `assertEqual` in `PandasOnSparkTestUtils.assert_eq`.
   * In `PandasOnSparkTestUtils.assert_eq`, apply the Pandas-on-Spark flow to 
both `left` and `right` instead of only `left`, and clarify the error to state 
that a Pandas or Pandas-on-Spark object is expected, which is not obvious from 
the current message: `DataFrame, DataFrame, Series, Series, IndexIndex`.
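
The `AttributeError` can be reproduced without Spark. The sketch below assumes the cause is a helper class that calls `unittest.TestCase` assertion methods while being instantiated on its own; `CompareUtils` and `CompareTestCase` are hypothetical stand-ins, not PySpark classes.

```python
import unittest


class CompareUtils:
    """Hypothetical helper written as a mixin: it relies on assertion
    methods that only exist when combined with unittest.TestCase."""

    def assert_eq(self, left, right):
        # Looked up on `self` at call time; nothing in this class defines it.
        self.assertAlmostEqual(left, right)


class CompareTestCase(unittest.TestCase, CompareUtils):
    """Mixing in TestCase supplies assertAlmostEqual, so assert_eq works."""

    def runTest(self):
        pass


def standalone_error():
    """Return the error message produced by calling the helper on its own."""
    try:
        CompareUtils().assert_eq(1.0, 1.0)
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `standalone_error()` returns the "has no attribute" message, while `CompareTestCase().assert_eq(1.0, 1.0)` passes, which is why the failure only shows up on the standalone code path.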
   
   ### Why are the changes needed?
   * `assertDataFrameEqual` results in `AttributeError` when providing a Spark 
DataFrame as the first argument and a Pandas DataFrame or a Pandas-on-Spark 
DataFrame as the second argument.
   
   ### Does this PR introduce _any_ user-facing change?
   * Better errors will be raised in the situation described above:
     * `PySparkAssertionError` with a message is raised instead of 
`AttributeError`.
     * The `PySparkAssertionError` for mixed Spark & Pandas-on-Spark 
DataFrames is now raised consistently in `PandasOnSparkTestUtils.assert_eq`, 
regardless of which one is left or right.
     * Clarified error message `Expected type DataFrame, DataFrame, Series, 
Series, IndexIndex,  for ...` -> `Expected type Pandas or Pandas-on-Spark 
DataFrame, Series, or Index for ...`
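
The order-independent behavior described above can be sketched as a single validation applied to both arguments. This is a simplified illustration under stated assumptions, not the code in `pandasutils.py`: `PandasLike`, `SparkLike`, `InvalidArgError`, and `check_supported` are hypothetical stand-ins, and only the message wording is taken from the examples in this PR.

```python
class PandasLike:
    """Hypothetical stand-in for a Pandas or Pandas-on-Spark object."""


class SparkLike:
    """Hypothetical stand-in for a Spark SQL DataFrame."""


class InvalidArgError(AssertionError):
    """Hypothetical stand-in for PySparkAssertionError."""


def check_supported(obj, arg_name):
    # Reject anything that is not a supported object and name the argument.
    if not isinstance(obj, PandasLike):
        raise InvalidArgError(
            "Expected type Pandas or Pandas-on-Spark DataFrame, Series, or Index "
            f"for `{arg_name}` but got type {type(obj)}."
        )


def assert_eq(left, right):
    # Validate both sides the same way so the outcome does not depend on order.
    check_supported(left, "left")
    check_supported(right, "right")
    # The actual comparison is elided in this sketch.


def error_for(left, right):
    """Return the validation message, or None if both arguments are accepted."""
    try:
        assert_eq(left, right)
    except InvalidArgError as exc:
        return str(exc)
    return None
```

With this shape, `error_for(SparkLike(), PandasLike())` names `left` and `error_for(PandasLike(), SparkLike())` names `right`, and both raise the same error type.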
   
   #### Setup:
   
   ```
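   import pandas as pd
   import pyspark.pandas as ps
   from pyspark.sql import SparkSession
   from pyspark.testing import assertDataFrameEqual
   
   # Reuse the active session (e.g. in the pyspark shell) or start a local one.
   spark = SparkSession.builder.getOrCreate()
   
   df1 = spark.createDataFrame([(10,), (20,), (30,)], ["Numbers"])  # Spark
   df2 = pd.DataFrame(data=[10, 11, 13], columns=["Numbers"])  # Pandas
   df3 = ps.DataFrame(data=[10, 11, 13], columns=["Numbers"])  # Pandas-on-Spark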
   ```
   
   #### Before:
   
   ```
   >>> assertDataFrameEqual(df1, df2, ignoreColumnType=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 483, in 
assert_eq
       self.assertAlmostEqual(lobj, robj)
       ^^^^^^^^^^^^^^^^^^^^^^
   AttributeError: 'PandasOnSparkTestUtils' object has no attribute 
'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?
   
   >>> assertDataFrameEqual(df2, df1, ignoreColumnType=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 472, in 
assert_eq
       _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
     File "...spark/python/pyspark/testing/pandasutils.py", line 314, in 
_assert_pandas_almost_equal
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for 
`right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
   
   >>> assertDataFrameEqual(df1, df3, ignoreColumnType=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 483, in 
assert_eq
       self.assertAlmostEqual(lobj, robj)
       ^^^^^^^^^^^^^^^^^^^^^^
   AttributeError: 'PandasOnSparkTestUtils' object has no attribute 
'assertAlmostEqual'. Did you mean: 'assertPandasEqual'?
   
   >>> assertDataFrameEqual(df3, df1, ignoreColumnType=True)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 438, in 
assert_eq
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, DataFrame, Series, 
Series, IndexIndex,  for `expected` but got type <class 
'pyspark.sql.classic.dataframe.DataFrame'>.
   ```
   
   #### After:
   
   ```
   >>> assertDataFrameEqual(df1, df2)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 476, in 
assert_eq
       _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
     File "...spark/python/pyspark/testing/pandasutils.py", line 300, in 
_assert_pandas_almost_equal
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for 
`left` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
   
   >>> assertDataFrameEqual(df2, df1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 465, in 
assert_eq
       _assert_pandas_almost_equal(lobj, robj, rtol=rtol, atol=atol)
     File "...spark/python/pyspark/testing/pandasutils.py", line 311, in 
_assert_pandas_almost_equal
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type DataFrame, Series, Index,  for 
`right` but got type <class 'pyspark.sql.classic.dataframe.DataFrame'>.
   
   >>> assertDataFrameEqual(df1, df3)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 426, in 
assert_eq
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark 
DataFrame, Series, or Index for `left` but got type <class 
'pyspark.sql.classic.dataframe.DataFrame'>.
   
   >>> assertDataFrameEqual(df3, df1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "...spark/python/pyspark/testing/utils.py", line 828, in 
assertDataFrameEqual
       return PandasOnSparkTestUtils().assert_eq(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "...spark/python/pyspark/testing/pandasutils.py", line 438, in 
assert_eq
       raise PySparkAssertionError(
   pyspark.errors.exceptions.base.PySparkAssertionError: 
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type Pandas or Pandas-on-Spark 
DataFrame, Series, or Index for `right` but got type <class 
'pyspark.sql.classic.dataframe.DataFrame'>.
   ```
   
   ### How was this patch tested?
   * Manually tested new behavior in local SparkSession.
   * Extended existing test case with Pandas-on-Spark DataFrame to confirm the 
correct error is raised when the parameters are flipped.
   * Added test case with Spark DataFrame & Pandas DataFrame.
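
A test that flips the parameters might look roughly like the following. It is a self-contained sketch that runs without Spark: `assert_frames_equal`, `SparkLike`, and `PandasLike` are hypothetical stand-ins for the real assertion and DataFrame types, and the actual tests in this PR may be structured differently.

```python
import unittest


class SparkLike:
    """Hypothetical stand-in for a Spark SQL DataFrame."""


class PandasLike:
    """Hypothetical stand-in for a Pandas or Pandas-on-Spark DataFrame."""


def assert_frames_equal(left, right):
    """Hypothetical stand-in for the assertion under test."""
    for name, obj in (("left", left), ("right", right)):
        if not isinstance(obj, PandasLike):
            raise AssertionError(
                f"unsupported type for `{name}`: {type(obj).__name__}"
            )


class MixedTypeOrderTest(unittest.TestCase):
    def test_error_in_both_orders(self):
        # Each case pairs an argument order with the argument expected to be named.
        cases = [
            (SparkLike(), PandasLike(), "left"),
            (PandasLike(), SparkLike(), "right"),
        ]
        for left, right, bad_arg in cases:
            with self.subTest(bad_arg=bad_arg):
                with self.assertRaisesRegex(AssertionError, f"`{bad_arg}`"):
                    assert_frames_equal(left, right)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MixedTypeOrderTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` keeps both orders in one test while still reporting which order failed.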
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   



