Oscar Delicaat created SPARK-41370:
--------------------------------------

             Summary: Add data frame equality check for testing purposes.
                 Key: SPARK-41370
                 URL: https://issues.apache.org/jira/browse/SPARK-41370
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Oscar Delicaat


We would like to have the functionality suggested in 
https://issues.apache.org/jira/browse/SPARK-28172 . It got closed by an 
unrelated story. The comment on the story 

> Wouldn't this require to execute both DataFrames and collect the data into 
> driver side? When the datasets are large, it's very easy for users to shoot 
> them in the foot. I won't do that without an explicit plan and design doc for 
> all other operators.

does not apply to this use case: the check will be used for unit testing, and 
the DataFrames being compared will be small.

We are currently using the pandas `assert_frame_equal` function 
(https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/_testing/asserters.py#L1135-L1358),
which works very well. However, we are running into issues because pandas 
supports only a subset of the timestamps supported by Spark 
(https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-timestamp-limits).
This is the first time this has become a blocker for us, so we would like 
equivalent functionality in Spark: a way to validate equality and get feedback 
on what is different. As far as I can see, nothing in PySpark currently 
supports this.
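
To illustrate the kind of check being requested, here is a minimal sketch in 
plain Python. The name `assert_rows_equal` is hypothetical and is not an 
existing PySpark API. It assumes the rows have already been brought to the 
driver (for example with `DataFrame.collect()`, which is reasonable only for 
small test data), that row order does not matter, and that all values are 
hashable and compared exactly. Because it works on Python `datetime` objects, 
it is not affected by the pandas nanosecond timestamp range.

```python
from collections import Counter
from datetime import datetime


def assert_rows_equal(actual, expected):
    """Hypothetical helper: compare two small collections of rows, ignoring order.

    Rows may be plain tuples or pyspark.sql.Row objects (Row subclasses tuple),
    e.g. the result of DataFrame.collect() in a unit test.
    """
    # Count rows so that duplicated rows are compared by multiplicity.
    actual_counts = Counter(tuple(row) for row in actual)
    expected_counts = Counter(tuple(row) for row in expected)

    missing = expected_counts - actual_counts      # expected but not found
    unexpected = actual_counts - expected_counts   # found but not expected

    if missing or unexpected:
        raise AssertionError(
            "Rows differ\n"
            f"  missing:    {sorted(missing.elements(), key=repr)}\n"
            f"  unexpected: {sorted(unexpected.elements(), key=repr)}"
        )


# A timestamp far outside the range pandas can represent with nanosecond precision.
old = datetime(1000, 1, 1)

# Same rows in a different order: passes silently.
assert_rows_equal([(1, old), (2, old)], [(2, old), (1, old)])
```

A real implementation would also need to compare schemas and decide how to 
handle floating-point tolerance and nested, unhashable values; the sketch 
leaves those out.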

Please let me know any feedback you have, thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
