Oscar Delicaat created SPARK-41370:
--------------------------------------

             Summary: Add data frame equality check for testing purposes.
                 Key: SPARK-41370
                 URL: https://issues.apache.org/jira/browse/SPARK-41370
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Oscar Delicaat

We would like to have the functionality suggested in https://issues.apache.org/jira/browse/SPARK-28172. It got closed by an unrelated story. The comment on the story

> Wouldn't this require to execute both DataFrames and collect the data into
> driver side? When the datasets are large, it's very easy for users to shoot
> them in the foot. I won't do that without an explicit plan and design doc for
> all other operators.

does not make sense, since this will be used for unit testing purposes and the DataFrames compared will be small.

We are currently using the pandas `assert_frame_equal` functionality (https://github.com/pandas-dev/pandas/blob/v1.5.2/pandas/_testing/asserters.py#L1135-L1358), which works very well. However, we are having issues with pandas because it supports only a subset of the timestamps supported by Spark (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-timestamp-limits).

This is the first time this has become a blocker for us, so we would like to have similar functionality in Spark to validate equality and get feedback on what is different. As far as I could see, nothing in PySpark currently supports this.

Please let me know any feedback you have, thanks!
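
For illustration only, the kind of check requested above might look roughly like the sketch below. It is not an existing PySpark API: `diff_rows` and `assert_rows_equal` are hypothetical names, and the sketch assumes both small test DataFrames have already been collected to the driver as plain tuples. It ignores row order, counts duplicate rows, and reports what differs. The sample data uses a timestamp from the year 1500, which lies outside the range pandas nanosecond timestamps can represent.

```python
from collections import Counter
from datetime import datetime


def diff_rows(actual, expected, columns):
    """Return human-readable differences between two lists of row tuples.

    Hypothetical helper, not part of PySpark. With PySpark the inputs could be
    built as [tuple(r) for r in df.collect()] and columns taken from df.columns.
    Row order is ignored and duplicate rows are counted. Rows must be hashable,
    so array or map columns would need converting first.
    """
    actual_counts, expected_counts = Counter(actual), Counter(expected)
    problems = []
    # Rows that are expected but absent (or present too few times).
    for row, n in (expected_counts - actual_counts).items():
        problems.append(f"missing x{n}: {dict(zip(columns, row))}")
    # Rows that are present but not expected (or present too many times).
    for row, n in (actual_counts - expected_counts).items():
        problems.append(f"unexpected x{n}: {dict(zip(columns, row))}")
    return problems


def assert_rows_equal(actual, expected, columns):
    """Raise AssertionError listing every difference found by diff_rows."""
    problems = diff_rows(actual, expected, columns)
    if problems:
        raise AssertionError("DataFrames differ:\n" + "\n".join(problems))


# Small example with a timestamp that pandas cannot hold at nanosecond precision.
columns = ["id", "ts"]
expected = [(1, datetime(1500, 1, 1)), (2, datetime(2020, 1, 1))]
actual = [(2, datetime(2020, 1, 1)), (1, datetime(1500, 1, 2))]

problems = diff_rows(actual, expected, columns)
same = diff_rows(list(reversed(expected)), expected, columns)
```

Here `problems` holds one "missing" and one "unexpected" entry for the row with `id` 1, while `same` is empty because only the row order differs. A real implementation would presumably also compare schemas and handle floating-point tolerance, which this sketch leaves out.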