Aimilios Tsouvelekakis created SPARK-54090:
----------------------------------------------

             Summary: AssertDataframeEqual carries rows when showing differences
                 Key: SPARK-54090
                 URL: https://issues.apache.org/jira/browse/SPARK-54090
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.0.1
            Reporter: Aimilios Tsouvelekakis


When we try to do assertDataFrameEqual, on two dataframes that have not the 
same amount of rows, the output gets cascading from the difference till the 
end. Why this is happening:

 
{code:java}
    def assert_rows_equal(
        rows1: List[Row], rows2: List[Row], maxErrors: int = None, 
showOnlyDiff: bool = False
    ):
        __tracebackhide__ = True
        zipped = list(zip_longest(rows1, rows2))
        diff_rows_cnt = 0
        diff_rows = []
        has_diff_rows = False        rows_str1 = ""
        rows_str2 = ""        # count different rows
        for r1, r2 in zipped:
            if not compare_rows(r1, r2):
                diff_rows_cnt += 1
                has_diff_rows = True
                if includeDiffRows:
                    diff_rows.append((r1, r2))
                rows_str1 += str(r1) + "\n"
                rows_str2 += str(r2) + "\n"
                if maxErrors is not None and diff_rows_cnt >= maxErrors:
                    break
            elif not showOnlyDiff:
                rows_str1 += str(r1) + "\n"
                rows_str2 += str(r2) + "\n"        generated_diff = 
_context_diff(
            actual=rows_str1.splitlines(), expected=rows_str2.splitlines(), 
n=len(zipped)
        )        if has_diff_rows:
            error_msg = "Results do not match: "
            percent_diff = (diff_rows_cnt / len(zipped)) * 100
            error_msg += "( %.5f %% )" % percent_diff
            error_msg += "\n" + "\n".join(generated_diff)
            data = diff_rows if includeDiffRows else None
            raise PySparkAssertionError(
                errorClass="DIFFERENT_ROWS", messageParameters={"error_msg": 
error_msg}, data=data
            ) {code}
The problem lies in the way that we zip the lines

 
{code:java}
zipped = list(zip_longest(rows1, rows2)){code}
With zip longest we assume that the rows are in order and we do position by 
position comparison but it does not work well with checkRowOrder which defaults 
to False.


if I have 1 line difference in 100 line dataframe the result percentage won't 
be 1% but the amount of rows that cascade towards that difference. 

The best solution here would be either a set based comparison



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to