pscheurig created SPARK-51062:
---------------------------------

             Summary: assertSchemaEqual Does Not Compare Decimal Precision and 
Scale
                 Key: SPARK-51062
                 URL: https://issues.apache.org/jira/browse/SPARK-51062
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.5.4, 3.5.3, 3.5.2, 3.5.1, 3.5.0
            Reporter: pscheurig


h1. Summary

The {{assertSchemaEqual}} function in PySpark's testing utilities does not 
properly compare DecimalType fields: it only checks the base type name 
("decimal") without comparing the precision and scale parameters. This 
significantly reduces the function's usefulness for schemas containing decimal 
fields.
h2. Version
 * Apache Spark Version: >=3.5.0
 * Component: PySpark Testing Utils
 * Function: {{pyspark.testing.assertSchemaEqual}}

h2. Description

When comparing two schemas containing DecimalType fields with different 
precision and scale parameters, {{assertSchemaEqual}} incorrectly reports them 
as equal because it only compares the base type name ("decimal") without 
considering the precision and scale parameters.
h3. Current Behavior
{code:python}
from pyspark.sql.types import StructType, StructField, DecimalType
from pyspark.testing import assertSchemaEqual

s1 = StructType(
    [
        StructField("price_102", DecimalType(10, 2), True),
        StructField("price_80", DecimalType(8, 0), True),
    ]
)

s2 = StructType(
    [
        StructField("price_102", DecimalType(10, 4), True),  # Different scale
        StructField(
            "price_80", DecimalType(10, 2), True
        ),  # Different precision and scale
    ]
)

# This passes when it should fail
assertSchemaEqual(s1, s2)

{code}
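The root cause is visible directly on {{DecimalType}}: {{typeName()}} discards the precision and scale, so any comparison based on it alone treats all decimals as identical, even though the type objects themselves compare unequal:
{code:python}
from pyspark.sql.types import DecimalType

# typeName() drops the parameters -- both return just "decimal"
assert DecimalType(10, 2).typeName() == DecimalType(10, 4).typeName()

# but the type objects are not equal, and simpleString() keeps the parameters
assert DecimalType(10, 2) != DecimalType(10, 4)
assert DecimalType(10, 2).simpleString() == "decimal(10,2)"
{code}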
h3. Expected Behavior

The function should compare both precision and scale parameters of DecimalType 
fields and raise a PySparkAssertionError when they differ, similar to how it 
handles other type mismatches. The error message should indicate which fields 
have mismatched decimal parameters.
h2. Impact

This issue affects data quality validation and testing scenarios where precise 
decimal specifications are crucial, such as:
 * Financial data processing where decimal precision and scale are critical
 * ETL validation where source and target schemas must match exactly

h2. Suggested Fix

The {{compare_datatypes_ignore_nullable}} function in 
{{pyspark/testing/utils.py}} should be enhanced to compare precision and scale 
parameters when dealing with decimal types:
{code:python}
def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any):
    if dt1.typeName() == dt2.typeName():
        if dt1.typeName() == "decimal":
            return dt1.precision == dt2.precision and dt1.scale == dt2.scale
        elif dt1.typeName() == "array":
            return compare_datatypes_ignore_nullable(
                dt1.elementType, dt2.elementType
            )
        elif dt1.typeName() == "struct":
            return compare_schemas_ignore_nullable(dt1, dt2)
        else:
            return True
    else:
        return False
{code}
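As a quick sanity check, here is a self-contained sketch of the patched comparator (the struct branch is inlined here for illustration, since {{compare_schemas_ignore_nullable}} lives elsewhere in {{pyspark/testing/utils.py}}), verified against the decimal fields from the example above:
{code:python}
from typing import Any

from pyspark.sql.types import ArrayType, DecimalType, StructField, StructType


def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any) -> bool:
    # Patched version: decimals must match on precision and scale.
    if dt1.typeName() != dt2.typeName():
        return False
    if dt1.typeName() == "decimal":
        return dt1.precision == dt2.precision and dt1.scale == dt2.scale
    if dt1.typeName() == "array":
        return compare_datatypes_ignore_nullable(dt1.elementType, dt2.elementType)
    if dt1.typeName() == "struct":
        # Simplified stand-in for compare_schemas_ignore_nullable:
        # same field names and recursively equal types, ignoring nullability.
        return len(dt1.fields) == len(dt2.fields) and all(
            f1.name == f2.name
            and compare_datatypes_ignore_nullable(f1.dataType, f2.dataType)
            for f1, f2 in zip(dt1.fields, dt2.fields)
        )
    return True


# Mismatched decimals are now rejected, matching types still pass.
assert not compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 4))
assert compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 2))

# The check also propagates through arrays and structs.
assert not compare_datatypes_ignore_nullable(
    ArrayType(DecimalType(8, 0)), ArrayType(DecimalType(10, 2))
)
assert not compare_datatypes_ignore_nullable(
    StructType([StructField("price", DecimalType(10, 2))]),
    StructType([StructField("price", DecimalType(10, 4))]),
)
{code}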



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
