pscheurig created SPARK-51062:
---------------------------------
Summary: assertSchemaEqual Does Not Compare Decimal Precision and Scale
Key: SPARK-51062
URL: https://issues.apache.org/jira/browse/SPARK-51062
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.5.4, 3.5.3, 3.5.2, 3.5.1, 3.5.0
Reporter: pscheurig
h1. Summary
The {{assertSchemaEqual}} function in PySpark's testing utilities does not
properly compare DecimalType fields: it only checks the base type name
("decimal") and ignores the precision and scale parameters. This significantly
reduces the utility of the function for schemas containing decimal fields.
h2. Version
* Apache Spark Version: >=3.5.0
* Component: PySpark Testing Utils
* Function: {{pyspark.testing.assertSchemaEqual}}
h2. Description
When comparing two schemas whose DecimalType fields differ in precision and/or
scale, {{assertSchemaEqual}} incorrectly reports them as equal, because the
comparison only looks at the base type name ("decimal") and never at the
precision and scale parameters.
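The root cause is easy to see in isolation: {{typeName()}} drops the parameters, so a name-only comparison cannot tell parameterized decimals apart. A minimal illustration:
{code:python}
from pyspark.sql.types import DecimalType

# typeName() returns only the base name, without precision/scale
print(DecimalType(10, 2).typeName())  # 'decimal'
print(DecimalType(10, 4).typeName())  # 'decimal'

# the parameters are only visible via the attributes or simpleString()
print(DecimalType(10, 2).simpleString())  # 'decimal(10,2)'
print(DecimalType(10, 2).precision, DecimalType(10, 2).scale)  # 10 2
{code}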
h3. Current Behavior
{code:python}
from pyspark.sql.types import StructType, StructField, DecimalType
from pyspark.testing import assertSchemaEqual

s1 = StructType(
    [
        StructField("price_102", DecimalType(10, 2), True),
        StructField("price_80", DecimalType(8, 0), True),
    ]
)
s2 = StructType(
    [
        StructField("price_102", DecimalType(10, 4), True),  # Different scale
        StructField("price_80", DecimalType(10, 2), True),  # Different precision and scale
    ]
)

# This passes when it should fail
assertSchemaEqual(s1, s2)
{code}
h3. Expected Behavior
The function should compare both precision and scale parameters of DecimalType
fields and raise a PySparkAssertionError when they differ, similar to how it
handles other type mismatches. The error message should indicate which fields
have mismatched decimal parameters.
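Assuming the fix suggested below, the call from the example above would be expected to fail roughly like this (the exact message wording is illustrative, not actual output):
{code:python}
from pyspark.errors import PySparkAssertionError

try:
    assertSchemaEqual(s1, s2)  # s1, s2 from the example above
except PySparkAssertionError as e:
    # expected: a schema diff flagging price_102 (decimal(10,2) != decimal(10,4))
    # and price_80 (decimal(8,0) != decimal(10,2))
    print(e)
{code}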
h2. Impact
This issue affects data quality validation and testing scenarios where precise
decimal specifications are crucial, such as:
* Financial data processing where decimal precision and scale are critical
* ETL validation where source and target schemas must match exactly
h2. Suggested Fix
The {{compare_datatypes_ignore_nullable}} function in
{{pyspark/testing/utils.py}} should be enhanced to compare precision and scale
parameters when dealing with decimal types:
{code:python}
def compare_datatypes_ignore_nullable(dt1: Any, dt2: Any) -> bool:
    if dt1.typeName() == dt2.typeName():
        if dt1.typeName() == "decimal":
            # Matching type names are not enough for decimals:
            # also require identical precision and scale
            return dt1.precision == dt2.precision and dt1.scale == dt2.scale
        elif dt1.typeName() == "array":
            # Compare element types recursively
            return compare_datatypes_ignore_nullable(dt1.elementType, dt2.elementType)
        elif dt1.typeName() == "struct":
            # Defer to the schema-level comparison for nested structs
            return compare_schemas_ignore_nullable(dt1, dt2)
        else:
            return True
    else:
        return False
{code}
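For a quick sanity check outside the test suite, the proposed comparison can be exercised standalone (a sketch; in {{pyspark/testing/utils.py}} it is a helper used by {{assertSchemaEqual}}, so the real verification belongs in the existing unit tests):
{code:python}
from pyspark.sql.types import DecimalType

# Same base type name, different parameters: must now compare unequal
assert not compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 4))
assert not compare_datatypes_ignore_nullable(DecimalType(8, 0), DecimalType(10, 2))

# Identical precision and scale still compare equal
assert compare_datatypes_ignore_nullable(DecimalType(10, 2), DecimalType(10, 2))
{code}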