sivakumardg opened a new issue, #56738:
URL: https://github.com/apache/spark/issues/56738
## Problem Statement
When reading Oracle NUMBER columns (without explicit precision/scale) via
JDBC, Spark's OracleDialect maps them to DecimalType(38, 10) with a hardcoded
scale of 10. This causes silent data precision loss for values that require
more than 10 decimal places, with no warning or error to the user.
## Current Behavior
Oracle bare NUMBER (no precision/scale defined) returns via JDBC metadata:
- getPrecision() = 0
- getScale() = -127
Spark's OracleDialect.scala handles this with a hardcoded scale of 10:
case 0 => Option(DecimalType(DecimalType.MAX_PRECISION, 10))
So any bare NUMBER column becomes Decimal(38, 10) in Spark — regardless of
the actual data stored in the column.
## Steps to Reproduce
Oracle setup:
CREATE TABLE test_number (
col1 NUMBER, -- stores small decimal values
col2 NUMBER -- stores large integer values
);
INSERT INTO test_number VALUES (0.0000123456789123456,
12345678901234567890);
PySpark read:
df = spark.read.format("jdbc") \
.option("url", oracle_url) \
.option("dbtable", "test_number") \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
df.printSchema()
# root
# |-- col1: decimal(38,10)
# |-- col2: decimal(38,10)
df.show(truncate=False)
# col1: 0.0000123457 <- SILENT DATA LOSS!
# original value: 0.0000123456789123456
# col2: 12345678901234567890 <- OK
## Expected Behavior
Spark should not silently truncate data. When the scale of a bare NUMBER
column cannot be determined from JDBC metadata, the user should at minimum be
warned that precision loss may occur, or provided a way to configure the
fallback behavior.
## Why This is Critical
- The scale of bare NUMBER varies per column in Oracle—hardcoding 10
cannot be correct for all cases
- The precision loss is completely silent—no warning, no error, no
indication in the schema that data has been truncated
- NUMBER(38, 25) works correctly -> Decimal(38, 25) in Spark, proving the
issue is only with bare NUMBER
- There is currently no global workaround across multiple tables without
knowing all column names upfront
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]