andygrove opened a new issue, #4344:
URL: https://github.com/apache/datafusion-comet/issues/4344

   ## Description
   
   `native_datafusion` silently accepts integer-to-decimal Parquet reads where
   the requested decimal type cannot represent the integer values in the file.
   Spark's vectorized reader rejects these conversions with
   `SchemaColumnConvertNotSupportedException` (per
   `ParquetVectorUpdaterFactory.getUpdater`), because reading e.g. an INT64
   column into a `DECIMAL(p, s)` too narrow to represent every possible value
   is unsafe. `native_datafusion` instead returns incorrect (truncated or
   overflowed) values.
   
   This is the integer-to-decimal counterpart to #4297 (primitive-to-primitive 
numeric/date conversions) and #4343 (decimal-to-decimal narrowing).
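
   For reference, Spark's `DecimalType` companion object defines a canonical
   decimal shape for each integer type; a target decimal narrower than these
   cannot hold every value of the source type:

   ```scala
   import org.apache.spark.sql.types.DecimalType

   // Canonical decimal shapes for integer types, from Spark's
   // org.apache.spark.sql.types.DecimalType companion object:
   DecimalType.ByteDecimal  // DecimalType(3, 0):  Byte needs up to 3 digits
   DecimalType.ShortDecimal // DecimalType(5, 0):  Short needs up to 5 digits
   DecimalType.IntDecimal   // DecimalType(10, 0): Int needs up to 10 digits
   DecimalType.LongDecimal  // DecimalType(20, 0): Long needs up to 19 digits
   ```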
   
   ## Affected tests (Spark 4.1.1, `dev/diffs/4.1.1.diff`)
   
   Currently tagged `IgnoreCometNativeDataFusion` pointing at the umbrella 
#3720:
   
   - `ParquetTypeWideningSuite` — `unsupported parquet conversion $fromType -> $toType`
     (the second occurrence in the suite, the integer→decimal block at line ~264).
     Iterates pairs such as:
     - `ByteType    -> DECIMAL(1, 0)`
     - `ShortType   -> DECIMAL(ByteDecimal.precision, 0)` / `DECIMAL(ByteDecimal.precision + 1, 1)` etc.
     - `IntegerType -> ShortDecimal` / `DECIMAL(IntDecimal.precision - 1, 0)` etc.
     - `LongType    -> IntDecimal` / `DECIMAL(LongDecimal.precision - 1, 0)` etc.
     Expects `SchemaColumnConvertNotSupportedException` when the vectorized
     reader is enabled and the target decimal precision is too small to hold
     the integer (a sketch of the implied rule follows this list).
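
   A sketch of the acceptance rule these pairs imply (a hypothetical helper,
   not Spark's actual implementation; exact scale handling should follow
   Spark's rules):

   ```scala
   import org.apache.spark.sql.types._

   // Hypothetical predicate mirroring the rule implied by the test pairs
   // above: a DECIMAL(p, s) can hold every value of an integer type only
   // if its integral digits (p - s) cover the type's maximum digit count.
   def decimalCanHoldInteger(from: DataType, to: DecimalType): Boolean = {
     val requiredDigits = from match {
       case ByteType    => Some(DecimalType.ByteDecimal.precision)  // 3
       case ShortType   => Some(DecimalType.ShortDecimal.precision) // 5
       case IntegerType => Some(DecimalType.IntDecimal.precision)   // 10
       case LongType    => Some(DecimalType.LongDecimal.precision)  // 20
       case _           => None
     }
     requiredDigits.exists(d => to.precision - to.scale >= d)
   }

   // Matches the pairs above:
   // decimalCanHoldInteger(ByteType,  DecimalType(1, 0))  == false (rejected)
   // decimalCanHoldInteger(ShortType, DecimalType(4, 1))  == false (rejected)
   // decimalCanHoldInteger(LongType,  DecimalType(20, 0)) == true  (accepted)
   ```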
   
   The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under 
#3720 there as well.
   
   ## Reproduction
   
   ```scala
   import org.apache.comet.CometConf
   import org.apache.spark.sql.internal.SQLConf
   
   withSQLConf(
     CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
     SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
     withTempPath { dir =>
       val path = dir.getCanonicalPath
       Seq(123456L).toDF("c")
         .selectExpr("cast(c as bigint) as c")
         .write.parquet(path)
       // LongType is INT64 in Parquet; a target DECIMAL(p, 0) with p < 19 cannot
       // represent every Long, so Spark rejects it. native_datafusion accepts it.
       spark.read.schema("c decimal(5, 0)").parquet(path).show()
     }
   }
   ```
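
   For comparison, with Comet disabled the same read is expected to fail in
   Spark's vectorized reader instead of returning data. A sketch, reusing
   `path` from the repro above (whether the exception surfaces directly or
   wrapped in a `SparkException` is an assumption here):

   ```scala
   import org.apache.spark.SparkException

   // With Comet disabled, Spark's vectorized Parquet reader should reject
   // the narrowing read. `path` is the directory written in the repro above.
   withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
     val e = intercept[SparkException] {
       spark.read.schema("c decimal(5, 0)").parquet(path).collect()
     }
     // Expect SchemaColumnConvertNotSupportedException in e's cause chain.
   }
   ```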
   
   ## Suggested approach
   
   Same direction as #4297 / #4343: extend the integer→decimal branch of the 
schema adapter / `replace_with_spark_cast` to mirror Spark's allowlist — only 
accept conversions where the target decimal precision is large enough to hold 
the integer's range (and scale is 0, or handled per Spark's rules). Reject 
everything else with `SparkError::ParquetSchemaConvert`.
   
   ## Parent issue
   
   Split from umbrella #3720.

