acvictor commented on issue #11494:
URL: 
https://github.com/apache/incubator-gluten/issues/11494#issuecomment-3875668931

   @baibaichen I have thought of a fix for Delta/Parquet with the Velox 
backend: Gluten can pass the EXISTS_DEFAULT metadata from Spark through the 
Substrait layer to Velox, and Velox's Parquet reader can then handle it.
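
To make the first half of that concrete, here is a minimal Python sketch of collecting the per-column defaults from a Spark schema serialized as JSON. It assumes the default is stored in each field's metadata under the key `EXISTS_DEFAULT` as SQL text. The helper name `collect_exists_defaults` and the example schema are hypothetical, and the sketch does not show how the values would actually be encoded in Substrait.

```python
import json

# Assumed metadata key carrying the existence default on a Spark StructField.
EXISTS_DEFAULT_KEY = "EXISTS_DEFAULT"


def collect_exists_defaults(schema_json: str) -> dict:
    """Return {column name: EXISTS_DEFAULT SQL text} for top-level fields.

    Hypothetical helper: nested struct fields are ignored in this sketch.
    """
    schema = json.loads(schema_json)
    defaults = {}
    for field in schema.get("fields", []):
        metadata = field.get("metadata") or {}
        if EXISTS_DEFAULT_KEY in metadata:
            defaults[field["name"]] = metadata[EXISTS_DEFAULT_KEY]
    return defaults


# Hypothetical schema: column `b` was added later with DEFAULT 42.
EXAMPLE_SCHEMA = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "a", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "b", "type": "integer", "nullable": True,
         "metadata": {"CURRENT_DEFAULT": "42", "EXISTS_DEFAULT": "42"}},
    ],
})

exists_defaults = collect_exists_defaults(EXAMPLE_SCHEMA)
```

Only columns that actually carry the key end up in the map, so columns without a default would keep today's behavior.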
   
   Spark's reader handles EXISTS_DEFAULT at the file-read level:
   - When opening a Parquet file, it compares the physical file schema against 
the requested read schema
   - Columns in the read schema but missing from the file are tracked in 
missingColumns
   - For missing columns with EXISTS_DEFAULT metadata, it fills the column 
vector with the default value instead of NULLs
   - Gluten/Velox bypasses this entirely because it doesn't receive the 
EXISTS_DEFAULT metadata 
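
The file-read behavior in the list above can be sketched roughly as follows. This is a simplified Python model, not Spark's or Velox's actual code: columns are plain lists, `read_with_exists_defaults` is a hypothetical name, and the defaults are assumed to be already evaluated from their SQL text into values.

```python
def read_with_exists_defaults(file_columns, read_schema, exists_defaults):
    """Model of filling columns that are missing from a file.

    file_columns:    {name: list of values} physically present in the file
    read_schema:     list of requested column names
    exists_defaults: {name: already-evaluated default value}
    Returns (batch, missing_columns).
    """
    # Row count comes from any physical column (0 if the file has none).
    num_rows = len(next(iter(file_columns.values()), []))

    # Columns in the read schema but absent from the file schema.
    missing = [name for name in read_schema if name not in file_columns]

    batch = {}
    for name in read_schema:
        if name in file_columns:
            batch[name] = list(file_columns[name])
        else:
            # Use the existence default if there is one, otherwise NULL (None).
            batch[name] = [exists_defaults.get(name)] * num_rows
    return batch, missing


batch, missing = read_with_exists_defaults(
    file_columns={"a": [1, 2, 3]},
    read_schema=["a", "b", "c"],
    exists_defaults={"b": 42},
)
```

In this example `b` is filled with its default for every row, while `c` has no default and stays NULL. The second case is the behavior every missing column gets when the metadata never reaches the reader.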
   
   I don't believe a single fix can cover all file formats. Spark itself 
handles this separately for each format:
   - Parquet - VectorizedParquetRecordReader fills missing columns with 
defaults at file-read time
   - ORC - OrcColumnarBatchReader uses 
ResolveDefaultColumns.existenceDefaultValues() for missing columns
   - CSV - UnivocityParser applies defaults using 
ResolveDefaultColumns.existenceDefaultValues(requiredSchema) on parse error
   - JSON - JacksonParser.applyExistenceDefaultValuesToRow() applies defaults 
after parsing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

