acvictor commented on issue #11494: URL: https://github.com/apache/incubator-gluten/issues/11494#issuecomment-3875668931
@baibaichen I have thought of a fix for Delta/Parquet with the Velox backend where we can pass EXISTS_DEFAULT metadata from Spark in Gluten through the Substrait layer to Velox and handle it in Velox's Parquet reader.

Spark's reader handles EXISTS_DEFAULT at the file-read level:
- When opening a Parquet file, it compares the physical file schema against the requested read schema
- Columns in the read schema but missing from the file are tracked in missingColumns
- For missing columns with EXISTS_DEFAULT metadata, it fills the column vector with the default value instead of NULLs
- Gluten/Velox bypasses this entirely because it doesn't receive the EXISTS_DEFAULT metadata

I don't believe it is possible to have a single fix that will cover all file formats. Spark itself handles this in a format-specific manner:
- Parquet - VectorizedParquetRecordReader fills missing columns with defaults at file-read time
- ORC - OrcColumnarBatchReader uses ResolveDefaultColumns.existenceDefaultValues() for missing columns
- CSV - UnivocityParser applies defaults using ResolveDefaultColumns.existenceDefaultValues(requiredSchema) on parse error
- JSON - JacksonParser.applyExistenceDefaultValuesToRow() applies defaults after parsing
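
To make the file-read behavior described above concrete, here is a minimal Python sketch of the idea. It is not Spark, Gluten, or Velox code. The `Field` class and the `missing_columns` and `fill_missing_columns` helpers are hypothetical names made up for illustration, and the default is treated as an already-evaluated value, whereas Spark stores it as a SQL expression string under the `EXISTS_DEFAULT` metadata key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Metadata key under which the existence default is stored on a field.
EXISTS_DEFAULT_KEY = "EXISTS_DEFAULT"


@dataclass
class Field:
    """Hypothetical stand-in for a read-schema field with metadata."""
    name: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def missing_columns(file_columns: Dict[str, List[Any]],
                    read_schema: List[Field]) -> List[str]:
    """Names requested by the read schema but absent from the file."""
    return [f.name for f in read_schema if f.name not in file_columns]


def fill_missing_columns(file_columns: Dict[str, List[Any]],
                         num_rows: int,
                         read_schema: List[Field]) -> Dict[str, List[Any]]:
    """Build the output batch in read-schema order.

    Columns present in the file are passed through unchanged. Missing
    columns are filled with the existence default when the field carries
    one, and with None (NULL) otherwise.
    """
    batch: Dict[str, List[Any]] = {}
    for f in read_schema:
        if f.name in file_columns:
            batch[f.name] = file_columns[f.name]
        elif EXISTS_DEFAULT_KEY in f.metadata:
            # Simplification: the default is used as a ready-made value.
            batch[f.name] = [f.metadata[EXISTS_DEFAULT_KEY]] * num_rows
        else:
            batch[f.name] = [None] * num_rows
    return batch


if __name__ == "__main__":
    # File written before "country" and "note" were added to the table.
    old_file = {"id": [1, 2, 3]}
    schema = [
        Field("id"),
        Field("country", {EXISTS_DEFAULT_KEY: "US"}),
        Field("note"),
    ]
    print(missing_columns(old_file, schema))
    print(fill_missing_columns(old_file, 3, schema))
```

The last bullet of the first list corresponds to the `else` branch being taken for every missing column: if the metadata never reaches the reader, it has no way to tell a column with a default apart from one without, so it can only emit NULLs.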
