[PR] Use Parquet schema for scan instead of Spark schema [datafusion-comet]

via GitHub Tue, 19 Nov 2024 13:08:35 -0800


mbutrovich opened a new pull request, #1103:
URL: https://github.com/apache/datafusion-comet/pull/1103


   Currently we take the scan schema from the plan node's scan schema, 
serialize it back to a Parquet schema, and then parse that on the native side. 
This round trip is lossy, particularly for timestamps. For example:
   
   ```
   schema: message root {
     optional int64 _0 (TIMESTAMP(MILLIS,true));
     optional int64 _1 (TIMESTAMP(MICROS,true));
     optional int64 _2 (TIMESTAMP(MILLIS,true));
     optional int64 _3 (TIMESTAMP(MILLIS,false));
     optional int64 _4 (TIMESTAMP(MICROS,true));
     optional int64 _5 (TIMESTAMP(MICROS,false));
     optional int64 _6 (INTEGER(64,true));
   }
   
   dataSchema: message spark_schema {
     optional int96 _0;
     optional int96 _1;
     optional int96 _2;
     optional int64 _3 (TIMESTAMP(MICROS,false));
     optional int96 _4;
     optional int64 _5 (TIMESTAMP(MICROS,false));
     optional int64 _6;
   }
   ```
   The former (`schema`) is the schema from the original Parquet footer; the 
latter (`dataSchema`) is what we get after going through Spark. We need the 
original to handle int96 correctly in ParquetExec.
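
   To make the loss concrete, here is a toy model of the round trip. It is only 
a sketch: `ParquetType`, `SparkType`, `to_spark`, and `to_parquet` are 
hypothetical names invented for illustration, not Spark or Comet APIs, and the 
mapping simply mirrors the two listings above.

   ```rust
   // Toy model only: these types and functions are hypothetical and exist
   // just to mirror the mapping visible in the two schema listings above.
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum Unit {
       Millis,
       Micros,
   }

   #[derive(Debug, Clone, Copy, PartialEq)]
   enum ParquetType {
       Int64Timestamp { unit: Unit, utc: bool },
       Int64, // the INTEGER(64,true) annotation is not modelled separately
       Int96,
   }

   #[derive(Debug, Clone, Copy, PartialEq)]
   enum SparkType {
       Timestamp,
       TimestampNtz,
       Long,
   }

   // Parquet -> Spark: the timestamp unit is dropped here.
   fn to_spark(t: ParquetType) -> SparkType {
       match t {
           ParquetType::Int64Timestamp { utc: true, .. } | ParquetType::Int96 => {
               SparkType::Timestamp
           }
           ParquetType::Int64Timestamp { utc: false, .. } => SparkType::TimestampNtz,
           ParquetType::Int64 => SparkType::Long,
       }
   }

   // Spark -> Parquet, following what the dataSchema listing shows.
   fn to_parquet(t: SparkType) -> ParquetType {
       match t {
           SparkType::Timestamp => ParquetType::Int96,
           SparkType::TimestampNtz => ParquetType::Int64Timestamp {
               unit: Unit::Micros,
               utc: false,
           },
           SparkType::Long => ParquetType::Int64,
       }
   }

   fn round_trip(t: ParquetType) -> ParquetType {
       to_parquet(to_spark(t))
   }
   ```

   In this model, `TIMESTAMP(MILLIS,true)` and `TIMESTAMP(MICROS,true)` both 
come back as int96, so the native side can no longer tell them apart or know 
that the file actually stores int64 values.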
   
   This PR extracts some code from elsewhere (CometParquetFileFormat, 
CometNativeScanExec) to read the footer from the Parquet file and serialize 
the original metadata. We also now generate the projection vector on the Spark 
side, because the required columns are in Spark schema format and so will not 
match the Parquet schema 1:1. On the native side, we now have to regenerate the 
required schema from the Parquet schema using the projection vector (converted 
to a DF ProjectionMask).
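
   The projection handoff described above can be sketched roughly as follows. 
This is a simplified illustration under stated assumptions, not the code in 
this PR: `Column`, `projection_vector`, and `project_schema` are hypothetical 
names, columns are matched by name only, and plain index vectors stand in for 
the real ProjectionMask.

   ```rust
   /// Hypothetical stand-in for one leaf column of the Parquet file schema.
   #[derive(Debug, Clone, PartialEq)]
   struct Column {
       name: String,
       physical_type: String,
   }

   /// "Spark side" in this sketch: map each required column name to its
   /// index in the file schema. Names missing from the file are skipped.
   fn projection_vector(file_schema: &[Column], required: &[&str]) -> Vec<usize> {
       required
           .iter()
           .filter_map(|name| file_schema.iter().position(|c| c.name == *name))
           .collect()
   }

   /// "Native side" in this sketch: rebuild the required schema by picking
   /// the file-schema columns at the projected indices, so the original
   /// Parquet types are preserved. Out-of-range indices are ignored.
   fn project_schema(file_schema: &[Column], projection: &[usize]) -> Vec<Column> {
       projection
           .iter()
           .filter_map(|&i| file_schema.get(i).cloned())
           .collect()
   }
   ```

   Because only indices cross the boundary, the required schema is rebuilt from 
the footer's own types rather than from types that went through Spark.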


