dongjoon-hyun opened a new pull request #24307: Spark 25407
URL: https://github.com/apache/spark/pull/24307
 
 
   ## What changes were proposed in this pull request?
   
   As part of schema clipping in `ParquetReadSupport.scala`, we add fields in 
the Catalyst requested schema which are missing from the Parquet file schema to 
the Parquet clipped schema. However, nested schema pruning requires we ignore 
unrequested field data when reading from a Parquet file. Therefore we pass two 
schema to `ParquetRecordMaterializer`: the schema of the file data we want to 
read and the schema of the rows we want to return. The reader is responsible 
for reconciling the differences between the two.
   
   Aside from checking whether schema pruning is enabled, there is an 
additional complication to constructing the Parquet requested schema. The 
manner in which Spark's two Parquet readers reconcile the differences between 
the Parquet requested schema and the Catalyst requested schema differ. Spark's 
vectorized reader does not (currently) support reading Parquet files with 
complex types in their schema. Further, it assumes that the Parquet requested 
schema includes all fields requested in the Catalyst requested schema. It 
includes logic in its read path to skip fields in the Parquet requested schema 
which are not present in the file.
   
   Spark's parquet-mr based reader supports reading Parquet files of any kind 
of complex schema, and it supports nested schema pruning as well. Unlike the 
vectorized reader, the parquet-mr reader requires that the Parquet requested 
schema include only those fields present in the underlying Parquet file's 
schema. Therefore, in the case where we use the parquet-mr reader we intersect 
the Parquet clipped schema with the Parquet file's schema to construct the 
Parquet requested schema that's set in the `ReadContext`.
   
   ## How was this patch tested?
   
   A previously ignored test case which exercises the failure scenario this PR 
addresses has been enabled.
   
   This closes #22880

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to