[GitHub] [spark] sunchao commented on pull request #39950: [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader

via GitHub Wed, 22 Feb 2023 09:47:22 -0800


sunchao commented on PR #39950:
URL: https://github.com/apache/spark/pull/39950#issuecomment-1440499024


   @yabola yes, we'll need to use `RangeMetadataFilter` (i.e., 
`HadoopReadOptions.builder().withRange()`) when we initially read the footer. 
This is possible since in places like `ParquetFileFormat` we already have a 
`PartitionedFile`, which is just a segment of a Parquet file with a `start` and 
`length`.
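
   To illustrate why passing the range at footer-read time is enough, here is a minimal, self-contained sketch of range-based row group selection. It is not parquet-mr or Spark code: `FooterRangeSketch`, `RowGroup`, and `selectForSplit` are hypothetical names, and the midpoint rule (a row group belongs to the split whose byte range contains its midpoint) is my understanding of how the range filter behaves, stated here as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of range-based footer filtering; not the parquet-mr implementation.
final class FooterRangeSketch {

    // Stand-in for the location of one row group inside a Parquet file.
    static final class RowGroup {
        final long startOffset;     // byte offset where the row group begins
        final long compressedSize;  // total compressed bytes of the row group

        RowGroup(long startOffset, long compressedSize) {
            this.startOffset = startOffset;
            this.compressedSize = compressedSize;
        }

        // Assumed rule: the midpoint decides which split owns the row group.
        long midpoint() {
            return startOffset + compressedSize / 2;
        }
    }

    // Keep only the row groups whose midpoint lies in [start, start + length),
    // mirroring the `start` and `length` carried by a file segment.
    static List<RowGroup> selectForSplit(List<RowGroup> all, long start, long length) {
        long end = start + length;
        List<RowGroup> kept = new ArrayList<>();
        for (RowGroup rg : all) {
            long mid = rg.midpoint();
            if (mid >= start && mid < end) {
                kept.add(rg);
            }
        }
        return kept;
    }
}
```

   Under this rule, adjacent segments of the same file never select the same row group, so the filtered footer from the first read already describes exactly the row groups a task will scan.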
   
   The only problem is that we need a new non-deprecated API from `parquet-mr` to 
support this use case. Personally, I think we can just use the deprecated 
[API](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L632)
 for now, and replace it after a new Parquet version is released.


