c21 commented on a change in pull request #34498:
URL: https://github.com/apache/spark/pull/34498#discussion_r744305297
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
##########
@@ -92,11 +92,6 @@ case class ParquetPartitionReaderFactory(
        ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
      } else {
        // For aggregate push down, we will get max/min/count from footer statistics.
-       // We want to read the footer for the whole file instead of reading multiple
-       // footers for every split of the file. Basically if the start (the beginning of)
-       // the offset in PartitionedFile is 0, we will read the footer. Otherwise, it means
-       // that we have already read footer for that file, so we will skip reading again.
-       if (file.start != 0) return null
Review comment:
> Quick question on: Existing unit test in FileSourceAggregatePushDownSuite.scala. How did the existing tests pass before this PR?
@HyukjinKwon - I think we are on the same page based on your latest comment, but just to spell it out in case anything is missing: before this PR, when a single file was split into multiple splits across multiple tasks, the logic here only processed a split of the file if `file.start == 0`. So only the first split of each file was processed, and every file was processed exactly once. That is the trick.
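A minimal sketch of the per-split guard described above (`SplitInfo` and `shouldReadFooter` are hypothetical names for illustration, not identifiers from the PR):

```scala
// Hypothetical stand-in for PartitionedFile: a file path plus the
// byte offset at which this split begins.
case class SplitInfo(path: String, start: Long)

// The pre-PR behavior: a split triggers a footer read only when it is
// the first split of its file (start == 0), so even when one file is
// split across several tasks, its footer is read exactly once.
def shouldReadFooter(split: SplitInfo): Boolean = split.start == 0

object FooterGuardDemo {
  def main(args: Array[String]): Unit = {
    val splits = Seq(
      SplitInfo("part-0001.parquet", 0L),   // first split: read footer
      SplitInfo("part-0001.parquet", 128L), // later split of same file: skip
      SplitInfo("part-0002.parquet", 0L)    // first split of another file: read
    )
    val footersRead = splits.count(shouldReadFooter)
    println(footersRead)
  }
}
```

Running the demo counts one footer read per distinct file (2 here), which is why each file's max/min/count statistics are collected only once.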
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]