c21 commented on a change in pull request #34498:
URL: https://github.com/apache/spark/pull/34498#discussion_r744305297
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala
##########
@@ -92,11 +92,6 @@ case class ParquetPartitionReaderFactory(
        ParquetFooterReader.readFooter(conf, filePath, SKIP_ROW_GROUPS)
      } else {
        // For aggregate push down, we will get max/min/count from footer statistics.
-       // We want to read the footer for the whole file instead of reading multiple
-       // footers for every split of the file. Basically if the start (the beginning of)
-       // the offset in PartitionedFile is 0, we will read the footer. Otherwise, it means
-       // that we have already read footer for that file, so we will skip reading again.
-       if (file.start != 0) return null
Review comment:
> Quick question on: Existing unit test in FileSourceAggregatePushDownSuite.scala. How did the existing tests pass before this PR?
@HyukjinKwon - I think we are on the same page based on your latest comment, but just to spell it out in case anything is missing: before this PR, when a single file was split into multiple splits across multiple tasks, the logic here only processed a split of the file if `file.start == 0`. So only the first split of each file was processed, and every file was processed exactly once. That is the trick.
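A minimal sketch of the per-split guard described above (`SplitInfo` and `shouldReadFooter` are hypothetical names for illustration, not identifiers from the PR):

```scala
// Hypothetical stand-in for PartitionedFile: a file path plus the
// byte offset at which this split begins.
case class SplitInfo(path: String, start: Long)

// The pre-PR behavior: a split triggers a footer read only when it is
// the first split of its file (start == 0), so even when one file is
// split across several tasks, its footer is read exactly once.
def shouldReadFooter(split: SplitInfo): Boolean = split.start == 0

object FooterGuardDemo {
  def main(args: Array[String]): Unit = {
    val splits = Seq(
      SplitInfo("part-0001.parquet", 0L),   // first split: read footer
      SplitInfo("part-0001.parquet", 128L), // later split of same file: skip
      SplitInfo("part-0002.parquet", 0L)    // first split of another file: read
    )
    val footersRead = splits.count(shouldReadFooter)
    println(footersRead)
  }
}
```

Running the demo counts one footer read per distinct file (2 here), which is why each file's max/min/count statistics are collected only once.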
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]