gszadovszky commented on a change in pull request #26804: 
[WIP][SPARK-26346][BUILD][SQL] Upgrade parquet to 1.11.0
URL: https://github.com/apache/spark/pull/26804#discussion_r363275673
 
 

 ##########
 File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
 ##########
 @@ -298,7 +298,7 @@ private void initializeInternal() throws IOException, 
UnsupportedOperationExcept
 
   private void checkEndOfRowGroup() throws IOException {
     if (rowsReturned != totalCountLoadedSoFar) return;
-    PageReadStore pages = reader.readNextRowGroup();
+    PageReadStore pages = reader.readNextFilteredRowGroup();
 
 Review comment:
   The changing the call `readNextRowGroup` to `readNextFilteredRowGroup` 
without any addition change in the page reading logic is not correct. See the 
related docs 
[here](https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L870-L873)
 and 
[here](https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-column/src/main/java/org/apache/parquet/column/impl/SynchronizingColumnReader.java#L32-L60).
 If pages are really skipped based on the column indexes you have to align the 
values otherwise the rows might be mixed up.
   PPD might have some performance penalty if no pages can be skipped because 
we need to read all the related column/offset indexes. In case of no filter is 
defined or the original `readNextRowGroup` method is called then the 
column/offset indexes are not read so no additional IO occurs. I am not sure if 
this is tightly related to the column index feature.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to