[GitHub] [spark] sunchao commented on a change in pull request #32753: [SPARK-34859][SQL] Handle column index when using vectorized Parquet reader

GitBox Mon, 21 Jun 2021 11:11:09 -0700


sunchao commented on a change in pull request #32753:
URL: https://github.com/apache/spark/pull/32753#discussion_r655599949




##########
File path: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -170,77 +170,135 @@ public int readInteger() {
    *  }
    */
   public void readBatch(

Review comment:
       Sorry @cloud-fan, I should've added more context in the PR description. 
Let me try to add it here and copy it there later.
   
   1. The column index filtering is largely implemented in parquet-mr (via 
classes such as `ColumnIndex` and `ColumnIndexFilter`), and the filtered 
Parquet pages are returned to Spark through the 
`ParquetFileReader.readNextFilteredRowGroup` and 
`ParquetFileReader.getFilteredRecordCount` APIs. Please see #31393 for the 
related changes in the vectorized reader path.
   2. Spark needs more work to handle misaligned Parquet pages returned from 
the parquet-mr side when there are multiple columns whose type widths 
differ (e.g., int and bigint). For this issue, @lxian already gave a pretty 
good description in 
[SPARK-34859](https://issues.apache.org/jira/browse/SPARK-34859). To support 
the case, Spark needs to leverage the API 
[`PageReadStore.getRowIndexes`](https://javadoc.io/doc/org.apache.parquet/parquet-column/latest/org/apache/parquet/column/page/PageReadStore.html),
 which returns the indexes of all rows (note the difference between rows and 
values: for flat schema there is no difference between the two, but for complex 
schema they're different) after filtering within a Parquet row group. In 
addition, because there are gaps between pages, we need to know the index 
of the first row in a page, so we can compare the indexes of values (rows) 
from a page with the row indexes mentioned above. This is provided by the 
`DataPage.getFirstRowIndex` method.
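
   To make the comparison in point 2 concrete, here is a rough sketch for a 
flat schema, where rows and values coincide. It is not the code in this PR and 
it does not call parquet-mr: the class `PageRowFilterSketch` and its method 
`offsetsToRead` are hypothetical names, the iterator stands in for what 
`PageReadStore.getRowIndexes` provides, and `firstRowIndex` stands in for the 
value from `DataPage.getFirstRowIndex`. It assumes the row indexes are 
ascending and that the pages of one column are visited in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

/** Hypothetical helper for illustration only; not Spark or parquet-mr code. */
final class PageRowFilterSketch {
  // Ascending indexes (within the row group) of the rows that survived filtering.
  private final PrimitiveIterator.OfLong rowIndexes;
  // Next selected row index not yet consumed, or -1 when the iterator is exhausted.
  private long next;

  PageRowFilterSketch(PrimitiveIterator.OfLong rowIndexes) {
    this.rowIndexes = rowIndexes;
    advance();
  }

  private void advance() {
    next = rowIndexes.hasNext() ? rowIndexes.nextLong() : -1;
  }

  /**
   * Returns the offsets, relative to the start of one page, of the values to read.
   * Every other value in the page would be skipped by the reader.
   *
   * @param firstRowIndex row-group index of the first row in the page
   * @param rowCount      number of rows in the page (equal to values for a flat schema)
   */
  List<Integer> offsetsToRead(long firstRowIndex, int rowCount) {
    long end = firstRowIndex + rowCount; // exclusive upper bound of this page
    List<Integer> offsets = new ArrayList<>();
    // Defensive: drop selected rows that lie before this page.
    while (next != -1 && next < firstRowIndex) {
      advance();
    }
    // Collect selected rows that fall inside [firstRowIndex, end).
    while (next != -1 && next < end) {
      offsets.add((int) (next - firstRowIndex));
      advance();
    }
    return offsets;
  }
}
```

   Each column would use its own instance, since page boundaries differ between 
columns. For example, with selected rows 4, 5, 6 and 22, a column whose first 
page covers rows 0-9 reads offsets 4, 5 and 6 from that page, while a column 
with pages covering rows 0-4 and 5-9 reads offset 4 from the first page and 
offsets 0 and 1 from the second. Both then read offset 2 from a later page 
whose first row index is 20, which is where the gap between pages matters.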




