sunchao commented on a change in pull request #32753:
URL: https://github.com/apache/spark/pull/32753#discussion_r655599949
##########
File path:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
##########
@@ -170,77 +170,135 @@ public int readInteger() {
* }
*/
public void readBatch(
Review comment:
Sorry @cloud-fan, I should've added more context to the PR description.
Let me add it here first and copy it over to the description later.
1. The column index filtering is largely implemented in parquet-mr (via
classes such as `ColumnIndex` and `ColumnIndexFilter`), and the filtered
Parquet pages are returned to Spark through the
`ParquetFileReader.readNextFilteredRowGroup` and
`ParquetFileReader.getFilteredRecordCount` APIs. Please see #31393 for the
related changes in the vectorized reader path.
2. Spark needs more work to handle misaligned Parquet pages returned from the
parquet-mr side, when there are multiple columns and their type widths
differ (e.g., int and bigint). For this issue, @lxian already gave a pretty
good description in
[SPARK-34859](https://issues.apache.org/jira/browse/SPARK-34859). To support
the case, Spark needs to leverage the API
[`PageReadStore.getRowIndexes`](https://javadoc.io/doc/org.apache.parquet/parquet-column/latest/org/apache/parquet/column/page/PageReadStore.html),
which returns the indexes of all rows (note the difference between rows and
values: for flat schema there is no difference between the two, but for complex
schema they're different) after filtering within a Parquet row group. In
addition, because there are gaps between pages, we need to know the index of
the first row in each page, so we can compare the indexes of values (rows)
from a page with the row indexes mentioned above. This is provided by the
`DataPage.getFirstRowIndex` method.
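
To make point 2 concrete, here is a minimal sketch of the index comparison, not the actual change in this PR. `PageRowSelection` and `offsetsToRead` are hypothetical names, and the `long[]` argument stands in for the row indexes that `PageReadStore.getRowIndexes` would provide. It assumes a flat schema, so rows and values coincide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is not Spark's implementation.
final class PageRowSelection {
  /**
   * Returns the offsets, relative to the start of a page, of the rows to read.
   *
   * @param rowIndexes    ascending indexes of the rows kept after filtering
   *                      within the row group (stand-in for the values
   *                      obtained via PageReadStore.getRowIndexes)
   * @param firstRowIndex index of the first row in the page (stand-in for
   *                      the value obtained via DataPage.getFirstRowIndex)
   * @param rowCount      number of rows stored in the page
   */
  static List<Integer> offsetsToRead(long[] rowIndexes, long firstRowIndex, int rowCount) {
    List<Integer> offsets = new ArrayList<>();
    long endExclusive = firstRowIndex + rowCount;
    for (long rowIndex : rowIndexes) {
      if (rowIndex >= endExclusive) {
        break; // the remaining rows belong to later pages
      }
      if (rowIndex >= firstRowIndex) {
        offsets.add((int) (rowIndex - firstRowIndex)); // row is in this page
      }
      // rows before firstRowIndex belong to earlier pages and are ignored
    }
    return offsets;
  }
}
```

For example, suppose rows 3, 4 and 5 survive filtering. An int column with pages covering rows 0-4 and 5-9 yields offsets `[3, 4]` and `[0]`, while a bigint column whose only returned page covers rows 3-5 yields `[0, 1, 2]`. Both columns end up with the same three rows even though their page boundaries differ.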