[ https://issues.apache.org/jira/browse/IMPALA-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837216#comment-16837216 ]
ASF subversion and git services commented on IMPALA-5843:
---------------------------------------------------------
Commit d423979866c737005882f54d157819e43897a5e8 in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d423979 ]
IMPALA-5843: Use page index in Parquet files to skip pages
This commit implements page filtering based on the Parquet page index.
The HdfsParquetScanner reads and evaluates the page index. First, we
determine the row ranges we are interested in. Then, based on those
row ranges, we determine the candidate pages for each column that we
are reading.
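
The two steps above can be sketched as follows. This is a minimal
illustration, not Impala's implementation: it assumes that each page
exposes min/max statistics and a first row index (as the Parquet page
index does), and the function names, the inclusive row-range
representation, and the simple range predicate are all hypothetical.

```python
from typing import List, Tuple

RowRange = Tuple[int, int]  # inclusive (first, last) row indexes in a row group


def page_row_range(first_rows: List[int], num_rows: int, page: int) -> RowRange:
    """Rows covered by a page, given the first row index of every page."""
    if page + 1 < len(first_rows):
        last = first_rows[page + 1] - 1
    else:
        last = num_rows - 1  # the last page runs to the end of the row group
    return (first_rows[page], last)


def matching_row_ranges(first_rows, mins, maxs, num_rows, lo, hi) -> List[RowRange]:
    """Row ranges of pages whose [min, max] may hold a value in [lo, hi]."""
    ranges: List[RowRange] = []
    for page, (mn, mx) in enumerate(zip(mins, maxs)):
        if mx < lo or mn > hi:
            continue  # statistics prove that no row in this page can match
        first, last = page_row_range(first_rows, num_rows, page)
        if ranges and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], last)  # merge with the adjacent range
        else:
            ranges.append((first, last))
    return ranges


def candidate_pages(first_rows, num_rows, row_ranges) -> List[int]:
    """Pages of a column that overlap at least one of the row ranges."""
    pages = []
    for page in range(len(first_rows)):
        first, last = page_row_range(first_rows, num_rows, page)
        if any(first <= r_last and r_first <= last for r_first, r_last in row_ranges):
            pages.append(page)
    return pages


# Column A: three pages of 100 rows; the predicate is "A BETWEEN 45 AND 60".
ranges = matching_row_ranges([0, 100, 200], [1, 50, 90], [40, 80, 120], 300, 45, 60)
# Column B has different page boundaries, so a different set of pages qualifies.
pages_b = candidate_pages([0, 80, 160, 240], 300, ranges)
```

Here the row ranges are computed from the column that carries the
predicate and then applied to every column that is read, which is why
the page boundaries of column B are looked up separately.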
We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the candidate pages, i.e. we don't read
the whole column chunk, but only fractions of it.
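
As a rough sketch of how such sub-ranges could be derived: assuming
each page location is known as a (file offset, compressed size) pair,
the byte extents of the candidate pages can be merged wherever the
pages are contiguous in the file. The function name and the tuple
layout are hypothetical and do not reflect Impala's ScanRange API.

```python
from typing import List, Tuple


def sub_ranges(page_locations: List[Tuple[int, int]],
               pages: List[int]) -> List[Tuple[int, int]]:
    """Merge the byte extents of the selected pages into (offset, length) pairs."""
    result: List[Tuple[int, int]] = []
    for page in sorted(pages):
        offset, size = page_locations[page]
        if result and result[-1][0] + result[-1][1] == offset:
            # This page starts where the previous sub-range ends: extend it.
            prev_offset, prev_len = result[-1]
            result[-1] = (prev_offset, prev_len + size)
        else:
            result.append((offset, size))
    return result


# Four pages stored back to back, starting at file offset 1000.
locations = [(1000, 200), (1200, 300), (1500, 100), (1600, 400)]
middle = sub_ranges(locations, [1, 2])
with_gap = sub_ranges(locations, [0, 2, 3])
```

With pages 1 and 2 selected, a single sub-range covers both; skipping
page 1 instead leaves a gap, so two sub-ranges are produced.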
Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
This means we need row-skipping logic when we read the data pages.
This logic is implemented in BaseScalarColumnReader and
ScalarColumnReader. Collection column readers know nothing about
page filtering.
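
To illustrate the idea of row skipping, the sketch below turns the row
ranges into a sequence of skip and read steps for a single data page.
It assumes sorted, non-overlapping, inclusive row ranges; the function
name and the plan representation are hypothetical and are not taken
from BaseScalarColumnReader or ScalarColumnReader.

```python
from typing import List, Tuple


def skip_read_plan(page_first: int, page_last: int,
                   row_ranges: List[Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return ("skip", n) / ("read", n) steps covering every row of the page."""
    plan: List[Tuple[str, int]] = []
    pos = page_first  # next row of the page that has no step assigned yet
    for r_first, r_last in row_ranges:
        first = max(r_first, pos)
        last = min(r_last, page_last)
        if first > last:
            continue  # this range does not intersect the rest of the page
        if first > pos:
            plan.append(("skip", first - pos))
        plan.append(("read", last - first + 1))
        pos = last + 1
    if pos <= page_last:
        plan.append(("skip", page_last - pos + 1))  # trailing rows are not needed
    return plan


# Two pages of one column, both overlapping the row range (100, 199).
plan_1 = skip_read_plan(80, 159, [(100, 199)])
plan_2 = skip_read_plan(160, 239, [(100, 199)])
```

The first page skips its leading 20 rows and the second skips its
trailing 40 rows, so both columns end up producing exactly the rows in
the range even though their page boundaries differ.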
Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.
Testing:
* added some unit tests for the row range and
page selection logic
* generated various Parquet files with Parquet-MR
* enabled page index writing and wrote selective queries against
tables written by Impala. Current tests are likely to use page
filtering transparently.
Performance:
* Measured locally, observed 3x to 20x speedup for selective queries.
The speedup was proportional to the number of IO operations that
need to be done.
* The TPCH benchmark didn't show a significant performance change.
This is not a surprise, since the data is not sorted in any useful
way. So the main goal was to not introduce a performance regression.
TODO:
* measure performance for remote reads
Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Use page index in Parquet files to skip pages
> ---------------------------------------------
>
> Key: IMPALA-5843
> URL: https://issues.apache.org/jira/browse/IMPALA-5843
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: Impala 2.10.0
> Reporter: Lars Volker
> Assignee: Zoltán Borók-Nagy
> Priority: Critical
> Labels: parquet, performance
>
> Once IMPALA-5842 has been resolved, we should skip pages based on the page
> index in Parquet files.