[ https://issues.apache.org/jira/browse/IMPALA-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817569#comment-17817569 ]
ASF subversion and git services commented on IMPALA-12631:
----------------------------------------------------------
Commit 13030f840a23e67c5e9923e8b1abab3b717c106a in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=13030f840 ]
IMPALA-12796: Add is_footer_only in TFileSplitGeneratorSpec
Several tests in test_scanners.py failed with wrong row counts on the S3
target filesystem after IMPALA-12631. The S3 filesystem does not have blocks,
so the planner produces TFileSplitGeneratorSpec instead of
TScanRangeLocationList, and IMPALA-12631 did not make the necessary
changes to TFileSplitGeneratorSpec. Meanwhile, it had already changed the
behavior of hdfs-parquet-scanner.cc: for each scan range, the new code
loops over file_metadata_.row_groups, while the old code took just one
entry of file_metadata_.row_groups after calling NextRowGroup().
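
The effect of that behavior change can be illustrated with a toy model. This is a simplified sketch, not Impala code: the function names, and the assumption that row group i is handled by scan range i modulo the number of ranges, are hypothetical and exist only to show why summing every row group per scan range inflates the count when a file is split into several ranges.

```python
def count_old(row_group_rows, num_ranges):
    """Old behavior (toy model): each scan range only counts the row
    groups assigned to it, so every row group is counted exactly once."""
    total = 0
    for range_idx in range(num_ranges):
        for rg_idx, rows in enumerate(row_group_rows):
            # Hypothetical assignment rule: row group i belongs to
            # scan range i % num_ranges.
            if rg_idx % num_ranges == range_idx:
                total += rows
    return total


def count_new(row_group_rows, num_ranges):
    """New behavior (toy model): every scan range loops over all row
    groups in the file metadata, whether or not it owns them."""
    total = 0
    for _ in range(num_ranges):
        total += sum(row_group_rows)
    return total


rows = [100, 200, 300]
single = count_new(rows, 1)   # one range per file: count is correct
split = count_new(rows, 3)    # three ranges per file: counted three times
baseline = count_old(rows, 3)
```

In this model the new loop is only correct when exactly one scan range per file performs the count, which is the condition the rest of this patch enforces.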
This patch addresses the issue by adding an is_footer_only field to
TFileSplitGeneratorSpec and scheduling accordingly in schedule.cc. It also
adds the field 'is_footer_scanner_' in hdfs-columnar-scanner.h to ensure that
the optimized count star is only applied to the footer range.
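
A minimal sketch of the two guards described above, under stated assumptions: the classes SplitSpec and ScanRange, the functions generate_ranges and optimized_count, and the FOOTER_SIZE constant are all hypothetical stand-ins, not the actual Thrift structures or scheduler code. Only the is_footer_only flag and the idea of a footer-only check come from this commit.

```python
from dataclasses import dataclass

FOOTER_SIZE = 100  # hypothetical footer length in bytes, for illustration


@dataclass
class SplitSpec:
    """Toy stand-in for a file split generator spec."""
    file_length: int
    max_block_size: int
    is_footer_only: bool


@dataclass
class ScanRange:
    offset: int
    length: int
    is_footer: bool  # True if this range ends at the end of the file


def generate_ranges(spec):
    """Scheduler side: emit only the footer range when is_footer_only is
    set, otherwise split the whole file into block-sized ranges."""
    if spec.is_footer_only:
        length = min(FOOTER_SIZE, spec.file_length)
        return [ScanRange(spec.file_length - length, length, True)]
    ranges = []
    offset = 0
    while offset < spec.file_length:
        length = min(spec.max_block_size, spec.file_length - offset)
        end = offset + length
        ranges.append(ScanRange(offset, length, end == spec.file_length))
        offset = end
    return ranges


def optimized_count(ranges, row_group_rows):
    """Scanner side: apply the count-star shortcut only for the footer
    range, so the file's rows are counted once."""
    total = 0
    for scan_range in ranges:
        if scan_range.is_footer:
            total += sum(row_group_rows)
    return total


footer_only = generate_ranges(SplitSpec(1000, 300, True))
full_split = generate_ranges(SplitSpec(1000, 300, False))
```

With either set of ranges, optimized_count adds the file's rows once, because only one range per file is flagged as the footer range.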
Testing:
- Pass core tests with S3 target filesystem.
Change-Id: Iaa6e3c14debe68cf601131c6594774c8c695923e
Reviewed-on: http://gerrit.cloudera.org:8080/21021
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Improve count star performance for parquet scans
> ------------------------------------------------
>
> Key: IMPALA-12631
> URL: https://issues.apache.org/jira/browse/IMPALA-12631
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: YifanZhang
> Assignee: YifanZhang
> Priority: Major
> Fix For: Impala 4.4.0
>
>
> The code in the backend function HdfsParquetScanner::GetNextInternal() is
> currently inefficient. We use row group statistics instead of file metadata
> statistics, which leads to unnecessary materialization overhead.