[ 
https://issues.apache.org/jira/browse/IMPALA-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17817569#comment-17817569
 ] 

ASF subversion and git services commented on IMPALA-12631:
----------------------------------------------------------

Commit 13030f840a23e67c5e9923e8b1abab3b717c106a in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=13030f840 ]

IMPALA-12796: Add is_footer_only in TFileSplitGeneratorSpec

Several tests in test_scanners.py failed with wrong row counts on the S3
target filesystem after IMPALA-12631. The S3 filesystem has no blocks, so
the planner produces a TFileSplitGeneratorSpec instead of a
TScanRangeLocationList, and IMPALA-12631 did not make the necessary
changes to TFileSplitGeneratorSpec. It did, however, change the behavior
of hdfs-parquet-scanner.cc: for each scan range, the new code loops over
file_metadata_.row_groups, while the old code took only one entry of
file_metadata_.row_groups after calling NextRowGroup().
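
To illustrate why looping over every row group in each scan range can
inflate the count, here is a toy model in Python. It is not Impala code:
the function names, the `owner` mapping, and the sample row counts are all
hypothetical, and it assumes each row group is assigned to exactly one
scan range.

```python
from typing import List


def count_owned_row_groups(row_groups: List[int], owner: List[int],
                           num_ranges: int) -> int:
    """Old-style model: each scan range counts only the row groups it owns.

    owner[i] is the (hypothetical) scan range id that row group i belongs to.
    """
    total = 0
    for range_id in range(num_ranges):
        for rows, rg_owner in zip(row_groups, owner):
            if rg_owner == range_id:
                total += rows
    return total


def count_all_row_groups_per_range(row_groups: List[int],
                                   num_ranges: int) -> int:
    """New-style model without a footer-only guard.

    Every scan range sums the row counts of the whole file, so the file is
    counted once per range.
    """
    total = 0
    for _ in range(num_ranges):
        total += sum(row_groups)
    return total


# Hypothetical file: three row groups, split into three scan ranges.
row_groups = [100, 200, 50]
owner = [0, 1, 2]

correct = count_owned_row_groups(row_groups, owner, num_ranges=3)
inflated = count_all_row_groups_per_range(row_groups, num_ranges=3)
print(correct, inflated)
```

In this model the second function is only correct when exactly one scan
range per file reaches it, which is the condition the patch enforces.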

This patch addresses the issue by adding an is_footer_only field to
TFileSplitGeneratorSpec and scheduling accordingly in schedule.cc. It also
adds the field 'is_footer_scanner_' to hdfs-columnar-scanner.h to check
that the optimized count star is applied only to the footer range.
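
The idea can be sketched with a small Python model. This is an
illustration under assumptions, not the actual scheduler or scanner code:
`Split`, `generate_splits`, and `optimized_count_star` are hypothetical
names, and the model assumes that a footer-only plan keeps just the last
split of the file and that a scanner is a "footer scanner" when its split
ends at the end of the file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Split:
    offset: int
    length: int


def generate_splits(file_length: int, max_split_length: int,
                    is_footer_only: bool) -> List[Split]:
    """Cut a file into fixed-size splits (hypothetical generator).

    When is_footer_only is set, keep only the split that ends at the end
    of the file, which is where a Parquet footer lives.
    """
    splits = []
    offset = 0
    while offset < file_length:
        length = min(max_split_length, file_length - offset)
        splits.append(Split(offset, length))
        offset += length
    return splits[-1:] if is_footer_only else splits


def optimized_count_star(row_group_rows: List[int], splits: List[Split],
                         file_length: int) -> int:
    """Sum footer row counts, but only in the scanner for the footer range."""
    total = 0
    for split in splits:
        is_footer_scanner = split.offset + split.length == file_length
        if is_footer_scanner:
            total += sum(row_group_rows)
    return total


# Hypothetical 1000-byte file with three row groups.
rows = [100, 200, 50]
all_splits = generate_splits(1000, 400, is_footer_only=False)
footer_splits = generate_splits(1000, 400, is_footer_only=True)
print(len(all_splits), len(footer_splits))
print(optimized_count_star(rows, all_splits, 1000))
print(optimized_count_star(rows, footer_splits, 1000))
```

With the guard in place, the count is the same whether the scanner sees
every split or only the footer split, so each file contributes its rows
exactly once.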

Testing:
- Pass core tests with S3 target filesystem.

Change-Id: Iaa6e3c14debe68cf601131c6594774c8c695923e
Reviewed-on: http://gerrit.cloudera.org:8080/21021
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Improve count star performance for parquet scans
> ------------------------------------------------
>
>                 Key: IMPALA-12631
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12631
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: YifanZhang
>            Assignee: YifanZhang
>            Priority: Major
>             Fix For: Impala 4.4.0
>
>
> The code in the backend function HdfsParquetScanner::GetNextInternal() is 
> currently inefficient. It uses row group statistics instead of file metadata 
> statistics, which leads to unnecessary materialization overhead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

