[
https://issues.apache.org/jira/browse/IMPALA-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690374#comment-17690374
]
ASF subversion and git services commented on IMPALA-11081:
----------------------------------------------------------
Commit ff7b5db6002ccb047262cd7118e2e11ab09ef40a in impala's branch
refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ff7b5db60 ]
IMPALA-11081: Fix incorrect results in partition key scan
This patch fixes incorrect results caused by short-circuit partition
key scan in the case where a Parquet/ORC file contains multiple
blocks.
IMPALA-8834 introduced an optimization that generates only one scan
range per file, corresponding to the file's first block. For
file-metadata-only queries, backends issue only footer ranges for
Parquet/ORC files (see HdfsScanner::IssueFooterRanges()), which leads
to incorrect results if the first block doesn't include the file footer.
This bug is fixed by returning a scan range corresponding to the last
block for Parquet/ORC files to make sure it contains a file footer.
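
The block-selection rule described above can be sketched as follows.
This is an illustration only, not the actual patch: Block, Format, and
blocksToScan are hypothetical names, and the real logic lives in
transformBlocksToScanRanges in HdfsScanNode.java. The sketch assumes
blocks are listed in file order and that the Parquet/ORC footer, which
is stored at the end of the file, falls within the last block.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionKeyScanSketch {
  /** Hypothetical stand-in for a file block: byte offset and length. */
  static final class Block {
    final long offset;
    final long length;
    Block(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  /** Hypothetical format enum; only the footer-bearing formats matter here. */
  enum Format { PARQUET, ORC, TEXT }

  static boolean hasFooter(Format format) {
    return format == Format.PARQUET || format == Format.ORC;
  }

  /**
   * Returns the blocks that would be turned into scan ranges.
   * A regular scan keeps every block. A partition key scan keeps one block:
   * the last one for footer-bearing formats, the first one otherwise.
   */
  static List<Block> blocksToScan(
      List<Block> blocks, Format format, boolean isPartitionKeyScan) {
    if (!isPartitionKeyScan || blocks.isEmpty()) return new ArrayList<>(blocks);
    int index = hasFooter(format) ? blocks.size() - 1 : 0;
    List<Block> result = new ArrayList<>();
    result.add(blocks.get(index));
    return result;
  }
}
```

For a three-block Parquet file, a partition key scan in this sketch keeps
only the third block, while a three-block text file keeps only the first.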
Testing:
- Added e2e tests to verify the fix.
Change-Id: I17331ed6c26a747e0509dcbaf427cd52808943b1
Reviewed-on: http://gerrit.cloudera.org:8080/19471
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Partition key scan optimization may return incorrect results when partition
> file have more than one block
> ---------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-11081
> URL: https://issues.apache.org/jira/browse/IMPALA-11081
> Project: IMPALA
> Issue Type: Bug
> Affects Versions: Impala 4.0.0
> Reporter: carolinchen
> Assignee: YifanZhang
> Priority: Critical
>
> IMPALA-8834 (https://issues.apache.org/jira/browse/IMPALA-8834) generates
> only one scan range per file for partition key scans, which can produce
> wrong results when a file has more than one block:
> # The planner transforms only the first block into a TScanRange, and that
> block does not include the footer.
> # The backend cannot find the split containing the footer, so it can
> neither parse the footer nor do the scan.
> As a result, the partition key scan returns incorrect results.
>
> see this snippet in HdfsScanNode.java:
>
> {code:java}
> private Pair<Boolean, Long> transformBlocksToScanRanges(
>     FeFsPartition partition, FileDescriptor fileDesc,
>     boolean fsHasBlocks, long scanRangeBytesLimit,
>     Analyzer analyzer) {
>   for (int i = 0; i < fileDesc.getNumFileBlocks(); ++i) {
>     // Only generate one scan range for partition key scans.
>     if (isPartitionKeyScan_) break;
>   }
> }{code}
> In the FE, when a file with more than one block is read by a partition key
> scan, the scan range produced by transformBlocksToScanRanges does not
> include the footer.
>
> see this snippet in hdfs-scanner.cc:
>
> {code:c++}
> /// Issue just the footer range for each file. This function is only used
> /// in parquet and orc scanners. We'll then parse the footer and pick out
> /// the columns we want.
> Status HdfsScanner::IssueFooterRanges(HdfsScanNodeBase* scan_node,
>     const THdfsFileFormat::type& file_type,
>     const std::vector<HdfsFileDesc*>& files) {
>   // Try to find the split with the footer.
>   ScanRange* footer_split = FindFooterSplit(files[i]);
> }{code}
> In the BE, when no footer split is found, no range is added for the file,
> so it is never scanned.
>
>