[ https://issues.apache.org/jira/browse/IMPALA-8834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690375#comment-17690375 ]

ASF subversion and git services commented on IMPALA-8834:
---------------------------------------------------------

Commit ff7b5db6002ccb047262cd7118e2e11ab09ef40a in impala's branch 
refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ff7b5db60 ]

IMPALA-11081: Fix incorrect results in partition key scan

This patch fixes incorrect results caused by short-circuit partition
key scan in the case where a Parquet/ORC file contains multiple
blocks.

IMPALA-8834 introduced an optimization that generates only one scan
range per file, corresponding to the file's first block. For
file-metadata-only queries, backends issue only footer ranges for
Parquet/ORC files (see HdfsScanner::IssueFooterRanges()), which leads
to incorrect results if the first block doesn't include the file
footer. This patch fixes the bug by returning the scan range that
corresponds to the last block for Parquet/ORC files, which ensures
that it contains the file footer.
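
The block-selection rule described above can be sketched roughly as
follows. This is a minimal Python illustration of the idea only, not
Impala's planner code (which this message does not show); the names
Block, FOOTER_FORMATS and pick_partition_key_scan_block are
hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumption for this sketch: formats that keep their metadata in a
# footer at the end of the file.
FOOTER_FORMATS = {"PARQUET", "ORC"}


@dataclass(frozen=True)
class Block:
    offset: int  # byte offset of the block within the file
    length: int  # block length in bytes


def pick_partition_key_scan_block(file_format: str, blocks: List[Block]) -> Block:
    """Pick the single block to scan for a partition key scan.

    Footer-based formats get the last block, since that is where the
    footer ends; every other format keeps using the first block.
    """
    if not blocks:
        raise ValueError("file has no blocks")
    ordered = sorted(blocks, key=lambda b: b.offset)
    if file_format.upper() in FOOTER_FORMATS:
        return ordered[-1]
    return ordered[0]
```

With a three-block Parquet file, the sketch selects the block with the
highest offset, while a text file with the same layout still gets its
first block.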

Testing:
- Added e2e tests to verify the fix.

Change-Id: I17331ed6c26a747e0509dcbaf427cd52808943b1
Reviewed-on: http://gerrit.cloudera.org:8080/19471
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Investigate enabling safe version of OPTIMIZE_PARTITION_KEY_SCANS by default
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-8834
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8834
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: perf
>             Fix For: Impala 4.0.0
>
>
> Just an idea I had while updating the docs. We already have the logic in the 
> planner to determine when a partition key scan has "distinct" semantics - 
> i.e. you obtain the correct results so long as the scan returns 0 rows for a 
> partition when there are no rows present, and at least one row when there is 
> a row present (but the exact number doesn't affect correctness).
> We could push this knowledge down into the scan nodes and have them terminate 
> early. This would be quite efficient if file handles and footers were already 
> cached and greatly reduce the number of rows flowing through the rest of the 
> plan.
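
The "distinct" semantics described in the issue can be illustrated
with a small sketch. This is a simplified Python model under stated
assumptions, not Impala's scan node implementation: partitions are
modelled as a hypothetical mapping from partition key to an iterable
of rows, and scan_partitions_distinct is an illustrative name.

```python
from typing import Dict, Iterable, Iterator, Tuple


def scan_partitions_distinct(
    partitions: Dict[str, Iterable[tuple]],
) -> Iterator[Tuple[str, tuple]]:
    """Yield at most one row per partition.

    A partition with no rows yields nothing; a partition with rows
    yields exactly one, and the rest of its rows are never read.
    """
    for key, rows in partitions.items():
        for row in rows:
            yield key, row
            break  # terminate early: one row is enough for this partition
```

Because the exact row count does not affect correctness under these
semantics, stopping after the first row of each partition preserves
the result while reducing the rows that flow through the rest of the
plan.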



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
