[ 
https://issues.apache.org/jira/browse/IMPALA-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421128#comment-17421128
 ] 

ASF subversion and git services commented on IMPALA-10894:
----------------------------------------------------------

Commit d7068ace15b5c7affe0812155f037789905ef74d in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7068ac ]

IMPALA-10894: Pushing down predicates in reading "original files" of ACID tables

ACID tables can have "original files" that lack the full ACID schema.
For instance, when a non-ACID table is upgraded to full ACID, the
original files are not rewritten, so they don't contain the ACID
columns, i.e. operation, originalTransaction, bucket, rowid, and
currentTransaction.

Apart from rowid, the other four columns can be calculated from the file
path. We calculate the rowid as the row index inside the file. This is
done by setting a first row id for the split; the OrcStructReader then
fills the rowid slot with values auto-incremented by one.

However, if we push down predicates into the ORC reader, some rows may
be skipped. The ORC lib guarantees that rows in a returned batch are
consecutive, but rows may be skipped between two consecutive batches.
So we can't simply auto-increment from the first row id to calculate
the row index. Instead, we should use orc::RowReader::getRowNumber() to
update the first row index of each returned batch.
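
The difference between the two approaches can be sketched as follows.
This is a minimal illustration, not the code from the patch:
FakeRowReader, Batch, RowIdsByAutoIncrement and RowIdsByRowNumber are
hypothetical names, and FakeRowReader only imitates the behaviour
described above (consecutive rows within a batch, possible gaps between
batches). Its getRowNumber() is assumed to follow the contract of
orc::RowReader::getRowNumber(), i.e. it returns the row number of the
first row of the most recently read batch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical description of one returned batch: a run of consecutive rows.
struct Batch {
  uint64_t first_row;  // index in the file of the first row of the batch
  uint64_t num_rows;   // number of consecutive rows in the batch
};

// Hypothetical stand-in for an ORC row reader with predicates pushed down.
// Rows inside a batch are consecutive, but batches may have gaps between them.
class FakeRowReader {
 public:
  explicit FakeRowReader(std::vector<Batch> batches)
      : batches_(std::move(batches)) {}

  // Reads the next batch and reports its size; returns false when exhausted.
  bool next(uint64_t* num_rows) {
    if (pos_ >= batches_.size()) return false;
    current_ = batches_[pos_++];
    *num_rows = current_.num_rows;
    return true;
  }

  // Assumed to mirror orc::RowReader::getRowNumber(): the row number of the
  // first row in the previously read batch.
  uint64_t getRowNumber() const { return current_.first_row; }

 private:
  std::vector<Batch> batches_;
  Batch current_{0, 0};
  std::size_t pos_ = 0;
};

// Old approach: start from the split's first row id and add one per row.
// This is only correct if no rows are skipped between batches.
std::vector<uint64_t> RowIdsByAutoIncrement(FakeRowReader reader,
                                            uint64_t first_row_id) {
  std::vector<uint64_t> ids;
  uint64_t next_id = first_row_id;
  uint64_t num_rows = 0;
  while (reader.next(&num_rows)) {
    for (uint64_t i = 0; i < num_rows; ++i) ids.push_back(next_id++);
  }
  return ids;
}

// New approach: re-read the first row index of every batch from the reader,
// then count up only within that batch.
std::vector<uint64_t> RowIdsByRowNumber(FakeRowReader reader) {
  std::vector<uint64_t> ids;
  uint64_t num_rows = 0;
  while (reader.next(&num_rows)) {
    const uint64_t batch_first_row = reader.getRowNumber();
    for (uint64_t i = 0; i < num_rows; ++i) ids.push_back(batch_first_row + i);
  }
  return ids;
}
```

For a reader that returns rows 0-1 and then rows 5-7 (rows 2-4 filtered
out), the auto-increment version yields 0, 1, 2, 3, 4, while the
getRowNumber() version yields 0, 1, 5, 6, 7, which matches the actual
row indexes in the file.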

This patch changes the row index initialization logic to use
orc::RowReader::getRowNumber(), and removes the branch that skips
pushing down predicates in such cases.

Tests:
 - Ran test_full_acid_original_files

Change-Id: I5bfdb624fcaf62ffa22f53025761b9dee3fe58a2
Reviewed-on: http://gerrit.cloudera.org:8080/17870
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Pushing down predicates in reading "original files" of ACID tables
> ------------------------------------------------------------------
>
>                 Key: IMPALA-10894
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10894
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>
> “Original files” don't store special ACID columns. We generate the row id by 
> using the row index of the file. The orc reader doesn't provide interfaces 
> for retrieving the row index of a row in the file. When predicates are pushed 
> down into the orc reader, the returned batch will skip some rows. So we can't 
> calculate the actual row index in file using its index in the batch.
> Currently we skip pushing down predicates in reading such files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
