[ https://issues.apache.org/jira/browse/IMPALA-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421128#comment-17421128 ]
ASF subversion and git services commented on IMPALA-10894:
----------------------------------------------------------
Commit d7068ace15b5c7affe0812155f037789905ef74d in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7068ac ]
IMPALA-10894: Pushing down predicates in reading "original files" of ACID tables

ACID tables can have "original files" that don't have the full ACID
schema. For instance, if we upgrade a non-ACID table to full ACID, the
original files are not rewritten, so they lack the ACID columns, i.e.
operation, originalTransaction, bucket, rowid, and currentTransaction.

Apart from rowid, the other four columns can be derived from the file
path. We compute rowid as the row's index inside the file: we set a
first row id for the split, and the OrcStructReader then fills the rowid
slot with values that increment by one per row.

However, if we push down predicates into the ORC reader, some rows may
be skipped. The ORC lib guarantees that the rows within a returned batch
are consecutive, but rows may be skipped between two successive batches.
So we can't simply keep incrementing from the split's first row id to
calculate the row index. Instead, we should use
orc::RowReader::getRowNumber() to update the first row index of each
returned batch.

This patch changes the row index initialization logic to use
orc::RowReader::getRowNumber(), and removes the branch that skipped
predicate pushdown in such cases.
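
To illustrate the difference between the two approaches, here is a
minimal, self-contained C++ sketch. It is not the Impala code: Batch,
RowIdsByAutoIncrement and RowIdsByRowNumber are hypothetical names, and
the first_row_in_file field merely stands in for the value that
orc::RowReader::getRowNumber() would report after each batch is read.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one batch returned by the ORC reader after
// predicate pushdown. first_row_in_file models the value that
// orc::RowReader::getRowNumber() would report for this batch.
struct Batch {
  uint64_t first_row_in_file;
  uint64_t num_rows;
};

// Old approach: one running counter seeded once per split. This is only
// correct when the reader never skips rows between batches.
std::vector<uint64_t> RowIdsByAutoIncrement(const std::vector<Batch>& batches,
                                            uint64_t first_row_id) {
  std::vector<uint64_t> ids;
  uint64_t next = first_row_id;
  for (const Batch& b : batches) {
    for (uint64_t i = 0; i < b.num_rows; ++i) ids.push_back(next++);
  }
  return ids;
}

// New approach: re-seed the counter at the start of every batch from the
// reader-reported row number, then increment within the batch, where rows
// are guaranteed to be consecutive.
std::vector<uint64_t> RowIdsByRowNumber(const std::vector<Batch>& batches) {
  std::vector<uint64_t> ids;
  for (const Batch& b : batches) {
    for (uint64_t i = 0; i < b.num_rows; ++i) {
      ids.push_back(b.first_row_in_file + i);
    }
  }
  return ids;
}
```

For two batches {0, 3} and {1000, 2}, i.e. rows 3..999 were filtered
out, the first function yields 0, 1, 2, 3, 4 while the second yields
0, 1, 2, 1000, 1001, which matches the rows' real positions in the file.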

Tests:
- Ran test_full_acid_original_files

Change-Id: I5bfdb624fcaf62ffa22f53025761b9dee3fe58a2
Reviewed-on: http://gerrit.cloudera.org:8080/17870
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Pushing down predicates in reading "original files" of ACID tables
> ------------------------------------------------------------------
>
> Key: IMPALA-10894
> URL: https://issues.apache.org/jira/browse/IMPALA-10894
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
>
> “Original files” don't store the special ACID columns, so we generate the
> row id from the row's index in the file. The ORC reader doesn't provide
> interfaces for retrieving the row index of a row in the file. When predicates
> are pushed down into the ORC reader, the returned batches skip some rows, so
> we can't calculate a row's actual index in the file from its index in the
> batch. Currently we skip pushing down predicates when reading such files.