[
https://issues.apache.org/jira/browse/IMPALA-10894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420433#comment-17420433
]
Quanlong Huang commented on IMPALA-10894:
-----------------------------------------
The ORC library actually provides an interface to retrieve the in-file row
number of the first row in the previously returned batch:
https://github.com/apache/orc/blob/rel/release-1.7.0/c++/include/orc/Reader.hh#L560
{code:cpp}
/**
 * Get the row number of the first row in the previously read batch.
 * @return the row number of the previous batch.
 */
virtual uint64_t getRowNumber() const = 0;
{code}
We can call orc::RowReader::next() to read the batch and then use
orc::RowReader::getRowNumber() to get the first row id of the batch.
The implementation of SearchArgument (predicate pushdown) ensures that rows in a
batch are consecutive, so the row id of the i-th row in a batch is
getRowNumber() + i:
https://github.com/apache/orc/blob/rel/release-1.7.0/c%2B%2B/src/Reader.cc#L1073-L1094
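A rough sketch of how the row ids could be derived from these two calls. To keep
it compilable without the ORC library, FakeRowReader and FakeBatch are
hypothetical stand-ins that only mimic orc::RowReader::next(),
orc::RowReader::getRowNumber() and the numElements field of
orc::ColumnVectorBatch. CollectRowIds is also a made-up helper name, not
existing Impala code.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for orc::ColumnVectorBatch: only the row count is modelled.
struct FakeBatch {
  uint64_t numElements = 0;
};

// Hypothetical stand-in for orc::RowReader. Each pair describes one batch:
// (row number in file of the first row, number of rows in the batch).
class FakeRowReader {
 public:
  explicit FakeRowReader(std::vector<std::pair<uint64_t, uint64_t>> batches)
      : batches_(std::move(batches)) {}

  // Mimics orc::RowReader::next(): fills the batch, returns false at end of file.
  bool next(FakeBatch& batch) {
    if (next_ >= batches_.size()) return false;
    current_ = next_++;
    batch.numElements = batches_[current_].second;
    return true;
  }

  // Mimics orc::RowReader::getRowNumber(): first row of the previously read batch.
  uint64_t getRowNumber() const { return batches_[current_].first; }

 private:
  std::vector<std::pair<uint64_t, uint64_t>> batches_;
  std::size_t next_ = 0;
  std::size_t current_ = 0;
};

// Assumes rows inside one batch are consecutive in the file, so the row id of
// the i-th row is the batch's first row number plus i.
template <typename Reader, typename Batch>
std::vector<uint64_t> CollectRowIds(Reader& reader, Batch& batch) {
  std::vector<uint64_t> row_ids;
  while (reader.next(batch)) {
    const uint64_t first_row = reader.getRowNumber();
    for (uint64_t i = 0; i < batch.numElements; ++i) {
      row_ids.push_back(first_row + i);
    }
  }
  return row_ids;
}
```

With a real reader, the same loop would presumably take an orc::RowReader and an
orc::ColumnVectorBatch. Batches may start at non-adjacent row numbers when row
groups are skipped by the pushed-down predicate, which is why the first row
number is re-read after every next() call instead of being accumulated from
batch sizes.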
> Pushing down predicates in reading "original files" of ACID tables
> ------------------------------------------------------------------
>
> Key: IMPALA-10894
> URL: https://issues.apache.org/jira/browse/IMPALA-10894
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
>
> “Original files” don't store special ACID columns. We generate the row id by
> using the row index of the file. The orc reader doesn't provide interfaces
> for retrieving the row index of a row in the file. When predicates are pushed
> down into the orc reader, the returned batch will skip some rows. So we can't
> calculate the actual row index in file using its index in the batch.
> Currently we skip pushing down predicates in reading such files.