[
https://issues.apache.org/jira/browse/ARROW-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396886#comment-17396886
]
Weston Pace commented on ARROW-13518:
-------------------------------------
ARROW-13599 is somewhat related.
> Identify selected row when using filters
> ----------------------------------------
>
> Key: ARROW-13518
> URL: https://issues.apache.org/jira/browse/ARROW-13518
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Parquet, Python
> Reporter: Yair Lenga
> Priority: Major
>
> I created a proposed enhancement to speed up reading of specific rows:
> ARROW-13517 https://issues.apache.org/jira/browse/ARROW-13517
> This issue proposes extending the functions that accept filters, such as parquet.read_table
> ([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table]),
> to support returning the actual row numbers of the selected rows (e.g., row_group and row_index).
> Combined with the proposed enhancement in ARROW-13517, this can provide faster reading of the
> data (e.g., by caching the returned indices and reading the full data only when
> needed).
> The proposed implementation is to add two pseudo columns (plus an optional third), which can be
> requested in the columns list, e.g., columns=['$row_group', '$row_index',
> 'dealid', …] or similar.
> * $row_group - 0-based row group index
> * $row_index - 0-based position within the row group
> * $row_file_index - 0-based position in the file (not critical, since it can be
> constructed from the other two)
>
> I am not sure whether this requires a change to the C++ interface or only to the Python
> part of pyarrow.
>