[ 
https://issues.apache.org/jira/browse/ARROW-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396886#comment-17396886
 ] 

Weston Pace commented on ARROW-13518:
-------------------------------------

ARROW-13599 is somewhat related.

> Identify selected row when using filters
> ----------------------------------------
>
>                 Key: ARROW-13518
>                 URL: https://issues.apache.org/jira/browse/ARROW-13518
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Yair Lenga
>            Priority: Major
>
> I created a proposed enhancement to speed up reading of specific rows, 
> ARROW-13517 (https://issues.apache.org/jira/browse/ARROW-13517). 
> This issue proposes extending the functions that support filters, such as parquet.read_table 
> ([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table]),
> to support returning actual row numbers (e.g., row_group and row_index). 
> With the proposed enhancement, this can provide faster reading of the 
> data (e.g. by caching the returned indices and reading the full data when 
> needed). 
> The proposed implementation is to add 2 pseudo columns, which can be 
> requested in the columns list, e.g. columns=['$row_group', '$row_index', 
> 'dealid', …] or similar.
>  * $row_group - 0-based row group index
>  * $row_index - 0-based position within the row group
>  * $row_file_index - 0-based position in the file (not critical); can be 
> constructed from the other two
>  
> Not sure if this requires a change to the C++ interface, or just to the Python 
> part of pyarrow.
>  
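
The description notes that $row_file_index can be constructed from
$row_group and $row_index. A minimal sketch of that arithmetic follows,
assuming the number of rows in each row group is already known (in pyarrow
it could be read from the Parquet file metadata). The function names are
hypothetical and are not part of pyarrow or of the proposal.

```python
from bisect import bisect_right
from itertools import accumulate


def row_group_offsets(row_group_sizes):
    """Return the 0-based file position of the first row of each row group."""
    return [0, *accumulate(row_group_sizes)][:-1]


def to_file_index(row_group, row_index, row_group_sizes):
    """Combine (row_group, row_index) into a 0-based position in the file."""
    if not 0 <= row_index < row_group_sizes[row_group]:
        raise IndexError("row_index is outside the row group")
    return row_group_offsets(row_group_sizes)[row_group] + row_index


def from_file_index(file_index, row_group_sizes):
    """Split a 0-based file position back into (row_group, row_index)."""
    if not 0 <= file_index < sum(row_group_sizes):
        raise IndexError("file_index is outside the file")
    offsets = row_group_offsets(row_group_sizes)
    # Last row group whose first row is at or before file_index.
    row_group = bisect_right(offsets, file_index) - 1
    return row_group, file_index - offsets[row_group]


# Hypothetical file with three row groups of 3, 2 and 4 rows.
sizes = [3, 2, 4]
file_index = to_file_index(1, 1, sizes)
pair = from_file_index(file_index, sizes)
```

Because the mapping is reversible, caching either form of the index would
carry the same information, which is consistent with the description
calling $row_file_index "not critical".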



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
