[jira] [Updated] (ARROW-13518) Identify selected row when using filters

Yair Lenga (Jira) Sun, 01 Aug 2021 06:05:07 -0700


     [ https://issues.apache.org/jira/browse/ARROW-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yair Lenga updated ARROW-13518:
-------------------------------
    Description: 
I created a proposed enhancement to speed up reading of specific rows: 
ARROW-13517 https://issues.apache.org/jira/browse/ARROW-13517

This issue proposes extending the functions that support filters, such as 
parquet.read_table 
([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table]),
 so that they can return the actual row numbers of the selected rows (e.g., 
row_group and row_index). 

Combined with the enhancement proposed in ARROW-13517, this can provide faster 
reading of the data (e.g. by caching the returned indices and reading the full 
data only when needed). 

The proposed implementation is to add 2 pseudo columns, plus an optional 
third, which can be requested in the columns list, e.g. 
columns=['$row_group', '$row_index', 'dealid', …] or similar.
 * $row_group - 0-based row group index
 * $row_index - 0-based position within the row group
 * $row_file_index - 0-based position in the file (not critical; can be 
constructed from the other two)

 

Not sure if this requires a change to the C++ interface, or just to the Python 
part of pyarrow.

 

  was:
I created a proposed enhancement to speed up reading of specific rows 
arrow-13517 [https://issues.apache.org/jira/browse/ARROW-13517]

proposing extending the functions that provides filter parquet.read_table 
([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table])
 to support returning actual row numbers (e.g, row_group and row_index). 

with the proposed enhancement, this can provide for faster reading of the data 
(e.g. by caching the return indices, and reading the full data when needed). 

proposed implementation will be to add 2 pseudo columns, which can be requested 
in the columns list. E.g., columns=[ ‘$row_group’, ‘$row_index’, ‘dealid’, …] 
or similar.

 

not sure if this requires change to the c++ interface, or just to the python 
part of pyarrow.

 


> Identify selected row when using filters
> ----------------------------------------
>
>                 Key: ARROW-13518
>                 URL: https://issues.apache.org/jira/browse/ARROW-13518
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Yair Lenga
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
