[ https://issues.apache.org/jira/browse/ARROW-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393784#comment-17393784 ]

Yair Lenga commented on ARROW-13517:
------------------------------------

Thanks for taking the time to provide good feedback.

You are correct that there is something "wrong" with my local box. I suspect 
that I am running out of actual memory (low-end free AWS instance), resulting 
in actual I/O and swapping, whereas S3 Select is not short on resources.

Running on a "fresh" instance made the processing noticeably better.

Thanks again for your patience. Yair

> Selective reading of rows for parquet file
> ------------------------------------------
>
>                 Key: ARROW-13517
>                 URL: https://issues.apache.org/jira/browse/ARROW-13517
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Parquet, Python
>            Reporter: Yair Lenga
>            Priority: Major
>
> The current interface for selective reading is to use *filters* 
> [https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
> (a minimal example is sketched below). The approach works well when the 
> filters are simple (field in (v1, v2, v3, …)) and when the number of columns 
> is small. It does not work well under the following conditions, which 
> currently require reading the complete data set into (Python) memory:
>  * when the condition is complex (e.g. a condition between attributes: field1 + 
> field2 > field3)
>  * when the file has many columns (making it costly to create Python structures).
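> For reference, here is a minimal sketch of the existing filters interface; 
> the file path and field names are hypothetical:
> {code:python}
> import pyarrow.parquet as pq
>
> # Simple value-based filter, expressed as a DNF list of tuples; Arrow can
> # use it to skip non-matching partitions and row groups.
> dataset = pq.ParquetDataset("data/", filters=[("field", "in", ["v1", "v2", "v3"])])
> table = dataset.read()
> {code}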
> I have a repository with a large number of parquet files (thousands of files, 500 
> MB each, 200 columns), where specific records have to be selected quickly 
> based on a logical condition that does not fit the filter interface. Very small 
> numbers of rows (<500) have to be returned.
> The proposed feature is to extend read_row_group to support passing an array of 
> rows to read (a list of integers in ascending order):
> {code:python}
> pq = pyarrow.parquet.ParquetFile(…)
> dd = pq.read_row_group(…, rows=[5, 35, …])
> {code}
> Using this method will enable complex filtering in two stages, eliminating 
> the need to read all rows into memory (a workaround sketch using the current 
> API follows the list):
>  # First pass: read only the attributes needed for filtering and collect the 
> row numbers that match the (complex) condition.
>  # Second pass: create a Python table containing the matching rows using the 
> proposed rows= parameter to read_row_group.
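> Until such a rows= parameter exists, the two-pass idea can be approximated 
> with the current Python API by re-reading the matching row groups and 
> selecting rows with Table.take. A minimal sketch, assuming a hypothetical 
> file name and predicate columns field1/field2/field3:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("data.parquet")
> hits = []  # (row group index, indices of matching rows within that group)
>
> # First pass: read only the predicate columns of each row group.
> for rg in range(pf.num_row_groups):
>     t = pf.read_row_group(rg, columns=["field1", "field2", "field3"])
>     mask = pc.greater(pc.add(t["field1"], t["field2"]), t["field3"])
>     idx = pc.indices_nonzero(mask)
>     if len(idx) > 0:
>         hits.append((rg, idx))
>
> # Second pass: re-read only the row groups that had matches, this time with
> # all columns, and keep just the matching rows.
> parts = [pf.read_row_group(rg).take(idx) for rg, idx in hits]
> result = pa.concat_tables(parts)
> {code}
> Note that the second pass still materializes each matching row group in full 
> before take() is applied, which is exactly the cost the proposed rows= 
> parameter would avoid.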
> I believe it is possible to achieve something similar using the C++ stream_reader 
> ([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]),
>  which is not exposed to Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
