[
https://issues.apache.org/jira/browse/ARROW-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393784#comment-17393784
]
Yair Lenga commented on ARROW-13517:
------------------------------------
Thanks for taking the time to provide good feedback.
You are correct that there is something "wrong" with my local box. I suspect
that I am running out of actual memory (low-end free AWS instance), resulting
in real I/O from swapping, whereas the S3 Select path is not short on resources.
Running on a "fresh" instance made the processing noticeably faster.
Thanks again for your patience. Yair
> Selective reading of rows for parquet file
> ------------------------------------------
>
> Key: ARROW-13517
> URL: https://issues.apache.org/jira/browse/ARROW-13517
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Parquet, Python
> Reporter: Yair Lenga
> Priority: Major
>
> The current interface for selective reading is to use *filters*
> [https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
> The approach works well when the filters are simple (e.g. field in (v1, v2,
> v3, …)) and the number of columns is small (see the sketch after the list
> below). It does not work well under the following conditions, which currently
> require reading the complete data set into (Python) memory:
> * when the condition is complex (e.g. a condition between attributes: field1 +
> field2 > field3)
> * when the file has many columns (making it costly to create Python structures)
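> For reference, a minimal sketch of the simple case that filters= already
> handles well (path and field names are hypothetical):
> {code:python}
> import pyarrow.parquet as pq
>
> # Simple predicates can be pushed down: row groups whose statistics rule
> # out the requested values are never read.
> dataset = pq.ParquetDataset("data/", filters=[("field", "in", ["v1", "v2", "v3"])])
> table = dataset.read()
> {code}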
> I have a repository with a large number of Parquet files (thousands of files,
> 500 MB each, 200 columns) from which specific records must be selected quickly
> based on a logical condition that does not fit the filter interface. Only a
> very small number of rows (<500) has to be returned.
> The proposed feature is to extend read_row_group to accept an array of rows to
> read (a list of integers in ascending order):
> {code:python}
> pq = pyarrow.parquet.ParquetFile(…)
> dd = pq.read_row_group(…, rows=[5, 35, …])
> {code}
> Using this method would enable complex filtering in two stages, eliminating
> the need to read all rows into memory (see the sketch after this list):
> # First pass: read only the attributes needed for filtering and collect the
> row numbers that match the (complex) condition.
> # Second pass: build a Python table containing just the matching rows, using
> the proposed rows= parameter of read_row_group.
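> A minimal sketch of the two-pass idea using only the existing pyarrow API
> (file path, field names, and the condition are hypothetical). Today the
> second pass still has to materialize the whole row group before filtering,
> which is exactly what the proposed rows= parameter would avoid:
> {code:python}
> import pyarrow as pa
> import pyarrow.compute as pc
> import pyarrow.parquet as pq
>
> pf = pq.ParquetFile("data.parquet")  # hypothetical file
> selected = []
> for i in range(pf.num_row_groups):
>     # First pass: read only the columns needed to evaluate the condition.
>     cols = pf.read_row_group(i, columns=["field1", "field2", "field3"])
>     mask = pc.greater(pc.add(cols["field1"], cols["field2"]), cols["field3"])
>     if pc.any(mask).as_py():
>         # Second pass: re-read the full row group and keep matching rows.
>         # With the proposed rows= parameter, only those rows would be read.
>         selected.append(pf.read_row_group(i).filter(mask))
>
> result = pa.concat_tables(selected) if selected else None
> {code}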
> I believe it is possible to achieve something similar with the C++
> stream_reader
> ([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]),
> which is not exposed to Python.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)