Yair Lenga created ARROW-13517:
----------------------------------
Summary: Selective reading of rows for parquet file
Key: ARROW-13517
URL: https://issues.apache.org/jira/browse/ARROW-13517
Project: Apache Arrow
Issue Type: New Feature
Components: C++, Parquet, Python
Reporter: Yair Lenga
The current interface for selective reading is to use *filters*
[https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
This approach works well when the filters are simple (field in (v1, v2, v3, …))
and when the number of columns is small. It does not work well under the
following conditions, which currently require reading the complete data set into
(Python) memory:
* when the condition is complex (e.g. a condition between attributes: field1 +
field2 > field3); see the sketch after this list
* when the file has many columns (making it costly to create the Python structures)
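For illustration, a minimal sketch of the current filters API and where it falls
short; the file name and column names here are hypothetical:
{code:python}
import pyarrow.parquet as pq

# Simple membership predicates are supported by the current filters API:
table = pq.read_table("data.parquet", filters=[("field", "in", [1, 2, 3])])

# A row-level condition between columns, e.g. field1 + field2 > field3,
# cannot be expressed as a filter, so today the whole file (or row group)
# must be materialized before the condition can be evaluated in Python.
{code}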
I have a repository with a large number of parquet files (thousands of files, 500
MB each, 200 columns) from which specific records have to be selected quickly
based on a logical condition that does not fit the filter syntax. Only a very
small number of rows (<500) has to be returned.
The proposed feature is to extend read_row_group to support passing an array
of rows to read (a list of integers in ascending order).
{code:python}
pq = pyarrow.parquet.ParquetFile(…)
dd = pq.read_row_group(…, rows=[5, 35, …]){code}
Using this method would enable complex filtering in two stages, eliminating
the need to read all rows into memory:
# First pass - read only the attributes needed for filtering, and collect the
row numbers that match the (complex) condition.
# Second pass - create a Python table with the matching rows using the proposed
rows= parameter of read_row_group.
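A rough sketch of the two passes, assuming the proposed rows= parameter existed
(the file name, column names, and condition are illustrative):
{code:python}
import pyarrow.compute as pc
import pyarrow.parquet

pf = pyarrow.parquet.ParquetFile("data.parquet")

for rg in range(pf.num_row_groups):
    # First pass: read only the columns needed to evaluate the condition.
    cols = pf.read_row_group(rg, columns=["field1", "field2", "field3"])
    mask = pc.greater(pc.add(cols["field1"], cols["field2"]), cols["field3"])
    rows = [i for i, keep in enumerate(mask.to_pylist()) if keep]
    if not rows:
        continue
    # Second pass: fetch all columns, but only for the matching rows,
    # using the proposed (not yet existing) rows= parameter.
    matches = pf.read_row_group(rg, rows=rows)
{code}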
I believe it is possible to achieve something similar using the C++ stream_reader
([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]),
which is not exposed to Python.