[
https://issues.apache.org/jira/browse/ARROW-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391771#comment-17391771
]
Weston Pace commented on ARROW-13517:
-------------------------------------
> when the condition is complex (e.g. a condition between attributes: field1 + field2
> > field3)
This is currently possible using the new datasets API and expressions:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
# Write a small example table to Parquet.
table = pa.Table.from_pydict({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [-2, -2, -2, -2, -2, -2],
    'z': [-10, 0, 0, -10, 0, 0],
})
pq.write_table(table, '/tmp/foo.parquet')

# The filter expression can compare columns against each other.
ds = pads.dataset('/tmp/foo.parquet')
table2 = ds.to_table(filter=(pads.field('x') + pads.field('y')) > pads.field('z'))
print(table2.to_pydict())
{code}
> When the file has many columns (making it costly to create Python structures).
It's not clear to me what is expensive here. You mention there are 200 columns;
both the old and new approaches should handle the metadata for 200 columns in
Python without trouble.
> I believe it is possible to achieve something similar using the C++ stream_reader
You might be able to get some benefit in some situations by skipping data, but
in general this is a difficult problem with Parquet. For example, compressed
pages typically need to be read fully into memory and can't support any kind of
skipping. Furthermore, many pages use run-length encoding, which makes it
impossible to seek to a particular value. As a general rule of thumb, you cannot
do a partial read of a Parquet page: if a page has 100k rows and you only want
10 of them, then even with the indices in hand you typically need to read the
whole page into memory first.
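That said, skipping at row group granularity is practical with the existing
Python API. Here is a rough two-pass sketch (assuming the same example file and
the x + y > z condition from above; this is not the proposed rows= API): the
first pass reads only the columns needed to evaluate the condition, and the
second pass reads a full row group only if it contains a match.
{code:python}
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile('/tmp/foo.parquet')  # placeholder path
matches = []
for i in range(pf.num_row_groups):
    # First pass: read only the columns needed to evaluate the condition.
    cols = pf.read_row_group(i, columns=['x', 'y', 'z'])
    mask = pc.greater(pc.add(cols['x'], cols['y']), cols['z'])
    if pc.any(mask).as_py():
        # Second pass: read the full row group and keep only the matching rows.
        matches.append(pf.read_row_group(i).filter(mask))

result = pa.concat_tables(matches) if matches else None
print(result.to_pydict() if result else 'no matches')
{code}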
> Selective reading of rows for parquet file
> ------------------------------------------
>
> Key: ARROW-13517
> URL: https://issues.apache.org/jira/browse/ARROW-13517
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Parquet, Python
> Reporter: Yair Lenga
> Priority: Major
>
> The current interface for selective reading is to use *filters*
> [https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html]
> The approach works well when the filters are simple (e.g. field in (v1, v2, v3,
> …)) and when the number of columns is small. It does not work well for the
> following conditions, which currently require reading the complete set into
> (Python) memory:
> * when the condition is complex (e.g. a condition between attributes: field1 +
> field2 > field3)
> * when the file has many columns (making it costly to create Python structures).
> I have a repository of a large number of Parquet files (thousands of files, 500
> MB each, 200 columns), where specific records have to be selected quickly
> based on a logical condition that cannot be expressed as a simple filter. A very
> small number of rows (<500) has to be returned.
> The proposed feature is to extend read_row_group to support passing an array of
> rows to read (a list of integers in ascending order).
> {code:python}
> pq = pyarrow.parquet.ParquetFile(…)
> dd = pq.read_row_group(…, rows=[5, 35, …])
> {code}
> Using this method would enable complex filtering in two stages, eliminating
> the need to read all rows into memory:
> # First pass - read the attributes needed for filtering and collect the row
> numbers that match the (complex) condition.
> # Second pass - create a Python table with the matching rows using the proposed
> rows= parameter to read_row_group.
> I believe it is possible to achieve something similar using the C++ stream_reader
> ([https://github.com/apache/arrow/blob/master/cpp/src/parquet/stream_reader.cc]),
> which is not exposed to Python.