[ 
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298151#comment-17298151
 ] 

Ben Kietzman commented on ARROW-7224:
-------------------------------------

[~andydoug] the metadata you mentioned makes me think of 
[pyarrow.dataset.parquet_dataset|https://arrow.apache.org/docs/python/dataset.html#working-with-parquet-datasets],
 which allows construction of a dataset from a single metadata-only parquet 
file. When such a _metadata file can be written, filters will be applied not 
only to partition keys but also to row group statistics, without any IO; might 
be worth a look.
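A minimal local sketch of that workflow (toy data and file names are illustrative; the key pieces are {{FileMetaData.set_file_path}}, {{pq.write_metadata}} with a {{metadata_collector}}, and {{ds.parquet_dataset}}):

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

root = tempfile.mkdtemp()
collector = []
for i, years in enumerate([(2019, 2019), (2020,)]):
    path = f'part-{i}.parquet'  # hypothetical file names
    tbl = pa.table({'year': list(years), 'value': [float(y) for y in years]})
    pq.write_table(tbl, os.path.join(root, path))
    md = pq.read_metadata(os.path.join(root, path))
    md.set_file_path(path)  # paths in _metadata are relative to the dataset root
    collector.append(md)

# Consolidate the row-group statistics of every file into one _metadata file.
pq.write_metadata(tbl.schema, os.path.join(root, '_metadata'),
                  metadata_collector=collector)

# Build the dataset from _metadata alone: no directory listing is needed,
# and filters can prune row groups using the collected statistics.
dataset = ds.parquet_dataset(os.path.join(root, '_metadata'))
table = dataset.to_table(filter=ds.field('year') == 2020)
{code}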

The fundamental problem I see is a mismatch in how {{pq.ParquetDataset}} and 
{{pyarrow.dataset.*}} consider filters: the former considers a filter during 
construction while the latter is intended to handle multiple reads with 
differing filters. In the latter case, the dataset is intended to _encapsulate_ 
the mapping between partitions and files:

{code:python}
import pyarrow.dataset as ds
dataset = ds.dataset('s3://bucket-name/dataset-name', format='parquet')

# note the parentheses: `&` binds tighter than `==` in Python
kindness_day_filter = (
    (ds.field('year') == 2019)
    & (ds.field('month') == 11)
    & (ds.field('day') == 13)
)
kindness_day_table = dataset.to_table(filter=kindness_day_filter)

# repeat with other filters
kindness_week_filter = (
    (ds.field('year') == 2019)
    & (ds.field('month') == 11)
    & (ds.field('day') >= 10)
    & (ds.field('day') <= 16)
)
{code}
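For contrast, a local sketch of the construction-time style (toy data; the DNF tuple form is what {{pq.ParquetDataset}}'s {{filters}} keyword accepts, and a new dataset must be built for each predicate):

{code:python}
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in for the hive-partitioned S3 dataset.
src = pa.table({'year': [2019, 2019, 2020], 'value': [1.0, 2.0, 3.0]})
root = tempfile.mkdtemp()
pq.write_to_dataset(src, root, partition_cols=['year'])

# The filter is fixed at construction time: one ParquetDataset per predicate.
dataset = pq.ParquetDataset(root, filters=[('year', '==', 2019)])
table = dataset.read()
{code}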

If neither of these is acceptable, note that {{pq.ParquetDataset}} can 
be constructed from a list of paths of data files. If an explicit list of files 
is available then the dataset can be (repeatedly) constructed without invoking 
a directory listing method. 
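A sketch of that explicit-list construction (local toy files with hypothetical names standing in for the known S3 paths):

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

root = tempfile.mkdtemp()
paths = []
for day in (13, 14):
    p = os.path.join(root, f'part-{day}.parquet')  # hypothetical naming
    pq.write_table(pa.table({'day': [day], 'value': [float(day)]}), p)
    paths.append(p)

# No directory listing happens: the dataset is built from the known file list.
dataset = pq.ParquetDataset(paths)
table = dataset.read()
{code}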

FWIW, we have https://issues.apache.org/jira/browse/ARROW-8163 to track lazier 
construction of datasets, and the only issue I'm aware of for supporting more 
sophisticated listing behavior is 
https://issues.apache.org/jira/browse/ARROW-6257 . It seems that before lazy 
construction of a dataset could benefit this case, we'd also need to support 
producing a stream of results from querying a FileSelector. [~apitrou]

https://github.com/apache/arrow/pull/9632

[~jorisvandenbossche] 

> [C++][Dataset] Partition level filters should be able to provide filtering to 
> file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases 
> to use it to optimize file system list calls.  This can greatly improve the 
> speed for reading data from partitions because fewer number of 
> directories/files need to be explored/expanded.  I've fallen behind on the 
> dataset code, but I want to make sure this issue is tracked someplace.  This 
> came up in SO question linked below (feel free to correct my analysis if I 
> missed the functionality someplace).
> Reference: 
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
