[
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298178#comment-17298178
]
Ben Kietzman commented on ARROW-7224:
-------------------------------------
bq. I have a use case that involves a dataset with over 1M files on S3. I
update the cache file incrementally after an overnight update job completes,
avoiding having to reindex the entire dataset each time.
Another potential workaround would be to create a custom FileSystem which
replaces directory listing calls with reads of this cache file. In Python, this
can be done by subclassing
[FileSystemHandler|https://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystemHandler.html#pyarrow.fs.FileSystemHandler]
and wrapping the handler in a PyFileSystem, or through
[fsspec|https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems].
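To make the idea concrete, here is a minimal sketch of only the lookup part: answering directory listings from an in-memory index built from the cache file, with no call to the object store. It is plain Python and not pyarrow API. The class name CachedListing, the method list_dir, and the manifest format (one file path per line) are all hypothetical, chosen for illustration. In a real handler, something like this could presumably back FileSystemHandler.get_file_info_selector, but that wiring is not shown here.

```python
import posixpath
from collections import defaultdict


class CachedListing:
    """Hypothetical helper: serves directory listings from a cached manifest."""

    def __init__(self, manifest_lines):
        # Map each directory path to {child name: is_dir}. The root is "".
        self._children = defaultdict(dict)
        for line in manifest_lines:
            path = line.strip().strip("/")
            is_dir = False  # the manifest entry itself is assumed to be a file
            # Walk up the path, registering each component under its parent.
            while path:
                parent, name = posixpath.split(path)
                known_dir = self._children[parent].get(name, False)
                self._children[parent][name] = is_dir or known_dir
                path, is_dir = parent, True

    def list_dir(self, base_dir, recursive=False):
        """Return sorted (path, is_dir) pairs under base_dir from the index."""
        results = []
        stack = [base_dir.strip("/")]
        while stack:
            current = stack.pop()
            # .get avoids creating empty entries for unknown directories.
            for name, is_dir in self._children.get(current, {}).items():
                full = posixpath.join(current, name)
                results.append((full, is_dir))
                if is_dir and recursive:
                    stack.append(full)
        return sorted(results)


# Assumed manifest contents; in practice these lines would be read from the
# cache file that the overnight job maintains.
manifest = [
    "year=2020/month=01/part-0.parquet",
    "year=2020/month=02/part-0.parquet",
    "year=2021/month=01/part-0.parquet",
]
listing = CachedListing(manifest)
top_level = listing.list_dir("")
files_2020 = [p for p, is_dir in listing.list_dir("year=2020", recursive=True)
              if not is_dir]
```

Because the index is keyed by directory, a listing limited to one partition directory only touches the entries under that directory, which is the property the cache-file approach relies on.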
> [C++][Dataset] Partition level filters should be able to provide filtering to
> file systems
> ------------------------------------------------------------------------------------------
>
> Key: ARROW-7224
> URL: https://issues.apache.org/jira/browse/ARROW-7224
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Micah Kornfield
> Priority: Major
> Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases
> to use it to optimize file system list calls. This can greatly improve the
> speed for reading data from partitions because fewer
> directories/files need to be explored/expanded. I've fallen behind on the
> dataset code, but I want to make sure this issue is tracked someplace. This
> came up in the SO question linked below (feel free to correct my analysis if I
> missed the functionality someplace).
> Reference:
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]