Micah Kornfield created ARROW-7224:
--------------------------------------
Summary: [C++][Datasets] Partition level filters should be able to
provide filtering to file systems
Key: ARROW-7224
URL: https://issues.apache.org/jira/browse/ARROW-7224
Project: Apache Arrow
Issue Type: Improvement
Components: C++, C++ - Dataset
Reporter: Micah Kornfield
When providing a filter for partitions, it should be possible in some cases to
use it to optimize file system list calls. This can greatly improve the speed
of reading data from partitions because fewer directories/files need
to be explored/expanded. I've fallen behind on the dataset code, but I want to
make sure this issue is tracked someplace. This came up in the SO question linked
below (feel free to correct my analysis if I missed the functionality
someplace).
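
To make the idea concrete, here is a minimal Python sketch of how an equality
filter on hive-style partition keys could cut down list calls. It is not the
Arrow dataset API: the function pruned_leaf_dirs, the list_dir callback, and
the in-memory TREE are all hypothetical names invented for illustration, and
the sketch only handles equality filters.

```python
from typing import Callable, Dict, List, Sequence


def pruned_leaf_dirs(
    list_dir: Callable[[str], List[str]],
    root: str,
    partition_keys: Sequence[str],
    equals: Dict[str, str],
) -> List[str]:
    """Return candidate leaf partition directories (hypothetical helper).

    `list_dir` stands in for a file system list call and returns the child
    names of a directory. `equals` maps partition keys to required values.
    """
    frontier = [root]
    for key in partition_keys:
        next_frontier = []
        for directory in frontier:
            if key in equals:
                # The filter pins this level, so the path can be built
                # directly without a list call. The directory may not exist;
                # a later file listing would have to tolerate that.
                next_frontier.append(f"{directory}/{key}={equals[key]}")
            else:
                # No filter on this level: list to discover the values.
                next_frontier.extend(
                    f"{directory}/{name}"
                    for name in list_dir(directory)
                    if name.startswith(f"{key}=")
                )
        frontier = next_frontier
    return frontier


# Toy in-memory directory tree (assumption, stands in for S3 or a local disk).
TREE = {
    "data": ["year=2018", "year=2019"],
    "data/year=2018": ["month=1", "month=2"],
    "data/year=2019": ["month=1", "month=2"],
}
calls: List[str] = []


def list_dir(path: str) -> List[str]:
    calls.append(path)  # record every simulated list call
    return TREE.get(path, [])


if __name__ == "__main__":
    dirs = pruned_leaf_dirs(list_dir, "data", ["year", "month"], {"year": "2019"})
    print(dirs, "list calls:", len(calls))
```

With the filter on year, the sketch issues one list call instead of the three
needed to expand the whole tree; on an object store each avoided call is a
network round trip.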
Reference:
[https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]