[ https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979242#comment-16979242 ]
Francois Saint-Jacques commented on ARROW-7224:
-----------------------------------------------

There is some confusion here between the new dataset API (in C++) and the existing ParquetDataset, which is implemented purely in Python.

> [Python] Partition level filters should be able to provide filtering to file systems
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Micah Kornfield
>            Priority: Major
>
> When providing a filter for partitions, it should be possible in some cases
> to use it to optimize file system list calls. This can greatly improve the
> speed of reading data from partitions because fewer directories/files need
> to be explored/expanded. I've fallen behind on the dataset code, but I want
> to make sure this issue is tracked someplace. This came up in the SO question
> linked below (feel free to correct my analysis if I missed the functionality
> someplace).
> Reference:
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
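
To make the optimization requested in the issue description concrete, here is a minimal sketch of pruning list calls on a hive-partitioned directory tree. It is not how ParquetDataset or the C++ dataset API is implemented; it uses only the Python standard library on a local file system, and the names `parse_partition`, `list_files_pruned`, and the `filters` format (partition key mapped to a set of accepted string values) are hypothetical, chosen for illustration only.

```python
import os


def parse_partition(dirname):
    """Split a hive-style directory name 'key=value' into (key, value), or None."""
    key, sep, value = dirname.partition("=")
    return (key, value) if sep and key else None


def list_files_pruned(root, filters):
    """Return (files, list_calls) for the tree under root.

    Partition directories whose key appears in `filters` with a value that is
    not accepted are skipped, so their subtrees are never listed.
    `filters` is a hypothetical format: {partition_key: set_of_accepted_values}.
    """
    files, list_calls = [], 0
    stack = [root]
    while stack:
        current = stack.pop()
        list_calls += 1  # one directory listing per visited directory
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir():
                    part = parse_partition(entry.name)
                    if part and part[0] in filters and part[1] not in filters[part[0]]:
                        continue  # pruned: this subtree is never expanded
                    stack.append(entry.path)
                else:
                    files.append(entry.path)
    return sorted(files), list_calls
```

With a layout such as `year=2018/month=1/...`, calling `list_files_pruned(root, {"year": {"2019"}})` never lists anything under `year=2018`. On an object store like S3, where each listing is a network round trip, that reduction in list calls is the saving the issue is asking for.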