[jira] [Commented] (ARROW-7224) [C++][Dataset] Partition level filters should be able to provide filtering to file systems

Micah Kornfield (Jira) Thu, 11 Mar 2021 09:14:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299725#comment-17299725
 ]


Micah Kornfield commented on ARROW-7224:
----------------------------------------

[~jorisvandenbossche] I think this is the 
[relevant API|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownFilters.java] 
from DataSourceV2.

 

Exposing a "filter on construction" parameter seems like a bad user 
experience, but it could be a way to mitigate this.  I think the workarounds 
[~bkietz] proposed are also workable.  As I've said before, I think supporting 
the feature this JIRA asks for is complex and potentially requires big changes 
to Datasets, so I understand if it isn't immediately prioritized (but I think 
it can have a large impact for common cases).

> [C++][Dataset] Partition level filters should be able to provide filtering to 
> file systems
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7224
>                 URL: https://issues.apache.org/jira/browse/ARROW-7224
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases 
> to use it to optimize file system list calls.  This can greatly improve the 
> speed of reading data from partitions because fewer directories/files need 
> to be explored/expanded.  I've fallen behind on the dataset code, but I want 
> to make sure this issue is tracked someplace.  This came up in the SO 
> question linked below (feel free to correct my analysis if I missed the 
> functionality someplace).
> Reference: 
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]
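
For illustration only, the idea in the issue description (using a partition filter to avoid file system list calls) can be sketched in plain Python. This is not Arrow's Dataset code or API: the names list_pruned and make_demo_tree, the dict-of-allowed-values filter format, and the demo directory layout are all assumptions made up for this sketch, and it only handles equality-style filters on hive-style `key=value` directories.

```python
import os
import tempfile


def list_pruned(root, allowed, counter):
    """List files under `root`, pruning hive-style `key=value` directories.

    `allowed` is a hypothetical filter format: dict mapping a partition key
    to the set of string values that pass the filter. `counter` tracks how
    many directory listings were issued.
    """
    counter["list_calls"] += 1
    files = []
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir():
            key, sep, value = entry.name.partition("=")
            # Prune before descending: an excluded partition directory is
            # never listed, so its whole subtree costs no list calls.
            if sep and key in allowed and value not in allowed[key]:
                continue
            files.extend(list_pruned(entry.path, allowed, counter))
        else:
            files.append(entry.path)
    return files


def make_demo_tree(root):
    """Create an empty year=/month= directory tree with placeholder files."""
    for year in ("2019", "2020"):
        for month in ("1", "2", "3"):
            part_dir = os.path.join(root, f"year={year}", f"month={month}")
            os.makedirs(part_dir)
            # Empty placeholder; not a real Parquet file.
            with open(os.path.join(part_dir, "part-0.parquet"), "w"):
                pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        unfiltered = {"list_calls": 0}
        list_pruned(root, {}, unfiltered)
        filtered = {"list_calls": 0}
        list_pruned(root, {"year": {"2020"}, "month": {"1"}}, filtered)
        print("list calls without filter:", unfiltered["list_calls"])
        print("list calls with filter:", filtered["list_calls"])
```

On a local disk the saving is small, but on an object store such as S3 each avoided listing is a network round trip, which is the case the linked SO question is about.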



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
