[ 
https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-15943:
--------------------------------
    Labels: dataset good-second-issue  (was: dataset)

> [C++] Filter which files to be read in as part of filesystem, filtered using 
> a string
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-15943
>                 URL: https://issues.apache.org/jira/browse/ARROW-15943
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>              Labels: dataset, good-second-issue
>
> There is a report from a user (see this Stack Overflow post [1]) who has used 
> the {{basename_template}} parameter to write files to a dataset, some of 
> which have the prefix {{"summary"}} and others which have the prefix 
> {{"prediction"}}.  This data is saved in partitioned directories.  They 
> want to be able to read back in the data, so that, as well as the partition 
> variables in their dataset, they can choose which subset (predictions vs. 
> summaries) to read back in.  
> This isn't currently possible; if they try to open a dataset with a list of 
> files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their 
> data is stored, but it could be useful to be able to pass in some sort of 
> filter to determine which files get read in as a dataset.
>  
> [1] 
> https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
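
For illustration only, the kind of filter requested above might behave like the sketch below. It is not Arrow's API: the function select_files, the glob pattern argument, and the example paths are all hypothetical, and it assumes hive-style key=value partition directories. It uses only the Python standard library to show the idea of selecting files by basename while still recovering partition values from the directory names.

```python
import fnmatch
from pathlib import PurePosixPath


def select_files(paths, base_dir, pattern):
    """Hypothetical helper: keep files whose basename matches a glob pattern
    and parse key=value directory names (relative to base_dir) as partitions."""
    selected = []
    for path in paths:
        p = PurePosixPath(path)
        # Filter on the file name only, not on the directory part.
        if not fnmatch.fnmatchcase(p.name, pattern):
            continue
        # Directory components between base_dir and the file name.
        rel = p.relative_to(base_dir)
        partitions = dict(
            part.split("=", 1) for part in rel.parts[:-1] if "=" in part
        )
        selected.append((path, partitions))
    return selected


# Made-up layout: two kinds of files written into the same partition directories.
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

summaries = select_files(paths, "data", "summary*")
for path, partitions in summaries:
    print(path, partitions)
```

A glob on the basename is only one possible form for the "string" in the issue title; a prefix match or a regular expression would fit the same shape.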



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
