[
https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-15943:
--------------------------------
Labels: dataset good-second-issue (was: dataset)
> [C++] Filter which files to be read in as part of filesystem, filtered using
> a string
> -------------------------------------------------------------------------------------
>
> Key: ARROW-15943
> URL: https://issues.apache.org/jira/browse/ARROW-15943
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
> Labels: dataset, good-second-issue
>
> There is a report from a user (see this Stack Overflow post [1]) who has used
> the {{basename_template}} parameter to write files to a dataset, some of
> which have the prefix {{"summary"}} and others which have the prefix
> {{"prediction"}}. This data is saved in partitioned directories. They
> want to be able to read back in the data, so that, as well as the partition
> variables in their dataset, they can choose which subset (predictions vs.
> summaries) to read back in.
> This isn't currently possible; if they try to open a dataset with a list of
> files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their
> data is stored, but it could be useful to be able to pass in some sort of
> filter to determine which files get read in as a dataset.
>
> [1]
> [https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r]
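
For illustration only, here is a minimal Python sketch of the kind of file filter the description asks for. It is not Arrow's API and uses only the standard library. It assumes hive-style partition directories (for example year=2020/), which the report does not confirm. The names select_files and make_demo_tree, and the demo file names, are hypothetical. The sketch walks a base directory, keeps files whose basename starts with a given string, and recovers the partition values from each file's path, so the filter and the partition information are available together.

```python
import tempfile
from pathlib import Path


def select_files(base_dir, prefix):
    """Return (path, partition_values) pairs for files under base_dir
    whose basename starts with `prefix`.

    Assumption: partition directories are hive-style, i.e. named key=value.
    Directory segments without '=' are ignored.
    """
    base = Path(base_dir)
    selected = []
    for path in sorted(base.rglob("*")):
        # Keep only regular files whose name matches the requested prefix.
        if not path.is_file() or not path.name.startswith(prefix):
            continue
        partitions = {}
        # Parse each directory segment between base_dir and the file.
        for segment in path.relative_to(base).parent.parts:
            key, sep, value = segment.partition("=")
            if sep:
                partitions[key] = value
        selected.append((path, partitions))
    return selected


def make_demo_tree(root):
    """Create a small hypothetical layout mixing two file prefixes
    inside the same partition directories."""
    for year in ("2020", "2021"):
        part_dir = Path(root) / f"year={year}"
        part_dir.mkdir(parents=True, exist_ok=True)
        (part_dir / "summary-0.parquet").touch()
        (part_dir / "prediction-0.parquet").touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        for path, partitions in select_files(root, "summary"):
            print(path.name, partitions)
```

A dataset reader could apply the same predicate during file discovery, before partition parsing, so that the selected files still carry their partition columns. Whether the filter should be a prefix, a glob, or a regular expression is left open here, as it is in the issue.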
--
This message was sent by Atlassian Jira
(v8.20.10#820010)