Nicola Crane created ARROW-15943:
------------------------------------
Summary: [C++] Filter which files to be read in as part of
filesystem, filtered using a string
Key: ARROW-15943
URL: https://issues.apache.org/jira/browse/ARROW-15943
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Nicola Crane
There is a report from a user (see this Stack Overflow post [1]) who has used
the {{basename_template}} parameter to write files to a dataset, some of which
have the prefix {{"summary"}} and others which have the prefix
{{"prediction"}}. This data is saved in partitioned directories. When reading
the data back in, they want to keep the partition variables in their dataset
and also choose which subset (predictions vs. summaries) to read.
This isn't currently possible; if they try to open a dataset with a list of
files, they cannot read it in as partitioned data.
A short-term solution is to suggest they change the structure of how their data
is stored, but it could be useful to be able to pass in some sort of filter to
determine which files get read in as a dataset.
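As a rough illustration of the kind of filter meant here, the sketch below
selects files by basename prefix while still recovering partition values from
the directory names. It is plain Python and does not use or represent any
Arrow API. The helper names, the example paths, and the assumption that the
directories are Hive-style ({{key=value}}) are all hypothetical.

```python
from pathlib import PurePosixPath


def select_by_basename_prefix(paths, prefix):
    """Keep only the paths whose file name starts with `prefix`."""
    return [p for p in paths if PurePosixPath(p).name.startswith(prefix)]


def hive_partition_values(path, base_dir):
    """Collect key=value directory segments between base_dir and the file."""
    relative = PurePosixPath(path).relative_to(base_dir)
    values = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not key=value
            values[key] = value
    return values


# Hypothetical layout: two kinds of files share one partitioned tree.
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

# Filter on the file-name prefix, then attach the partition values per file.
summaries = select_by_basename_prefix(paths, "summary")
summary_partitions = {p: hive_partition_values(p, "data") for p in summaries}
```

A filter built into dataset discovery would let the same selection happen
while the dataset is opened, so the partition columns would not need to be
reconstructed by hand.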
[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
--
This message was sent by Atlassian Jira
(v8.20.1#820001)