[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string

Nicola Crane (Jira) Tue, 15 Mar 2022 10:22:06 -0700

Nicola Crane created ARROW-15943:
------------------------------------

             Summary: [C++] Filter which files to be read in as part of 
filesystem, filtered using a string
                 Key: ARROW-15943
                 URL: https://issues.apache.org/jira/browse/ARROW-15943
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Nicola Crane



A user reports (see this Stack Overflow post [1]) that they used the 
{{basename_template}} parameter to write files to a dataset, some with the 
prefix {{"summary"}} and others with the prefix {{"prediction"}}.  The data is 
saved in partitioned directories.  When reading the data back in, they want to 
choose which subset (predictions vs. summaries) to read, in addition to 
filtering on the partition variables in their dataset.

This isn't currently possible; if they try to open a dataset from an explicit 
list of files, they cannot read it in as partitioned data.

A short-term solution is to suggest they change how their data is stored, but 
it could be useful to be able to pass in a filter (for example, a string 
matched against file names) that determines which files get read in as a 
dataset.
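
As a rough illustration of the kind of filter described above, here is a 
minimal Python sketch using only the standard library.  It is not an Arrow 
API: the helper names {{select_files}} and {{partition_values}}, the {{data}} 
base directory, the file names, and the Hive-style {{key=value}} directory 
layout are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def select_files(paths, prefix):
    """Keep only the paths whose file name starts with `prefix`."""
    return [p for p in paths if PurePosixPath(p).name.startswith(prefix)]


def partition_values(path, base_dir):
    """Parse key=value directory segments between `base_dir` and the file.

    Assumes a Hive-style layout; segments without '=' are ignored.
    """
    relative = PurePosixPath(path).relative_to(base_dir)
    values = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


# Hypothetical listing of a partitioned directory holding both kinds of file.
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

# Select one subset by file-name prefix, keeping the partition information.
predictions = select_files(paths, "prediction")
for path in predictions:
    print(path, partition_values(path, "data"))
```

The point of the sketch is that the string filter is applied to file names 
only, so the partition values encoded in the directory names remain available 
for the selected files.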

 

[1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
