[
https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577474#comment-17577474
]
Weston Pace commented on ARROW-15943:
-------------------------------------
I could see a few different ways this could be implemented:
* We could add support for exclusion / inclusion filters for dataset
discovery. These could be regular expressions that are applied to the
filenames to determine whether we should or should not include them.
* We could do more to support custom partitioning functions. The user could
then create their own partitioning which includes this part of the filename as
a partitioning column.
* We could make sure that filtering on the filename column is supported (I am
not sure whether it is today). However, this approach has the downside of
loading all the unwanted data into memory before the filter is applied.
Do any of those approaches seem more appealing than the others?
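
To make the first option a bit more concrete, here is a rough sketch of what
inclusion / exclusion filtering on filenames could look like. It is plain
Python rather than Arrow code: the function name select_files, its include and
exclude parameters, and the example paths are all made up for illustration and
are not part of any existing Arrow API. The patterns are applied to the
basename only, so the partition directories are left alone.

```python
import re
from pathlib import PurePosixPath


def select_files(paths, include=None, exclude=None):
    """Return the paths whose basename matches `include` (if given)
    and does not match `exclude` (if given).

    Hypothetical helper, not an Arrow function.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    selected = []
    for path in paths:
        # Only the filename is tested, never the partition directories.
        name = PurePosixPath(path).name
        if include_re and not include_re.search(name):
            continue
        if exclude_re and exclude_re.search(name):
            continue
        selected.append(path)
    return selected


# Assumed layout: two kinds of files sharing one partitioned directory tree.
discovered = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

summaries = select_files(discovered, include=r"^summary")
predictions = select_files(discovered, exclude=r"^summary")
```

Because the filter runs at discovery time, the unwanted files would never be
opened, which is the main advantage over filtering on the filename column
after the fact.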
> [C++] Filter which files to be read in as part of filesystem, filtered using
> a string
> -------------------------------------------------------------------------------------
>
> Key: ARROW-15943
> URL: https://issues.apache.org/jira/browse/ARROW-15943
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
> Labels: dataset
>
> There is a report from a user (see this Stack Overflow post [1]) who has used
> the {{basename_template}} parameter to write files to a dataset, some of
> which have the prefix {{"summary"}} and others which have the prefix
> {{"prediction"}}. This data is saved in partitioned directories. They
> want to be able to read back in the data, so that, as well as the partition
> variables in their dataset, they can choose which subset (predictions vs.
> summaries) to read back in.
> This isn't currently possible; if they try to open a dataset with a list of
> files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their
> data is stored, but it could be useful to be able to pass in some sort of
> filter to determine which files get read in as a dataset.
>
> [1]
> https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r