[
https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577621#comment-17577621
]
Nicola Crane commented on ARROW-15943:
--------------------------------------
I'm not sure.
From an R perspective, if it's an option, I think it would be fine to support
passing in a list of filenames but still being able to use the directory names
as dataset variables, if that's possible (as R users are likely to be
comfortable pre-filtering the list of files). This feels like it would fit
with option 3; I am currently working on ARROW-15260, which would allow users
to add the fragment filename as a column, which they could then use to filter
on (though I recall you mentioning, in a conversation on that PR or ticket,
that we can't properly do pushdown filtering yet using that?). However, you
mention the issue of loading the unwanted data into memory; I guess users for
whom that was acceptable might choose to use something other than arrow.
Option 1 sounds good too.
I don't fully understand what option 2 would look like, but if it's something
we could wrap in R to achieve solutions to the 2 linked Stack Overflow
questions, then great.
Ultimately, I don't think there's an obvious best approach here, and I think
that solving for the simplest case ("I have directories containing files, from
which I wish to selectively load some files, while also using the directory
structure to create variables") will get us most of the way there unless any
super-specialist use cases emerge later. Option 1 sounds potentially simplest?
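
To make that simplest case concrete, here is a minimal sketch in plain Python
(standard library only, no Arrow calls). It assumes hive-style "key=value"
directory names, which the ticket does not confirm, and the paths and the
helper names select_files and partition_values are hypothetical, made up for
illustration. It only shows the two steps a user would want combined:
pre-filter the file list by filename prefix, then still recover the partition
variables from each file's directories.

```python
from pathlib import PurePosixPath


def select_files(paths, prefix):
    """Keep only the paths whose file name starts with `prefix`."""
    return [p for p in paths if PurePosixPath(p).name.startswith(prefix)]


def partition_values(path, base_dir):
    """Collect key=value directory segments between base_dir and the file."""
    relative = PurePosixPath(path).relative_to(base_dir)
    values = {}
    for segment in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:  # ignore directories that are not of the form key=value
            values[key] = value
    return values


# Hypothetical layout, invented for illustration (assumes hive-style dirs).
paths = [
    "data/year=2021/summary-0.parquet",
    "data/year=2021/prediction-0.parquet",
    "data/year=2022/summary-0.parquet",
    "data/year=2022/prediction-0.parquet",
]

selected = select_files(paths, "prediction")
for path in selected:
    print(path, partition_values(path, "data"))
```

Whether Arrow should do the equivalent of the second step when it is handed an
explicit list of files is exactly the open question in this ticket; the sketch
does not claim to show how any of the three options would be implemented.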
> [C++] Filter which files to be read in as part of filesystem, filtered using
> a string
> -------------------------------------------------------------------------------------
>
> Key: ARROW-15943
> URL: https://issues.apache.org/jira/browse/ARROW-15943
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
> Labels: dataset
>
> There is a report from a user (see this Stack Overflow post [1]) who has used
> the {{basename_template}} parameter to write files to a dataset, some of
> which have the prefix {{"summary"}} and others which have the prefix
> {{"prediction"}}. This data is saved in partitioned directories. They
> want to be able to read back in the data, so that, as well as the partition
> variables in their dataset, they can choose which subset (predictions vs.
> summaries) to read back in.
> This isn't currently possible; if they try to open a dataset with a list of
> files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their
> data is stored, but it could be useful to be able to pass in some sort of
> filter to determine which files get read in as a dataset.
>
> [1]
> https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
--
This message was sent by Atlassian Jira
(v8.20.10#820010)