[jira] [Commented] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string

Nicola Crane (Jira) Tue, 09 Aug 2022 13:53:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-15943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577621#comment-17577621
 ]


Nicola Crane commented on ARROW-15943:
--------------------------------------

I'm not sure.

From an R perspective, if it's an option, I think it would be fine to support 
passing in a list of filenames while still being able to use the directory 
names as dataset variables, as R users are likely to be comfortable 
pre-filtering the list of files.  This feels like it would fit with option 3. 
I am currently working on ARROW-15260, which would allow users to add the 
fragment filename as a column that they could then filter on (though I recall 
you mentioning, in a conversation on that PR or ticket, that we can't properly 
do pushdown filtering using that yet?).  However, you mention the issue of 
loading the unwanted data into memory; I guess these users might choose to 
use something other than arrow if this was acceptable to them.

Option 1 sounds good too.

I don't fully understand what option 2 would look like, but if it's something 
we could wrap in R to achieve solutions to the 2 linked Stack Overflow 
questions, then great.  

Ultimately, I don't think there's an obvious best approach here.  I think 
solving for the simplest case ("I have directories containing files; I want 
to selectively load some of those files and also use the directory structure 
to create variables") will get us most of the way there, unless any 
super-specialist use cases emerge later.  Option 1 sounds potentially simplest?
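
To make that simplest case concrete, here is a rough sketch in plain Python 
(standard library only, not Arrow's API).  It assumes hive-style "key=value" 
directory names, and the file paths and the helper names partition_values and 
select_files are hypothetical, made up purely for illustration.  It pre-filters 
a list of files by basename prefix while still recovering the partition 
variables from the directory structure:

```python
from pathlib import PurePosixPath


def partition_values(path, root):
    """Parse hive-style key=value directory names between root and the file."""
    relative = PurePosixPath(path).relative_to(root)
    values = {}
    for part in relative.parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:  # ignore directories that are not key=value
            values[key] = value
    return values


def select_files(paths, root, prefix):
    """Keep files whose base name starts with prefix, plus their partition values."""
    selected = []
    for path in paths:
        if PurePosixPath(path).name.startswith(prefix):
            selected.append((path, partition_values(path, root)))
    return selected


# Hypothetical layout: two kinds of file sharing one partitioned directory tree.
paths = [
    "data/year=2021/month=1/summary-0.parquet",
    "data/year=2021/month=1/prediction-0.parquet",
    "data/year=2022/month=2/summary-0.parquet",
]

summaries = select_files(paths, "data", "summary")
for path, values in summaries:
    print(path, values)
```

Whichever option is chosen would presumably need to do something equivalent 
inside dataset discovery, so that the filtered-out files are never read.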




> [C++] Filter which files to be read in as part of filesystem, filtered using 
> a string
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-15943
>                 URL: https://issues.apache.org/jira/browse/ARROW-15943
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>              Labels: dataset
>
> There is a report from a user (see this Stack Overflow post [1]) who has used 
> the {{basename_template}} parameter to write files to a dataset, some of 
> which have the prefix {{"summary"}} and others which have the prefix 
> {{"prediction"}}.  This data is saved in partitioned directories.  They 
> want to be able to read back in the data, so that, as well as the partition 
> variables in their dataset, they can choose which subset (predictions vs. 
> summaries) to read back in.  
> This isn't currently possible; if they try to open a dataset with a list of 
> files, they cannot read it in as partitioned data.
> A short-term solution is to suggest they change the structure of how their 
> data is stored, but it could be useful to be able to pass in some sort of 
> filter to determine which files get read in as a dataset.
>  
> [1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
