[ 
https://issues.apache.org/jira/browse/ARROW-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179484#comment-17179484
 ] 

Joris Van den Bossche commented on ARROW-9748:
----------------------------------------------

I am not sure we need to remove the "constructor-from-selector" at all. In any 
case, removing it does not seem to be required in order to remove the 
"ignore_prefixes" functionality: passing just a selector, without any 
"ignore_prefixes" logic, does not introduce any ambiguity or edge cases, I think.

_If_ we remove the custom "ignore_prefixes" behaviour from the 
FileSystemDatasetFactory discovery functionality, the question is how to 
replace it:

- Do we add similar functionality to {{FileSelector}} ? So we still provide 
basic "ignore_prefix" functionality in C++ (for our python/R bindings to use), 
but move it from the dataset discovery implementation to the filesystem 
(FileSelector) implementation (which is maybe a more logical place to put this)
- Or do we fully delegate this responsibility of providing "ignore_prefix" (or 
more advanced file filtering) functionality to the bindings? So both Python and 
R need to implement this in their "(open_)dataset(..)" wrapper, both using 
tools available in their language.
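
As a rough illustration of the second option, a binding-side filter could be 
sketched in plain Python as below. The function name {{filter_ignored}}, its 
parameters, and the rule of checking only the path segments relative to the 
base directory are assumptions made for this sketch, not existing Arrow API or 
a description of the current C++ behaviour.

```python
from pathlib import PurePosixPath


def filter_ignored(paths, base_dir, ignore_prefixes=(".", "_")):
    """Hypothetical helper: drop paths that have an ignored segment below base_dir.

    Only the segments *relative to* base_dir are checked, so a base directory
    whose own name starts with "_" or "." does not hide the whole dataset.
    """
    base = PurePosixPath(base_dir)
    prefixes = tuple(ignore_prefixes)
    kept = []
    for path in paths:
        # relative_to raises ValueError if path is not located under base_dir
        relative = PurePosixPath(path).relative_to(base)
        if any(part.startswith(prefixes) for part in relative.parts):
            continue
        kept.append(path)
    return kept


listing = [
    "/data/_my/part-0.parquet",
    "/data/_my/_metadata",
    "/data/_my/.DS_Store",
    "/data/_my/year=2020/part-1.parquet",
    "/data/_my/_tmp/part-2.parquet",
]
selected = filter_ignored(listing, "/data/_my")
```

The resulting explicit list of files could then be handed to the factory's 
existing constructor-from-paths, which is the workflow the issue description 
proposes.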

Another feature we might want to provide which is related to this is basic 
"globbing" / wildcard syntax, like {{"/my/dataset/*.parquet"}}, which can also 
give a way to filter out unwanted files (and the same question applies here: do 
we want to add this feature to the FileSelector, or is this a responsibility of 
the bindings?)
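
If globbing were left to the bindings, a minimal Python version might look 
like the sketch below. It assumes the file listing is already available as 
POSIX-style path strings; {{select_by_glob}} is a hypothetical name, and the 
matching semantics are those of the standard library's 
{{PurePosixPath.match}}, not of anything in Arrow.

```python
from pathlib import PurePosixPath


def select_by_glob(paths, pattern):
    """Hypothetical helper: keep only the paths matching a glob pattern.

    With an absolute pattern, PurePosixPath.match requires the whole path to
    match, and "*" does not cross "/" boundaries, so files in nested
    directories are not selected by "/my/dataset/*.parquet".
    """
    return [p for p in paths if PurePosixPath(p).match(pattern)]


listing = [
    "/my/dataset/a.parquet",
    "/my/dataset/_metadata",
    "/my/dataset/sub/b.parquet",
    "/my/dataset/c.csv",
]
parquet_files = select_by_glob(listing, "/my/dataset/*.parquet")
```

Whether a wildcard should descend into subdirectories is exactly the kind of 
semantic choice that would need to be agreed on if this moved into 
{{FileSelector}} instead.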


> [C++][Dataset] Remove Selector, ignore_prefixes from FileSystemDatasetFactory
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-9748
>                 URL: https://issues.apache.org/jira/browse/ARROW-9748
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 1.0.0
>            Reporter: Ben Kietzman
>            Priority: Major
>              Labels: dataset
>             Fix For: 2.0.0
>
>
> Currently FileSystemDatasetFactory can be constructed with an explicit 
> listing of files or with a {{fs::FileSelector}}. Since the selector does not 
> support sophisticated selection criteria, 
> {{FileSystemFactoryOptions::selector_ignore_prefixes}} is provided to allow 
> users to exclude undesired files such as {{_metadata}} or {{.DS_STORE}}.
> The selector + ignored prefixes mechanism is inflexible with numerous edge 
> cases ( ARROW-9644 ARROW-9573 ). Furthermore, implementing more advanced file 
> selection logic in dataset discovery prevents it from being reused by other 
> consumers of the file system api.
> Remove FileSystemDatasetFactory's constructor-from-selector, optionally 
> adding that functionality directly to {{fs::FileSelector}}. An explicit 
> listing of files for use in construction of a FileSystemDatasetFactory can 
> then be assembled using an {{fs::FileSelector}} and/or other globbing 
> libraries, with arbitrary inclusion logic.



