[ https://issues.apache.org/jira/browse/ARROW-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179484#comment-17179484 ]
Joris Van den Bossche commented on ARROW-9748:
----------------------------------------------
I am not sure we need to remove the "constructor-from-selector" at all. In any
case, removing it does not seem necessary in order to remove the
"ignore_prefixes" functionality (passing just a selector without any
"ignore_prefixes" logic doesn't introduce any ambiguity or edge cases, I think).

_If_ we remove the custom "ignore_prefixes" behaviour from the
FileSystemDatasetFactory discovery functionality, the question is how to
replace it:
- Do we add similar functionality to {{FileSelector}} ? So we still provide
basic "ignore_prefix" functionality in C++ (for our python/R bindings to use),
but move it from the dataset discovery implementation to the filesystem
(FileSelector) implementation (which is maybe a more logical place to put this)
- Or do we fully delegate this responsibility of providing "ignore_prefix" (or
more advanced file filtering) functionality to the bindings? So both Python and
R need to implement this in their "(open_)dataset(..)" wrapper, both using
tools available in their language.
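
To make the second option a bit more concrete, here is a minimal sketch of what a binding-side "ignore_prefix" filter could look like in Python. It uses only the standard library, not the Arrow API; the name {{filter_ignored}} and the default prefixes are hypothetical, and the rule that only path segments below the base directory are checked is an assumption for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical defaults, chosen for illustration only.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def filter_ignored(paths, base_dir, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Keep the paths whose segments below base_dir do not start with an ignored prefix.

    Raises ValueError if a path is not located under base_dir.
    """
    base = PurePosixPath(base_dir)
    prefixes = tuple(ignore_prefixes)
    kept = []
    for path in paths:
        # Only the part of the path relative to base_dir is inspected, so a
        # base directory that itself starts with "_" does not hide every file.
        segments = PurePosixPath(path).relative_to(base).parts
        if not any(segment.startswith(prefixes) for segment in segments):
            kept.append(path)
    return kept


listing = [
    "/data/_staging/part-0.parquet",
    "/data/_staging/_metadata",
    "/data/_staging/.DS_STORE",
    "/data/_staging/year=2020/part-1.parquet",
    "/data/_staging/_tmp/part-2.parquet",
]
selected = filter_ignored(listing, "/data/_staging")
```

The resulting explicit list of files is what the wrapper would then hand to the dataset construction, in place of a selector plus ignore prefixes.
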
A related feature we might want to provide is basic "globbing" / wildcard
syntax, like {{"/my/dataset/*.parquet"}}, which would also give a way to filter
out unwanted files. The same question applies here: do we want to add this
feature to the FileSelector, or is it a responsibility of the bindings?
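
As an illustration of the bindings-side variant of globbing, a small sketch in Python, again using only the standard library. The helper name {{select_by_glob}} is hypothetical, and it assumes the candidate paths have already been listed (for example by a selector); {{PurePosixPath.match}} is used so that {{*}} does not cross directory boundaries.

```python
from pathlib import PurePosixPath


def select_by_glob(paths, pattern):
    """Return the paths that match a shell-style wildcard pattern.

    With an absolute pattern, PurePosixPath.match requires the whole path
    to match, and "*" only matches within a single path segment.
    """
    return [path for path in paths if PurePosixPath(path).match(pattern)]


candidates = [
    "/my/dataset/a.parquet",
    "/my/dataset/_metadata",
    "/my/dataset/c.csv",
    "/my/dataset/sub/b.parquet",
]
matched = select_by_glob(candidates, "/my/dataset/*.parquet")
```

Note that with this matching rule the file in the {{sub}} directory is not selected, so a partitioned dataset would need either a pattern per directory level or a recursive wildcard, which is one of the semantics we would have to decide on.
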
> [C++][Dataset] Remove Selector, ignore_prefixes from FileSystemDatasetFactory
> -----------------------------------------------------------------------------
>
> Key: ARROW-9748
> URL: https://issues.apache.org/jira/browse/ARROW-9748
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 1.0.0
> Reporter: Ben Kietzman
> Priority: Major
> Labels: dataset
> Fix For: 2.0.0
>
>
> Currently FileSystemDatasetFactory can be constructed with an explicit
> listing of files or with a {{fs::FileSelector}}. Since the selector does not
> support sophisticated selection criteria,
> {{FileSystemFactoryOptions::selector_ignore_prefixes}} is provided to allow
> users to exclude undesired files such as {{_metadata}} or {{.DS_STORE}}.
> The selector + ignored prefixes mechanism is inflexible and has numerous edge
> cases (ARROW-9644, ARROW-9573). Furthermore, implementing more advanced file
> selection logic in dataset discovery prevents it from being reused by other
> consumers of the filesystem API.
> Remove FileSystemDatasetFactory's constructor-from-selector, optionally
> adding the file-exclusion functionality directly to {{fs::FileSelector}}. An
> explicit listing of files for use in construction of a
> FileSystemDatasetFactory can then be assembled using an {{fs::FileSelector}}
> and/or other globbing libraries, with arbitrary inclusion logic.