[ 
https://issues.apache.org/jira/browse/ARROW-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967516#comment-16967516
 ] 

Francois Saint-Jacques commented on ARROW-7061:
-----------------------------------------------

I started adding features to fs::Selector, notably a maximum recursion depth. I
intended to add a filter function option to the selector as well, but [~apitrou]
objected mildly, arguing that a user who wants this can filter the
explicit list of FileStats returned by the selector. This is why
FileSystemDataSourceDiscovery supports both options (an explicit list of
FileStats or a Selector).
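
As a rough illustration of the filter-the-list approach, here is a minimal sketch. It uses a simplified, hypothetical `FileStats` struct (a stand-in, not the real fs::FileStats class), and the helpers `FilterStats` and `IsVisibleFile` are made up for this example. The idea is to drop hidden files such as .DS_Store before handing the list to discovery.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for fs::FileStats (assumption, not the real class).
struct FileStats {
  std::string path;
  bool is_file;
};

using StatsFilter = std::function<bool(const FileStats&)>;

// Return only the entries for which `keep` returns true.
std::vector<FileStats> FilterStats(const std::vector<FileStats>& stats,
                                   const StatsFilter& keep) {
  std::vector<FileStats> out;
  std::copy_if(stats.begin(), stats.end(), std::back_inserter(out), keep);
  return out;
}

// Example predicate: keep regular files whose base name does not start with '.'.
bool IsVisibleFile(const FileStats& s) {
  if (!s.is_file) return false;
  const auto slash = s.path.find_last_of('/');
  const std::string base =
      (slash == std::string::npos) ? s.path : s.path.substr(slash + 1);
  return !base.empty() && base[0] != '.';
}
```

With this shape, the caller expands the selector, calls `FilterStats(stats, IsVisibleFile)`, and passes the result as the explicit list, so the selector itself needs no filter option.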

Ideally, we want the "it-just-works" feeling. Some suggestions:
 * Detect failures early: should `FileSystemBasedDataSource::Make` scan all
files and check that the format driver can parse them? How should we
propagate a failure: ignore the file and warn, or abort with an error?
The `Inspect` call already detects unparseable files implicitly.
 * Should we filter by file extension by default (when the user passes a
Selector rather than an explicit list of FileStats)? At first this seems very
convenient, but it can lead to important files being silently ignored
just because of an implicit naming convention.
 * Should we settle on the `Selector` constructor being the it-just-works route,
and the explicit vector<FileStats> being the power-user route?
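
To make the first two questions concrete, here is a small sketch of a content-based check with a configurable failure policy. It is an alternative to extension filtering, not a description of what `Inspect` does. It relies on the fact that a Parquet file starts and ends with the 4-byte magic `PAR1`, and it does not handle other variants. `Candidate`, `OnInvalid`, `LooksLikeParquet` and `SelectParquet` are hypothetical names for this example and not part of Arrow. Files are modelled as in-memory byte strings to keep the sketch self-contained.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a discovered file: its path plus its raw bytes.
struct Candidate {
  std::string path;
  std::string bytes;
};

// Hypothetical policy for files that fail the check.
enum class OnInvalid { kSkip, kAbort };

// A Parquet file begins and ends with the magic "PAR1". The smallest possible
// layout is magic + 4-byte footer length + magic, i.e. 12 bytes.
bool LooksLikeParquet(const std::string& bytes) {
  static const std::string kMagic = "PAR1";
  if (bytes.size() < 12) return false;
  return bytes.compare(0, 4, kMagic) == 0 &&
         bytes.compare(bytes.size() - 4, 4, kMagic) == 0;
}

// Return the paths that pass the check. Under kSkip, rejected paths are
// recorded in `skipped` (if non-null) so the caller can warn about them;
// under kAbort, the first rejected file raises an error.
std::vector<std::string> SelectParquet(const std::vector<Candidate>& files,
                                       OnInvalid policy,
                                       std::vector<std::string>* skipped) {
  std::vector<std::string> accepted;
  for (const auto& f : files) {
    if (LooksLikeParquet(f.bytes)) {
      accepted.push_back(f.path);
    } else if (policy == OnInvalid::kAbort) {
      throw std::runtime_error("not a Parquet file: " + f.path);
    } else if (skipped != nullptr) {
      skipped->push_back(f.path);
    }
  }
  return accepted;
}
```

Unlike an extension filter, a check like this does not depend on naming conventions, and returning the skipped paths avoids ignoring files silently. The cost is one read per file at discovery time.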

> [C++][Dataset] FileSystemDiscovery with ParquetFileFormat should ignore files 
> that aren't Parquet
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7061
>                 URL: https://issues.apache.org/jira/browse/ARROW-7061
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++ - Dataset
>            Reporter: Neal Richardson
>            Priority: Major
>
> I got {{Invalid parquet file. Corrupt footer.}} trying to read real data. 
> Turned out it was because I had opened the directory in macOS Finder and it 
> had added the junk .DS_Store files. Once I deleted them, the Dataset created 
> fine. 
> If we're creating a DataSource with Parquet files, we should ignore any 
> non-Parquet files we encounter when scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
