[
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128405#comment-17128405
]
Francois Saint-Jacques edited comment on ARROW-7673 at 6/8/20, 3:56 PM:
------------------------------------------------------------------------
This has been refactored/fixed in ARROW-8058:
{code:python}
In [40]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", format="csv")
Out[40]: <pyarrow._dataset.FileSystemDataset at 0x7fef446b2930>
In [41]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", format="parquet")
...
OSError: Could not open parquet input source '/home/fsaintjacques/datasets/nyc-tlc/csv/2016/01/data.csv': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
In [42]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/parquet/2016", format="parquet")
Out[42]: <pyarrow._dataset.FileSystemDataset at 0x7fef447ad7f0>
{code}
> [C++][Dataset] Revisit File discovery failure mode
> --------------------------------------------------
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset
> Fix For: 1.0.0
>
>
> Currently, the default value of `FileSystemFactoryOptions::exclude_invalid_files`
> causes unsupported files (I/O errors, wrong format, corruption, missing
> compression codecs, etc.) to be silently ignored when creating a
> `FileSystemSource`.
> We should change this behavior so that the Inspect/Finish calls propagate an
> error by default, while still allowing the user to toggle
> `exclude_invalid_files`. The error should contain at least the file path and,
> where possible, a readable description of what went wrong.
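
The two failure modes described above can be illustrated with a small standard-library sketch. This is not the Arrow implementation: `is_parquet` and `discover` are hypothetical helpers written for this example, and the only Parquet fact assumed is that a Parquet file starts and ends with the 4-byte magic `PAR1`. The flag name mirrors `exclude_invalid_files` from the issue.

```python
import os

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes


def is_parquet(path):
    """Cheap format check: magic bytes present at both ends of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:
            # Too short to hold both a header and a footer magic.
            return False
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def discover(root, exclude_invalid_files=False):
    """Hypothetical discovery helper (illustration only).

    With exclude_invalid_files=False (the default proposed in the issue),
    the first invalid file raises an error naming its path. With True,
    invalid files are silently skipped.
    """
    valid = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            reason = "Parquet magic bytes not found"
            try:
                ok = is_parquet(path)
            except OSError as exc:
                # I/O failures count as invalid files too.
                ok = False
                reason = str(exc)
            if ok:
                valid.append(path)
            elif not exclude_invalid_files:
                raise OSError(
                    f"Could not open parquet input source '{path}': {reason}"
                )
    return sorted(valid)
```

Calling `discover(root)` on a directory that contains a CSV file raises an `OSError` naming that file, much like `In [41]` in the comment above, while `discover(root, exclude_invalid_files=True)` returns only the paths that pass the check.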