[
https://issues.apache.org/jira/browse/ARROW-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche resolved ARROW-10462.
-------------------------------------------
Fix Version/s: 3.0.0
Resolution: Fixed
Issue resolved by pull request 8539
[https://github.com/apache/arrow/pull/8539]
> [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows
> --------------------------------------------------------------------------
>
> Key: ARROW-10462
> URL: https://issues.apache.org/jira/browse/ARROW-10462
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.1, 3.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Dask reported some failures starting with the pyarrow 2.0 release, and
> specifically on Windows: https://github.com/dask/dask/issues/6754
> After some investigation, it seems that this is due to the {{path}} attribute
> of {{ParquetDatasetPiece}} now returning a path with a mixture of \\ and
> {{/}} in it.
> It specifically happens when dask is passing a posix-style base path pointing
> to the dataset base directory (so using all {{/}}), and passing an
> fsspec-based (local) filesystem.
> From a debugging output during one of the dask tests:
> {code}
> (Pdb) dataset
> <pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
> (Pdb) dataset.paths
> 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
> (Pdb) dataset.pieces[0].path
> 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
> {code}
> So you can see that the result here has a mix of \\ and {{/}}. Using
> pyarrow 1.0, this was consistently using {{/}}.
> The reason for the change is that in pyarrow 2.0 we started to replace fsspec
> LocalFileSystem with our own LocalFileSystem (assuming for a local filesystem
> that should be equivalent). But it seems that our own LocalFileSystem has a
> {{pathsep}} property that is equal to {{os.path.sep}}, which is \\ on
> Windows
> (https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306).
> So note that while this started being broken in pyarrow 2.0 when using an
> fsspec filesystem, it was already "broken" before when using our own local
> filesystem (or when not passing any filesystem). But 1) dask always passes
> an fsspec filesystem, and 2) dask uses the piece's path as a dictionary key
> and is thus especially sensitive to the change (when the path is only used as
> a file path to read something in, it will probably still work even with the
> mixture of path separators).
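
The effect described above can be reproduced without pyarrow or a Windows machine. The following is only a minimal sketch, not the pyarrow code path and not the fix from the pull request: it uses the standard-library {{ntpath}} and {{posixpath}} modules to stand in for {{os.path}} on Windows and on POSIX. The shortened base path, the {{lookup}} dictionary and the {{to_posix}} helper are hypothetical names introduced for illustration.

```python
import ntpath      # Windows flavour of os.path, importable on any platform
import posixpath   # POSIX flavour of os.path

# Hypothetical, shortened posix-style base path like the one dask passes in.
base = "C:/Users/joris/AppData/Local/Temp/dataset"
parts = ["a1=A", "a2=X", "part.0.parquet"]

# Joining with the Windows separator (the value of os.path.sep on Windows)
# leaves the "/" characters in the base untouched, so the result is mixed.
mixed = ntpath.sep.join([base] + parts)

# Joining with "/" keeps the path consistent with the base.
posix = posixpath.sep.join([base] + parts)


def to_posix(path):
    """Hypothetical helper: replace every backslash with a forward slash."""
    return path.replace(ntpath.sep, posixpath.sep)


# A dictionary keyed on the consistent path does not find the mixed one,
# which mirrors why using the piece's path as a key is sensitive to this.
lookup = {posix: "metadata"}
print(mixed in lookup)            # False
print(to_posix(mixed) in lookup)  # True
```

Normalizing the separator before using the path as a key is one possible workaround on the caller's side, under the assumption that the paths contain no backslashes that are meant literally.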