[ 
https://issues.apache.org/jira/browse/ARROW-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10462.
-------------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 8539
[https://github.com/apache/arrow/pull/8539]

> [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10462
>                 URL: https://issues.apache.org/jira/browse/ARROW-10462
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1, 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Dask reported some failures starting with the pyarrow 2.0 release, and 
> specifically on Windows: https://github.com/dask/dask/issues/6754
> After some investigation, it seems that this is due to the {{path}} 
> attribute of {{ParquetDatasetPiece}} now returning a path with a 
> mixture of \\ and {{/}} in it. 
> It specifically happens when dask is passing a posix-style base path pointing 
> to the dataset base directory (so using all {{/}}), and passing an 
> fsspec-based (local) filesystem.  
> From a debugging output during one of the dask tests:
> {code}
> (Pdb) dataset
> <pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
> (Pdb) dataset.paths
> 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
> (Pdb) dataset.pieces[0].path
> 'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
> {code}
> So you can see that the result here has a mix of \\ and {{/}}. Using 
> pyarrow 1.0, this was consistently using {{/}}.
> The reason for the change is that in pyarrow 2.0 we started to replace the fsspec 
> LocalFileSystem with our own LocalFileSystem (assuming that the two should be 
> equivalent for a local filesystem). But it seems that our own LocalFileSystem has a 
> {{pathsep}} property that equals {{os.path.sep}}, which is \\ on 
> Windows 
> (https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306).
> So note that while this started being broken in pyarrow 2.0 when using an fsspec 
> filesystem, this was already "broken" before when using our own local 
> filesystem (or when not passing any filesystem). But 1) dask always passes 
> an fsspec filesystem, and 2) dask uses the piece's path as a dictionary key and 
> is thus especially sensitive to the change (when the path is only used to read 
> a file, it will probably still work even with the mixture of path 
> separators).
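
The mechanism described above can be reproduced on any platform with a short
sketch. It uses the standard-library ntpath and posixpath modules to stand in
for Windows and POSIX separators. The base path, the partition names and the
to_posix helper are hypothetical and chosen for illustration only; this is not
necessarily the change made in pull request 8539.

```python
import ntpath      # Windows path rules, importable on every platform
import posixpath   # POSIX path rules

# Hypothetical posix-style base path, similar to what dask passes in.
base = "C:/Users/joris/dataset"
parts = ["a1=A", "a2=X", "part.0.parquet"]

# Joining with the Windows separator (the value of os.path.sep on Windows)
# yields a path that mixes "/" and "\".
mixed = ntpath.sep.join([base] + parts)

# Joining with "/" keeps the separators consistent.
consistent = posixpath.sep.join([base] + parts)


def to_posix(path):
    """Hypothetical helper: replace backslashes with forward slashes."""
    return path.replace(ntpath.sep, posixpath.sep)


# A dictionary keyed by the consistent path does not find the mixed path,
# which is why using the piece's path as a key is sensitive to the change.
lookup = {consistent: "metadata for part.0"}
assert mixed not in lookup
assert to_posix(mixed) in lookup
```

Normalizing on the caller's side, as to_posix does here, is only a workaround
sketch: it assumes that backslashes never occur as legitimate characters in
the file names.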



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
