[ 
https://issues.apache.org/jira/browse/ARROW-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10462:
------------------------------------------
    Description: 
Dask reported some failures starting with the pyarrow 2.0 release, and 
specifically on Windows: https://github.com/dask/dask/issues/6754

After some investigation, it seems that this is due to the {{path}} 
attribute of {{ParquetDatasetPiece}} now returning a path with a mixture of 
{{\\}} and {{/}} in it. 

It specifically happens when dask is passing a posix-style base path pointing 
to the dataset base directory (so using all {{/}}), and passing an fsspec-based 
(local) filesystem.  
From the debugging output during one of the dask tests:

{code}
(Pdb) dataset
<pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
(Pdb) dataset.paths
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
(Pdb) dataset.pieces[0].path
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
{code}

So you can see that the result here has a mix of {{\\}} and {{/}}. With 
pyarrow 1.0, the path consistently used {{/}}.

The reason for the change is that in pyarrow 2.0 we started to replace the 
fsspec LocalFileSystem with our own LocalFileSystem (assuming that the two 
should be equivalent for a local filesystem). But it seems that our own 
LocalFileSystem has a {{pathsep}} property equal to {{os.path.sep}}, which is 
{{\\}} on Windows 
(https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306).
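
To make the effect concrete, here is a minimal sketch (not the actual pyarrow 
code path). {{join_piece_path}} is a hypothetical helper and the base path is 
shortened for illustration. It uses {{ntpath.sep}}, which is what 
{{os.path.sep}} evaluates to on Windows, so it can be run on any platform.

```python
import ntpath  # Windows flavour of os.path, importable on every platform


def join_piece_path(base, parts, pathsep):
    """Hypothetical helper: append path components to `base` using `pathsep`."""
    return pathsep.join([base] + list(parts))


# Posix-style base path, as dask passes it (shortened, illustrative value).
base = "C:/data/test_partition_on_pyarrow_0"
parts = ["a1=A", "a2=X", "part.0.parquet"]

# Joining with the Windows separator leaves the "/" in the base untouched,
# so the result mixes both separators.
mixed = join_piece_path(base, parts, ntpath.sep)

# Joining with "/" keeps the path consistent.
posix = join_piece_path(base, parts, "/")

print(mixed)
print(posix)
```

Joining with {{/}} gives a path that uses a single separator throughout, 
which is the form seen with pyarrow 1.0 in this scenario.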

So note that while this started being broken in pyarrow 2.0 when using an 
fsspec filesystem, this was already "broken" before when using our own local 
filesystem (or when not passing any filesystem). But, 1) dask always passes an 
fsspec filesystem, and 2) dask uses the piece's path as a dictionary key and is 
thus especially sensitive to the change (when the path is only used to read a 
file, it will probably still work even with the mixture of path 
separators).
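
A small sketch of why the dictionary-key usage is the sensitive part. The 
paths, the {{stats}} mapping and the {{to_posix}} helper are illustrative 
assumptions, not dask's or pyarrow's actual code.

```python
import ntpath  # Windows path semantics, importable on every platform


def to_posix(path):
    """Hypothetical normalizer: replace Windows separators with '/'."""
    return path.replace(ntpath.sep, "/")


# Path with mixed separators, and the all-"/" form of the same file.
mixed = "C:/data/ds\\a1=A\\a2=X\\part.0.parquet"
expected = "C:/data/ds/a1=A/a2=X/part.0.parquet"

# A mapping keyed by the all-"/" path (stand-in for a per-piece lookup table).
stats = {expected: "metadata for part.0"}

# String keys are compared character by character, so the lookup misses.
found_raw = mixed in stats

# After normalizing the separators the lookup succeeds.
found_normalized = to_posix(mixed) in stats

# Under Windows path rules both spellings normalize to the same path, which
# is why using the mixed path only to open a file tends to keep working.
same_file = ntpath.normpath(mixed) == ntpath.normpath(expected)

print(found_raw, found_normalized, same_file)
```

Normalizing the separators before using the path as a key would be one way 
for a consumer to be robust against this, independent of a fix in pyarrow.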

  was:
Dask reported some failures starting with the pyarrow 2.0 release, and 
specifically on Windows: https://github.com/dask/dask/issues/6754

After some investigation, it seems that this is due to the 
{{ParquetDatasetPiece}} its {{path}} attribute now returning a path with a 
mixture \ or of \\  and / in it. 

It specifically happens when dask is passing a posix-style base path pointing 
to the dataset base directory (so using all {{/}}), and passing an fsspec-based 
(local) filesystem.  
From a debugging output during one of the dask tests:

{code}
(Pdb) dataset
<pyarrow.parquet.ParquetDataset object at 0x00000290D7506308>
(Pdb) dataset.paths
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0'
(Pdb) dataset.pieces[0].path
'C:/Users/joris/AppData/Local/Temp/pytest-of-joris/pytest-25/test_partition_on_pyarrow_0\\a1=A\\a2=X\\part.0.parquet'
{code}

So you can see that the result here has a mix of \\ and {{/}}. Using 
pyarrow 1.0, this was consistently using {{/}}.

The reason for the change is that in pyarrow 2.0 we started to replace fsspec 
LocalFileSystem with our own LocalFileSystem (assuming for a local filesystem 
that should be equivalent). But it seems that our own LocalFileSystem has a 
{{pathsep}}} property that equals to {{os.path.sep}}, which is \\ on 
Windows 
(https://github.com/apache/arrow/blob/9231976609d352b7050f5c706b86c15e8c604927/python/pyarrow/filesystem.py#L304-L306.

So note that while this started being broken in pyarrow 2.0 when using fsspec 
filesystem, this was already "broken" before when using our own local 
filesystem (or when not passing any filesystem). But, 1) dask always passes an 
fsspec filesystem, and 2) dask uses the piece's path as dictionary key and is 
thus especially sensitive to the change (using it as a file path to read 
something in, it will probably still work even with the mixture of path 
separators).


> [Python] ParquetDatasetPiece's path broken when using fsspec fs on Windows
> --------------------------------------------------------------------------
>
>                 Key: ARROW-10462
>                 URL: https://issues.apache.org/jira/browse/ARROW-10462
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 2.0.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
