[jira] [Commented] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

Jon Rosenberg (Jira) Tue, 05 Apr 2022 15:58:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517741#comment-17517741
 ]


Jon Rosenberg commented on ARROW-16077:
---------------------------------------

OK, I had to do some separate testing since Azurite emulates Blob Storage and 
not ADL, but there does seem to be a difference in how the two behave.

It appears that in blob storage
{code:java}
resource/path/to//parquet/files  {code}
is a valid and distinct path from
{code:java}
resource/path/to/parquet/files  {code}
When I changed your test to write to the single-slash path while keeping the 
double slash in the read tests, the read failed for me, but the failure 
appeared to come from reading an empty location.

In the data lake, however, any double-slash path is interpreted the same as 
the single-slash path, which is where my error comes from. Unfortunately I 
still don't have a public data lake path, but I will look around for one to 
use in a reproduction.
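
To make the difference concrete, here is a minimal sketch of the two behaviours 
as I understand them from my testing. The functions blob_key and datalake_key 
are hypothetical models written for illustration, not part of adlfs, Azurite, 
or any Azure SDK.

```python
import re


def blob_key(path: str) -> str:
    """Model of a flat object store: the key is used verbatim, so '//' matters."""
    return path


def datalake_key(path: str) -> str:
    """Model of a hierarchical store: runs of '/' collapse to one separator."""
    return re.sub(r"/{2,}", "/", path)


written = "resource/path/to/parquet/files"
requested = "resource/path/to//parquet/files"

# In the flat model the two strings are different keys, so the read finds nothing.
blob_match = blob_key(requested) == blob_key(written)

# In the hierarchical model both strings resolve to the same directory.
datalake_match = datalake_key(requested) == datalake_key(written)

print(blob_match, datalake_match)
```

Under this model the blob read misses the data entirely, while the data lake 
read lists the files under the single-slash form of the path.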

> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16077
>                 URL: https://issues.apache.org/jira/browse/ARROW-16077
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Jon Rosenberg
>            Priority: Major
>
> Reading a partitioned parquet dataset from adlfs with pyarrow through pandas 
> throws unnecessary exceptions when the forward slashes in the file listing 
> returned by adlfs do not match those in the base directory, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to/parquet/files/'{code}
>  
> After modifying the adlfs method to prepend a slash to every returned file, 
> we still end up with an error on file paths that would otherwise be handled 
> correctly when there is a double slash in a location where there should be 
> a single one, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to//parquet/files/' {code}
> In both cases the ls returned correctly from adlfs, given that it discovered 
> the file part-0001.parquet, but the pyarrow exception stops what could 
> otherwise be successful processing. 
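
The two failures quoted above can be reproduced with a plain string-prefix 
comparison. The sketch below assumes the check behaves like a simple 
startswith test, which is only inferred from the error message and is not 
pyarrow's actual implementation. Both is_under_base_dir and normalize are 
hypothetical helpers.

```python
def is_under_base_dir(listed: str, base_dir: str) -> bool:
    """Hypothetical stand-in for the check behind the 'outside base dir' error."""
    return listed.startswith(base_dir)


def normalize(path: str) -> str:
    """Collapse repeated slashes and force exactly one leading slash."""
    return "/" + "/".join(part for part in path.split("/") if part)


# Case 1: the listing comes back without the leading slash.
listed_1 = "path/to/parquet/files/part-0001.parquet"
base_1 = "/path/to/parquet/files/"

# Case 2: the base dir keeps the double slash that the listing has collapsed.
listed_2 = "/path/to/parquet/files/part-0001.parquet"
base_2 = "/path/to//parquet/files/"

for listed, base in [(listed_1, base_1), (listed_2, base_2)]:
    raw = is_under_base_dir(listed, base)
    # Normalize both sides, then restore the trailing slash on the base dir.
    fixed = is_under_base_dir(normalize(listed), normalize(base) + "/")
    print(raw, fixed)
```

Both raw comparisons fail and both normalized comparisons succeed. Collapsing 
slashes like this is only safe for a store that treats double and single 
slashes as the same path, which is what I see in the data lake but not in 
blob storage.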



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
