[ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517534#comment-17517534
 ] 

Jon Rosenberg commented on ARROW-16077:
---------------------------------------

I'm not sure about public paths. I'll see if I can get something more specific 
to running inside the azurite image later today. For now, I can see that the 
test code here differs slightly from my usage in two ways:

1. I'm just specifying the full datalake path in my pandas read, without 
passing a filesystem or storage options. With my Azure credentials set in 
environment variables and the scheme of the passed URL, pandas has no trouble 
connecting to the lake. My hunch is that this usage detail shouldn't affect my 
issue, but I'll verify when testing later.


2. I'm passing in the path to the directory of partitioned files, not any 
single file. That is, instead of
{code:java}
abfs://resource/path/to//parquet/files/part-001.parquet{code}
I would be passing
{code:java}
abfs://resource/path/to//parquet/files {code}
which requires an ls from adlfs to find the parquet files to concatenate. The 
ls succeeds and returns the list of files, EXCEPT that the double slash is not 
included in the paths adlfs returns for the files in the directory, giving:
{code:java}
resource/path/to/parquet/files/part-001.parquet {code}
NOT
{code:java}
resource/path/to//parquet/files/part-001.parquet {code}
and thus PyArrow was throwing an exception for me because the listed path is 
outside the base dir
{code:java}
resource/path/to//parquet/files {code}
even though the read could otherwise proceed if not for this check.
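
To illustrate the mismatch described in point 2, here is a minimal sketch in 
plain Python. It does not call PyArrow or adlfs. The helper names 
`is_under_base` and `collapse_slashes` are hypothetical, and the string-prefix 
comparison is only an assumption about the kind of containment check that 
fails, not PyArrow's actual implementation.

```python
import re


def is_under_base(listed_path: str, base_dir: str) -> bool:
    """Naive prefix containment check (an assumption, not PyArrow's code)."""
    base = base_dir.rstrip("/") + "/"
    return listed_path.startswith(base)


def collapse_slashes(path: str) -> str:
    """Collapse runs of '/' into one, as the adlfs listing appears to do."""
    return re.sub(r"/{2,}", "/", path)


# Base directory as passed by the caller (scheme removed), with the double slash.
base_dir = "resource/path/to//parquet/files"
# Path as returned by the listing, with the double slash collapsed.
listed = "resource/path/to/parquet/files/part-001.parquet"

print(is_under_base(listed, base_dir))                    # False
print(is_under_base(listed, collapse_slashes(base_dir)))  # True
```

Under these assumptions, collapsing the repeated slash in the base path before 
the comparison makes the two paths agree. Note that `collapse_slashes` is only 
meant for the scheme-less path. Applied to a full URL it would also collapse 
the `//` after `abfs:`.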

> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16077
>                 URL: https://issues.apache.org/jira/browse/ARROW-16077
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Jon Rosenberg
>            Priority: Major
>
> Reading a partitioned parquet dataset from adlfs with pyarrow through pandas 
> throws unnecessary exceptions when the forward slashes in the file paths 
> listed by adlfs do not match those in the base path, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in an exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to/parquet/files/'{code}
>  
> and, after modifying the adlfs method to prepend a slash to every returned 
> file path, we still get an error on file paths that would otherwise be 
> handled correctly when there is a double slash in a location where there 
> should be one, e.g.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, given it's discovered 
> the file part-0001.parquet but the pyarrow exception stops what could 
> otherwise be successful processing. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
