[ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517444#comment-17517444
 ] 

Joris Van den Bossche commented on ARROW-16077:
-----------------------------------------------

[~jon-rosenberg-env] thanks for the report. That looks like an annoying issue! 

I am not very familiar with ADL myself and don't have access to it for testing 
(do they have public datasets that can be used without an account, the way you 
can have public S3 buckets?), so I can't directly help with diagnosing this 
issue. But a few questions:

Can you try passing an {{adlfs}} filesystem object manually? Something like

{code}
import adlfs
import pyarrow.parquet as pq

adl = adlfs.AzureDatalakeFileSystem(...)
pq.read_table("...", filesystem=adl)
{code}
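
To check whether the mismatch is only about slashes, a small sketch like the following might help. The helper names are made up for illustration, and the plain prefix check is only my assumption about what the validation roughly does, not pyarrow's actual code:

```python
import posixpath


def normalize(path: str) -> str:
    """Return the path with one leading slash and duplicate slashes collapsed."""
    # Strip leading slashes first: posixpath.normpath keeps a leading "//" as-is.
    return posixpath.normpath("/" + path.lstrip("/"))


def is_under_base(path: str, base: str) -> bool:
    """Hypothetical stand-in for the failing check: a plain string-prefix test."""
    return path.startswith(base)


# Paths taken from the first error message in the report.
base = "/path/to/parquet/files/"
listed = "path/to/parquet/files/part-0001.parquet"

print(is_under_base(listed, base))  # False: the leading slash is missing
print(is_under_base(normalize(listed), normalize(base) + "/"))  # True
```

If the paths only agree after normalizing both sides like this, that would suggest the listing itself is fine and only the path comparison is too strict.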

We have had previous reports related to Azure Data Lake, so while there have 
been issues before, that also indicates it was at least possible to read from 
it to a certain extent. cc [~ldacey] did you ever run into this specific 
issue?



> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16077
>                 URL: https://issues.apache.org/jira/browse/ARROW-16077
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Jon Rosenberg
>            Priority: Major
>
> Reading a partitioned parquet dataset from adlfs with pyarrow through pandas 
> raises an exception when the file paths listed by adlfs do not match the 
> base directory in their forward slashes, i.e.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to/parquet/files/'{code}
>  
> and after modifying the adlfs method to prepend slashes to all returned 
> files, we still get an error for file paths that would otherwise be handled 
> correctly, where there is a double slash in a location where there should be 
> a single one, i.e.:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, given it's discovered 
> the file part-0001.parquet but the pyarrow exception stops what could 
> otherwise be successful processing. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
