[
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517741#comment-17517741
]
Jon Rosenberg commented on ARROW-16077:
---------------------------------------
OK, I had to do some separate testing since Azurite emulates Blob Storage and
not ADL, but it does seem that the two behave differently.
It appears that in blob storage
{code:java}
resource/path/to//parquet/files {code}
is a valid and distinct path from
{code:java}
resource/path/to/parquet/files {code}
Changing the write in your test to use a path with a single slash, while
keeping the double slash in the read tests, caused a failure for me, but it
appeared to be due to reading an empty location.
In the data lake, however, any path with a double slash is interpreted the same
as the single-slash path, which is where my error comes from. Unfortunately I
still don't have a public data lake path, but I will look for a way to
reproduce this.
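To illustrate the mismatch, here is a minimal sketch using the paths from the
error message in the issue. Both helpers (collapse_slashes and is_under_base)
are hypothetical and are not pyarrow or adlfs APIs; the prefix check is only an
assumed stand-in for whatever comparison produces the "outside base dir" error.

```python
import re


def collapse_slashes(path: str) -> str:
    """Collapse runs of '/' into a single '/'.

    Hypothetical helper, mimicking how the data lake appears to treat paths.
    """
    return re.sub(r"/{2,}", "/", path)


def is_under_base(path: str, base_dir: str) -> bool:
    """Plain string-prefix check (assumed stand-in, not pyarrow's actual code)."""
    base = base_dir.rstrip("/") + "/"
    return path.startswith(base)


# Base dir as requested by the caller, and a file path as listed by the store.
requested = "/path/to//parquet/files/"
listed = "/path/to/parquet/files/part-0001.parquet"

# The raw comparison fails because of the extra slash in the base dir.
print(is_under_base(listed, requested))

# After collapsing repeated slashes in the base dir, the listed file matches.
print(is_under_base(listed, collapse_slashes(requested)))
```

Note that collapsing slashes like this would presumably only be safe for the
data lake. In blob storage the double-slash path appears to be a distinct
location, so the same normalization could point at the wrong place.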
> [Python] ArrowInvalid error on reading partitioned parquet files with
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16077
> URL: https://issues.apache.org/jira/browse/ARROW-16077
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Jon Rosenberg
> Priority: Major
>
> Reading a partitioned parquet dataset from adlfs with pyarrow through pandas
> throws unnecessary exceptions when the forward slashes in the file paths
> listed by adlfs do not match the base directory, i.e.:
>
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in an exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir
> '/path/to/parquet/files/'{code}
>
> When testing with the adlfs method modified to prepend slashes to all
> returned files, we still get an error on file paths that would otherwise be
> handled correctly, where there is a double slash in a location where there
> should be one, i.e.:
>
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, since it has
> discovered the file part-0001.parquet, but the pyarrow exception stops what
> could otherwise be successful processing.