[jira] [Commented] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

Lance Dacey (Jira) Tue, 05 Apr 2022 07:26:08 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478
 ]


Lance Dacey commented on ARROW-16077:
-------------------------------------

I am not aware of any public datasets for this. Locally, I test against 
[azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio],
 which can be installed directly or run as a Docker container.

I read parquet data from Azure with pyarrow's ds.dataset() or pq.read_table(), 
passing an adlfs filesystem. I ran a couple of tests with a double slash in the 
path. Perhaps I misunderstood the original issue, but I can read the data with 
pq.read_table(), with pandas via fs.open(), and with pandas via an abfs:// path 
plus storage_options. My quick tests are pasted below.



{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal


# Default azurite development endpoint and credentials
URL = "http://127.0.0.1:10000"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = (
    f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};"
    f"AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"
)


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(
        account_name=ACCOUNT_NAME, connection_string=CONN_STR
    )
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(
        f"abfs://{path}", storage_options={"connection_string": CONN_STR}
    )
    assert_frame_equal(df, df2)

{code}
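
For context, the error in the original report reads like a plain string-prefix 
comparison between each listed path and the base directory. The sketch below is 
only an approximation of that idea, not pyarrow's actual implementation; 
is_under_base, normalize, and is_under_base_normalized are hypothetical helper 
names. It shows why both paths from the report fail a naive prefix check and how 
normalizing slashes on both sides first would let them pass.

```python
import posixpath


def is_under_base(path: str, base_dir: str) -> bool:
    """Naive check: the listed path must start with the base dir string."""
    return path.startswith(base_dir)


def normalize(path: str) -> str:
    """Collapse repeated slashes and drop leading/trailing ones (hypothetical)."""
    return posixpath.normpath(path).strip("/")


def is_under_base_normalized(path: str, base_dir: str) -> bool:
    """Same check, but after normalizing both sides."""
    base = normalize(base_dir)
    candidate = normalize(path)
    return candidate == base or candidate.startswith(base + "/")


# The two (listed path, base dir) pairs from the error messages in the report
cases = [
    ("path/to/parquet/files/part-0001.parquet", "/path/to/parquet/files/"),
    ("/path/to/parquet/files/part-0001.parquet", "/path/to//parquet/files/"),
]

for listed, base in cases:
    print(listed, base, is_under_base(listed, base),
          is_under_base_normalized(listed, base))
```

With the naive check both pairs are rejected, first because of the missing 
leading slash and then because of the doubled slash; after normalization both 
are accepted.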




> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16077
>                 URL: https://issues.apache.org/jira/browse/ARROW-16077
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Jon Rosenberg
>            Priority: Major
>
> Reading a partitioned parquet from adlfs with pyarrow through pandas will 
> throw unnecessary exceptions on not matching forward slashes in the listed 
> files returned from adlfs, ie:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to/parquet/files/'{code}
>  
> and testing with modifying the adlfs method to prepend slashes to all 
> returned files, we still end up with an error on file paths that would 
> otherwise be handled correctly where there is a double slash in a location 
> where there should be one, ie:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, given it's discovered 
> the file part-0001.parquet but the pyarrow exception stops what could 
> otherwise be successful processing. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
