[
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478
]
Lance Dacey commented on ARROW-16077:
-------------------------------------
I am not aware of any public datasets for this. Locally, I use
[azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio]
for testing, which can be installed directly or run as a Docker container.
I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet
data from Azure. I ran a couple of tests with double slashes in the path.
Perhaps I misunderstood the original issue, but it looks like I can read the
data with pq.read_table(), and with pandas using either fs.open() or
storage_options. My quick tests are pasted below.
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal

URL = "http://127.0.0.1:10000"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = (
    f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};"
    f"AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"
)


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(
        f"abfs://{path}", storage_options={"connection_string": CONN_STR}
    )
    assert_frame_equal(df, df2)
{code}
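For reference, here is a rough sketch of how I read the two error messages in the issue description. It is not pyarrow's actual implementation. The helper names is_within_base_dir() and normalize() are hypothetical, and the literal prefix comparison is only my assumption about why a listed path could be reported as "outside base dir". It just shows that both reported path pairs fail a literal comparison and match once leading and repeated slashes are normalized.

```python
import posixpath


def is_within_base_dir(path: str, base_dir: str) -> bool:
    """Hypothetical check: is `path` under `base_dir` by literal string prefix?"""
    base = base_dir if base_dir.endswith("/") else base_dir + "/"
    return path.startswith(base)


def normalize(path: str) -> str:
    """Hypothetical helper: one leading slash, repeated slashes collapsed."""
    return posixpath.normpath("/" + path.lstrip("/"))


# Path pairs taken from the two error messages in the issue description.
listed = "path/to/parquet/files/part-0001.parquet"
base_single_slash = "/path/to/parquet/files/"
base_double_slash = "/path/to//parquet/files/"

# Compared literally, both cases look like the file is outside the base dir.
assert not is_within_base_dir(listed, base_single_slash)
assert not is_within_base_dir("/" + listed, base_double_slash)

# After normalizing both sides, the listed file falls under the base dir.
assert is_within_base_dir(normalize(listed), normalize(base_single_slash))
assert is_within_base_dir(normalize(listed), normalize(base_double_slash))
```

This only illustrates the string comparison. It says nothing about where in pyarrow or adlfs such normalization would belong.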
> [Python] ArrowInvalid error on reading partitioned parquet files with
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16077
> URL: https://issues.apache.org/jira/browse/ARROW-16077
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Jon Rosenberg
> Priority: Major
>
> Reading a partitioned parquet from adlfs with pyarrow through pandas will
> throw unnecessary exceptions on not matching forward slashes in the listed
> files returned from adlfs, ie:
>
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir
> '/path/to/parquet/files/'{code}
>
> and testing with modifying the adlfs method to prepend slashes to all
> returned files, we still end up with an error on file paths that would
> otherwise be handled correctly where there is a double slash in a location
> where there should be one, ie:
>
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, given it's discovered
> the file part-0001.parquet but the pyarrow exception stops what could
> otherwise be successful processing.