yf-yang opened a new issue, #38794: URL: https://github.com/apache/arrow/issues/38794
### Describe the bug, including details regarding any error messages, version, and platform.

## How the parquet is created:

``` python
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

s3fs = s3fs.S3FileSystem()
df = pl.DataFrame()

ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
    existing_data_behavior='delete_matching'
)
```

After this call, the path `parquet_root/abc/def/part-0.parquet` exists in the bucket.

## Try to access the parquet

``` python
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs

s3fs = s3fs.S3FileSystem()

# Reading the single file works, with or without the scheme.
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet', filesystem=s3fs)  # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet', filesystem=s3fs)  # ok

# Opening the dataset root fails.
pq.ParquetDataset(
    's3://bucket/parquet_root/',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes

# After I manually call s3fs.isdir, the behavior changes. I suspect this is another bug.
s3fs.isdir('s3://bucket/parquet_root/')  # True

# Repeat the call.
pq.ParquetDataset(
    's3://bucket/parquet_root/',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

# Another try with ds.dataset gives the same error.
dataset = ds.dataset(
    's3://bucket/parquet_root/',
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
```

### Component(s)

Python
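
One possible reading of the second error message is that the base directory keeps its `s3://` scheme while the listed file paths come back without it, so a prefix comparison between the two fails. The sketch below only illustrates that mismatch with plain strings. `strip_scheme` is a hypothetical helper, not a pyarrow or s3fs function, and the plain `startswith` check is an assumption about how the "outside base dir" test behaves, not a copy of Arrow's implementation.

``` python
def strip_scheme(path: str) -> str:
    """Hypothetical helper: drop a leading '<scheme>://' from a path, if present."""
    _scheme, sep, rest = path.partition("://")
    return rest if sep else path


# Strings taken from the error message above.
base_dir = "s3://bucket/parquet_root/"
yielded = "bucket/parquet_root/abc/def/part-0.parquet"

# Assumed check: a listed path must start with the base directory.
print(yielded.startswith(base_dir))                # scheme-prefixed base dir
print(yielded.startswith(strip_scheme(base_dir)))  # scheme-less base dir
```

If this reading is right, passing the scheme-less path (`'bucket/parquet_root/'`) together with `filesystem=s3fs` may avoid the error, in the same way the scheme-less `pq.read_table` call works. This has not been verified here.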