yf-yang opened a new issue, #38794:
URL: https://github.com/apache/arrow/issues/38794
### Describe the bug, including details regarding any error messages, version, and platform.
## How the Parquet dataset is created
``` python
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

s3fs = s3fs.S3FileSystem()  # note: rebinds the module name to the filesystem instance
df = pl.DataFrame()  # placeholder; the real frame needs "set" and "subset" columns
ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.partitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
    existing_data_behavior='delete_matching',
)
```
After the write, the path `parquet_root/abc/def/part-0.parquet` exists in the
bucket.
## Trying to read the Parquet dataset
``` python
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs
s3fs = s3fs.S3FileSystem()
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs)
# ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs)
# ok
pq.ParquetDataset(
    's3://bucket/parquet_root/',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source
# 's3://bucket/parquet_root/': Parquet file size is 0 bytes
# After I manually call s3fs.isdir, the behavior changes; I suspect this is
# another bug.
s3fs.isdir('s3://bucket/parquet_root/')  # True
# Repeat the call
pq.ParquetDataset(
    's3://bucket/parquet_root/',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
# 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir
# 's3://bucket/parquet_root/'
# Another try via ds.dataset gives the same error
dataset = ds.dataset(
    's3://bucket/parquet_root/',
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(
        pa.schema([("set", pa.string()), ("subset", pa.string())])
    ),
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path
# 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir
# 's3://bucket/parquet_root/'
```
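The last two errors suggest a mismatch between the scheme-prefixed base directory (`s3://bucket/...`) and the scheme-less paths (`bucket/...`) that come back from the filesystem listing. The sketch below only illustrates that string mismatch with a plain prefix comparison; it is not Arrow's actual check. `strip_s3_scheme` and `is_under_base_dir` are hypothetical helpers introduced here, not pyarrow or s3fs APIs.

```python
def strip_s3_scheme(path: str) -> str:
    """Return `path` without a leading 's3://' (hypothetical helper)."""
    prefix = "s3://"
    return path[len(prefix):] if path.startswith(prefix) else path


def is_under_base_dir(path: str, base_dir: str) -> bool:
    """Plain string-prefix test; an assumption used only for illustration."""
    return path.startswith(base_dir.rstrip("/") + "/")


# Values copied from the error message above.
base_dir = "s3://bucket/parquet_root/"
listed = "bucket/parquet_root/abc/def/part-0.parquet"

print(is_under_base_dir(listed, base_dir))                   # False
print(is_under_base_dir(listed, strip_s3_scheme(base_dir)))  # True
```

If this reading is right, passing the scheme-less form (`'bucket/parquet_root/'`) to `ds.dataset` or `pq.ParquetDataset` together with `filesystem=s3fs` might avoid the error, in the same way the scheme-less `pq.read_table` call works. That is an untested assumption, not a confirmed workaround.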
### Component(s)
Python