yf-yang commented on issue #38794:
URL: https://github.com/apache/arrow/issues/38794#issuecomment-1908322370
Reopening because this still does not work.
To reproduce:
``` python
dataset = ds.dataset(
'bucket/parquet_root/', # with slash at the end
format='parquet',
filesystem=s3fs,
partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()),
("subset", pa.string())]))
)
dataset.head(1) # OSError: Not a regular file: 'bucket/parquet_root/'
```
``` python
dataset = ds.dataset(
'bucket/parquet_root', # remove slash
format='parquet',
filesystem=s3fs,
partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()),
("subset", pa.string())]))
)
dataset.head(1) # pyarrow.lib.ArrowInvalid: Could not open Parquet input
# source 'bucket/parquet_root': Parquet file size is 0 bytes
```
Reading a single file works:
``` python
dataset = ds.dataset(
'bucket/parquet_root/abc/def/part-0.parquet',
format='parquet',
filesystem=s3fs,
partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()),
("subset", pa.string())]))
)
dataset.head(1)
```
As the [doc of
dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow-dataset-dataset)
says:
<img width="797" alt="image"
src="https://github.com/apache/arrow/assets/36890796/b2650b76-d780-4048-818b-4350ceafe3d9">
When calling `s3.get_file_info('bucket/parquet_root')`, it returns:
``` python
<FileInfo for 'bucket/parquet_root': type=FileType.File, size=0>
```
I think that is the reason: the directory in S3 is reported as `File` type, not
`Directory` type.
Related code:
https://github.com/apache/arrow/blob/df83e50cdbc956846476a1dbcd5f09ef7058ed58/cpp/src/arrow/filesystem/s3fs.cc#L2644
https://github.com/apache/arrow/blob/df83e50cdbc956846476a1dbcd5f09ef7058ed58/cpp/src/arrow/filesystem/s3fs.cc#L1746-1750
If I understand correctly, all directories in S3 are always treated as files.
However, I did try calling `headObject` through `boto3`, and its `contentType`
correctly shows that the object is a folder. Should the S3 filesystem
implementation be improved?
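The `boto3` check mentioned above can be sketched roughly as follows. In `boto3` the call is spelled `head_object`, and the response is a dict with `ContentLength` and `ContentType` keys. The rule used here (a key ending in `/`, or a zero-byte object whose content type starts with `application/x-directory`) is an assumption about how directory markers are commonly written, not a statement of what Arrow does, and both function names are hypothetical.

```python
def looks_like_directory_marker(key, head):
    """Heuristic check on a head_object response dict.

    Assumption: a directory marker is either a key ending in '/', or a
    zero-byte object whose content type starts with 'application/x-directory'.
    """
    if key.endswith("/"):
        return True
    size = head.get("ContentLength", -1)
    content_type = head.get("ContentType", "")
    return size == 0 and content_type.startswith("application/x-directory")


def head_and_classify(bucket, key):
    """Hypothetical helper: fetch object metadata and apply the heuristic."""
    # Imported lazily so the pure check above can be used without boto3.
    import boto3

    head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    return looks_like_directory_marker(key, head)
```

For the path in this report, that would be `head_and_classify('bucket', 'parquet_root')`. If it returns `True` while `get_file_info` still reports `FileType.File`, the two are disagreeing about the same object.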