Re: [I] [Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix [arrow]

via GitHub Wed, 24 Jan 2024 07:07:53 -0800


yf-yang commented on issue #38794:
URL: https://github.com/apache/arrow/issues/38794#issuecomment-1908322370


   Reopening because this still does not work.
   
   To reproduce:
   ``` python
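   import pyarrow as pa
   import pyarrow.dataset as ds

   # `s3fs` is the filesystem object used by the reporter; its construction is not shown.
   dataset = ds.dataset(
     'bucket/parquet_root/',  # with slash at the end
     format='parquet',
     filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(
       pa.schema([("set", pa.string()), ("subset", pa.string())])
     ),
   )
   dataset.head(1)  # OSError: Not a regular file: 'bucket/parquet_root/'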
   ```
   
   ``` python
   dataset = ds.dataset(
     'bucket/parquet_root',  # slash removed
     format='parquet',
     filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(
       pa.schema([("set", pa.string()), ("subset", pa.string())])
     ),
   )
   # pyarrow.lib.ArrowInvalid: Could not open Parquet input source
   # 'bucket/parquet_root': Parquet file size is 0 bytes
   dataset.head(1)
   ```
   Reading a single file works:
   ``` python
   dataset = ds.dataset(
     'bucket/parquet_root/abc/def/part-0.parquet',  # path to one file
     format='parquet',
     filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(
       pa.schema([("set", pa.string()), ("subset", pa.string())])
     ),
   )
   dataset.head(1)  # succeeds
   ```
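
For reference, `DirectoryPartitioning` assigns each directory level below the dataset root to one schema field, in order. The sketch below imitates that mapping in plain Python so the expected values for the path above are easy to see. It is not pyarrow's implementation; `parse_directory_partition` and its parameters are hypothetical names chosen for illustration.

``` python
from typing import Dict, List


def parse_directory_partition(
    path: str, base_dir: str, field_names: List[str]
) -> Dict[str, str]:
    """Map the directory levels between base_dir and the file to field names."""
    base = base_dir.strip("/")
    relative = path.strip("/")
    if relative != base and not relative.startswith(base + "/"):
        raise ValueError(f"{path!r} is not under {base_dir!r}")
    # Drop the base directory and the file name; what remains are directories.
    segments = relative[len(base):].strip("/").split("/")[:-1]
    # Directory levels are matched to fields by position.
    return dict(zip(field_names, segments))


print(
    parse_directory_partition(
        "bucket/parquet_root/abc/def/part-0.parquet",
        "bucket/parquet_root",
        ["set", "subset"],
    )
)
```

With `bucket/parquet_root` as the root, the file above would carry `set="abc"` and `subset="def"`, which is what the directory-level reads are expected to produce once the root is recognized as a directory.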
   
   As the [doc of dataset](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow-dataset-dataset) says:
   <img width="797" alt="image" src="https://github.com/apache/arrow/assets/36890796/b2650b76-d780-4048-818b-4350ceafe3d9">
   
   When calling `s3.get_file_info('bucket/parquet_root')`, it returns:
   ``` python
   <FileInfo for 'bucket/parquet_root': type=FileType.File, size=0>
   ```
   
   I think that is the cause: `get_file_info` reports the directory on S3 as `FileType.File`, not `FileType.Directory`.
   Related code:
   https://github.com/apache/arrow/blob/df83e50cdbc956846476a1dbcd5f09ef7058ed58/cpp/src/arrow/filesystem/s3fs.cc#L2644
   https://github.com/apache/arrow/blob/df83e50cdbc956846476a1dbcd5f09ef7058ed58/cpp/src/arrow/filesystem/s3fs.cc#L1746-1750
   
   If I understand correctly, the S3 filesystem implementation always treats directories as files. 
I also tried calling `headObject` through `boto3`, and the `contentType` it returns 
correctly indicates that the object is a folder. Should the S3 filesystem implementation be 
improved?
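
To make the question concrete, here is a minimal sketch of one way a path could be classified so that a zero-byte folder marker is not mistaken for a regular file. It is a plain-Python model and not Arrow's code: the `objects` mapping stands in for what the S3 API would report, `classify_path` is a hypothetical helper, and the `application/x-directory` content type is an assumption about how the marker is tagged (the value returned in my case is not reproduced here).

``` python
from typing import Dict, Tuple

# key -> (size in bytes, content type); a stand-in for S3 object metadata.
# Keys use the same 'bucket/key' form as the paths in this report.
Objects = Dict[str, Tuple[int, str]]

# Assumption: folder markers are tagged with this content type.
DIRECTORY_CONTENT_TYPE = "application/x-directory"


def classify_path(objects: Objects, path: str) -> str:
    """Return 'directory', 'file', or 'not_found' for the given path."""
    key = path.rstrip("/")
    prefix = key + "/"
    # Any object stored at or below the prefix makes the path act as a directory.
    if any(k.startswith(prefix) for k in objects):
        return "directory"
    if key in objects:
        size, content_type = objects[key]
        # An empty object tagged as a folder is a directory marker, not a file.
        if size == 0 and content_type.startswith(DIRECTORY_CONTENT_TYPE):
            return "directory"
        return "file"
    return "not_found"


objects = {
    "bucket/parquet_root": (0, "application/x-directory"),
    "bucket/parquet_root/abc/def/part-0.parquet": (1024, "binary/octet-stream"),
}
print(classify_path(objects, "bucket/parquet_root"))
print(classify_path(objects, "bucket/parquet_root/abc/def/part-0.parquet"))
```

Under these assumptions, `bucket/parquet_root` is classified as a directory both with and without the trailing slash, while the Parquet file is still classified as a file.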


