yf-yang opened a new issue, #38794:
URL: https://github.com/apache/arrow/issues/38794

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## How the Parquet dataset is created
   ``` python
   import polars as pl
   import pyarrow as pa
   import pyarrow.dataset as ds
   import s3fs
   
   s3fs = s3fs.S3FileSystem()
   df = pl.DataFrame()  # actual data omitted; presumably it has "set" and "subset" columns
   ds.write_dataset(
       df.to_arrow(),
       "s3://bucket/parquet_root",
       format='parquet', 
       filesystem=s3fs,
       partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
       existing_data_behavior='delete_matching'
   )
   ```
   
   After this write, the object `parquet_root/abc/def/part-0.parquet` exists in the bucket.
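
As a rough illustration of why the file ends up at that path: with directory partitioning, each partition field's value becomes one path segment in schema order, and the dataset writer's default file name is `part-{i}.parquet`. The helper below is hypothetical (it is not a pyarrow API) and assumes a single group of rows with `set="abc"` and `subset="def"`.

```python
def partition_path(root, values, index=0, ext="parquet"):
    """Hypothetical helper: build the path a directory-partitioned write would produce."""
    # One directory level per partition value, in schema order.
    segments = [root.rstrip("/")] + [str(v) for v in values]
    # Default file name pattern assumed here: part-{index}.{ext}
    return "/".join(segments) + f"/part-{index}.{ext}"


path = partition_path("parquet_root", ["abc", "def"])
print(path)
```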
   
   ## Trying to read the Parquet dataset
   ``` python
   import polars as pl
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   import s3fs
   
   s3fs = s3fs.S3FileSystem()
   pq.read_table('bucket/parquet_root/abc/def/part-0.parquet', filesystem=s3fs)  # ok
   
   pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet', filesystem=s3fs)  # ok
   
   pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
   # pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes
   
   # after I manually call s3fs.isdir, the behavior changes; I suspect this is another bug
   s3fs.isdir('s3://bucket/parquet_root/') # True
   
   # repeat the call
   pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
   # pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
   
   # another attempt, this time with ds.dataset: same error
   dataset = ds.dataset(
     's3://bucket/parquet_root/',
     format='parquet',
     filesystem=s3fs,
     partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
   )
   # pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
   ```
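
The second error message reads like a plain prefix mismatch: the listing returns paths without the scheme, while the base directory still carries `s3://`. The sketch below only mimics that comparison with string operations. It is an assumption about the cause, not Arrow's actual check, and `strip_scheme` is a hypothetical helper.

```python
def strip_scheme(uri, scheme="s3"):
    """Hypothetical helper: drop a leading '<scheme>://' if present."""
    prefix = f"{scheme}://"
    return uri[len(prefix):] if uri.startswith(prefix) else uri


# Values copied from the error message above.
base_dir = "s3://bucket/parquet_root/"
listed = "bucket/parquet_root/abc/def/part-0.parquet"

# With the scheme kept, the listed path does not start with the base dir.
mismatch = not listed.startswith(base_dir)

# With the scheme removed, the same path falls under the base dir.
matches_after_strip = listed.startswith(strip_scheme(base_dir))

print(mismatch, matches_after_strip)
```

If that reading is right, passing `strip_scheme('s3://bucket/parquet_root/')` to `ds.dataset(...)` together with the same `filesystem` might sidestep the error, in line with the scheme-less path that `pq.read_table` accepted above. This is untested.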
   
   ### Component(s)
   
   Python

