amoeba commented on issue #43574:
URL: https://github.com/apache/arrow/issues/43574#issuecomment-2332362113

   Thanks. Some thoughts:
   
   - `read_table` errors in your original code where `ds.dataset` does not because (1) `read_table` defaults to Hive partitioning while `ds.dataset` does not, and (2) your file contains a `source_id` field and its file path also includes `source_id=X` as a path component. With partitioned datasets, partition fields are usually omitted from the files themselves, and I'm not sure what the behavior should be if the user leaves them in. The current behavior seems to be that the reader ignores the field in the file and trusts the partition field value in the file path.
   - `ds.dataset` succeeds because it defaults to Directory partitioning, so it completely ignores the Hive partition scheme in your file path. You can make the `ds.dataset` call fail by specifying Hive partitioning (though with a slightly different error); see the sketch below.
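   
   As a quick sketch of that second point, assuming the same `dataset_root/` layout as in your example (the exact error message may vary by pyarrow version):
   
   ```python
   import pyarrow.dataset as ds
   
   # Default (Directory) partitioning: the `source_id=9319` path segment
   # is not parsed as a Hive partition, so no conflict arises
   dataset = ds.dataset("dataset_root/")
   
   # Explicit Hive partitioning: the `source_id` partition field now
   # collides with the `source_id` column stored inside the file, so
   # this fails (with a slightly different error than `pq.read_table`)
   dataset = ds.dataset("dataset_root/", partitioning="hive")
   ```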
   
   You have a few workarounds:
   
   1. Remove the `source_id` field from your Parquet files. This is what I 
would do.
   2. Manually specify a schema:
       ```python
       import pyarrow as pa
       import pyarrow.parquet as pq
       
       schm = pa.schema([pa.field("source_id", pa.string())])
       pq.read_table(
           "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
           schema=schm,
       )
       ```
   3. Manually specify `partitioning=None`.
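       This passes the `partitioning` keyword explicitly to disable the default Hive handling; same file path as in the previous workaround:
       ```python
       import pyarrow.parquet as pq
       
       # Disable partition discovery so the `source_id=9319` path
       # segment is not interpreted as a Hive partition field
       pq.read_table(
           "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
           partitioning=None,
       )
       ```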
   
   Is there a reason why (1) might not work for you?

