amoeba commented on issue #43574:
URL: https://github.com/apache/arrow/issues/43574#issuecomment-2332362113
Thanks. Some thoughts:
- `read_table` errors in your original code where `ds.dataset` does not
because (1) `read_table` defaults to Hive partitioning and `ds.dataset` doesn't,
and (2) your file contains a `source_id` field while its file path also includes
`source_id=X` as a path component. With partitioned datasets, partition
fields are usually omitted from the files themselves, and I'm not sure what the
behavior should be when the user leaves them in. The current behavior seems to be
that the reader ignores the field in the file and trusts the partition field
value from the file path.
- `ds.dataset` succeeds because it defaults to Directory partitioning, so it
ignores the Hive partition scheme in your file path entirely. You can make
the `ds.dataset` call fail by specifying Hive partitioning (though with a
slightly different error); see the sketch after this list.
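For illustration, a minimal sketch of that failing `ds.dataset` call. This assumes the directory layout from the original report (`dataset_root/source_id=9319/...`) and that the error surfaces when the data is actually scanned; the exact failure point may differ:
```python
import pyarrow.dataset as ds

# With partitioning="hive", a source_id field inferred from the path
# can clash with the source_id column stored inside the file
dataset = ds.dataset("dataset_root/", format="parquet", partitioning="hive")
table = dataset.to_table()  # expected to error on the schema conflict
```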
You have a few workarounds:
1. Remove the `source_id` field from your Parquet files. This is what I
would do (a sketch follows the list).
2. Manually specify a schema,
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Explicitly declare source_id as a string and pass the schema in
schm = pa.schema([pa.field("source_id", pa.string())])
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
              schema=schm)
```
3. Manually specify `partitioning=None` (also sketched below).
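For reference, hedged sketches of (1) and (3), reusing the file path from above.

For (1), rewriting a file without the redundant column (assuming `Table.drop_columns` from a recent `pyarrow`):
```python
import pyarrow.parquet as pq

path = "dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet"

# Read the file without applying any partitioning scheme, then rewrite
# it without the source_id column (the path already carries that value)
table = pq.read_table(path, partitioning=None)
pq.write_table(table.drop_columns(["source_id"]), path)
```

For (3), disabling partition discovery at read time:
```python
import pyarrow.parquet as pq

# partitioning=None stops read_table from deriving a source_id field
# from the Hive-style path, so only the column in the file is used
pq.read_table("dataset_root/source_id=9319/li191r_9319_2023-01-02.parquet",
              partitioning=None)
```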
Is there a reason why (1) might not work for you?