Aaron Philip created ARROW-16564:
------------------------------------
Summary: Add option to have dataset infer the parquet schema from
the last file instead of first.
Key: ARROW-16564
URL: https://issues.apache.org/jira/browse/ARROW-16564
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 8.0.0
Reporter: Aaron Philip
According to
[https://arrow.apache.org/docs/python/dataset.html#dataset-discovery], a dataset
infers the schema for Parquet files from the first file in the path.
I have a situation where a column was added to the schema after a certain date.
As a result, when I read the Parquet files in this path, the new column is
dropped because it is not part of the schema of the first file in that path.
I would like an option to infer the schema from the last file in the path
instead, to avoid this issue.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)