[jira] [Created] (ARROW-16564) Add option to have dataset infer the parquet schema from the last file instead of first.

Aaron Philip (Jira) Thu, 12 May 2022 16:22:05 -0700

Aaron Philip created ARROW-16564:
------------------------------------

             Summary: Add option to have dataset infer the parquet schema from 
the last file instead of first.
                 Key: ARROW-16564
                 URL: https://issues.apache.org/jira/browse/ARROW-16564
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 8.0.0
            Reporter: Aaron Philip



According to 
[https://arrow.apache.org/docs/python/dataset.html#dataset-discovery], a 
dataset infers the schema of Parquet data from the first file in the path.

I have a situation where a column was added to the schema after a certain date. 
As a result, when I read the Parquet files in this path, the new column is 
dropped because it is not part of the schema of the first file in that path.

I would like the option to infer the schema based on the last file in the path 
to avoid this issue. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
