[ https://issues.apache.org/jira/browse/ARROW-16564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536475#comment-17536475 ]
Alenka Frim commented on ARROW-16564:
-------------------------------------
You can also use {{pq.read_schema}} to read the schema from a selected file
instead of defining it manually. Following the previous example:
{code:python}
>>> schema_new = pq.read_schema(base / "parquet_dataset/data2.parquet")
>>> schema_new
a: int64
b: double
c: int64
>>> ds.dataset(base / "parquet_dataset", format="parquet",
...            schema=schema_new).to_table()
pyarrow.Table
a: int64
b: double
c: int64
----
a: [[0,1,2,3,4],[5,6,7,8,9]]
b: [[0.6874650721200516,0.9966515452505028,-1.0808214751696879,-0.9947358097037932,-1.241984930419355],[0.03802943296976132,0.6485772781216572,-0.21611062870855477,-0.6399976359764785,0.6034991641788295]]
c: [[1,2,1,2,1],[2,1,2,1,2]]
{code}
> [Python] Add option to have dataset infer the parquet schema from the last
> file
> -------------------------------------------------------------------------------
>
> Key: ARROW-16564
> URL: https://issues.apache.org/jira/browse/ARROW-16564
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0, 8.0.0
> Reporter: Aaron Philip
> Priority: Minor
>
> According to
> [https://arrow.apache.org/docs/python/dataset.html#dataset-discovery],
> dataset will infer the schema for parquet based on the first file in the path.
> I have a situation where a column was added to the schema after a certain
> date. As a result, when I try to read the parquet in this path, the new
> column is ignored because it is not part of the schema of the first file in
> that path.
> I would like the option to infer the schema based on the last file in the
> path to avoid this issue.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)