[jira] [Commented] (ARROW-16564) [Python] Add option to have dataset infer the parquet schema from the last file

Alenka Frim (Jira) Fri, 13 May 2022 00:17:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536475#comment-17536475
 ]


Alenka Frim commented on ARROW-16564:
-------------------------------------

You can also use {{pq.read_schema}} to read the schema from a selected file 
instead of defining it manually. Following the previous example:
{code:python}
>>> schema_new = pq.read_schema(base / "parquet_dataset/data2.parquet")
>>> schema_new
a: int64
b: double
c: int64

>>> ds.dataset(base / "parquet_dataset", format="parquet",
...            schema=schema_new).to_table()
pyarrow.Table
a: int64
b: double
c: int64
----
a: [[0,1,2,3,4],[5,6,7,8,9]]
b: [[0.6874650721200516,0.9966515452505028,-1.0808214751696879,-0.9947358097037932,-1.241984930419355],[0.03802943296976132,0.6485772781216572,-0.21611062870855477,-0.6399976359764785,0.6034991641788295]]
c: [[1,2,1,2,1],[2,1,2,1,2]]
{code}

> [Python] Add option to have dataset infer the parquet schema from the last 
> file
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-16564
>                 URL: https://issues.apache.org/jira/browse/ARROW-16564
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 7.0.0, 8.0.0
>            Reporter: Aaron Philip
>            Priority: Minor
>
> According to 
> [https://arrow.apache.org/docs/python/dataset.html#dataset-discovery], 
> dataset will infer the schema for parquet based on the first file in the path.
> I have a situation where a column was added to the schema after a certain 
> date. As a result, when I try to read the parquet in this path, the new 
> column is ignored because it is not part of the schema of the first file in 
> that path.
> I would like the option to infer the schema based on the last file in the 
> path to avoid this issue. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
