[jira] [Commented] (ARROW-16564) [Python] Add option to have dataset infer the parquet schema from the last file

Aaron Philip (Jira) Fri, 13 May 2022 08:51:05 -0700


    [ https://issues.apache.org/jira/browse/ARROW-16564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536738#comment-17536738 ]


Aaron Philip commented on ARROW-16564:
--------------------------------------

If I pass the new schema explicitly, is there any way for PyArrow to ignore
files that don't conform to that schema?

> [Python] Add option to have dataset infer the parquet schema from the last 
> file
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-16564
>                 URL: https://issues.apache.org/jira/browse/ARROW-16564
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 7.0.0, 8.0.0
>            Reporter: Aaron Philip
>            Priority: Minor
>
> According to
> [https://arrow.apache.org/docs/python/dataset.html#dataset-discovery],
> a dataset infers its Parquet schema from the first file in the path.
> I have a situation where a column was added to the schema after a certain
> date. As a result, when I read the Parquet files in this path, the new
> column is ignored because it is not part of the schema of the first file in
> that path.
> I would like an option to infer the schema from the last file in the
> path to avoid this issue.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
