[jira] [Commented] (ARROW-16564) [Python] Add option to have dataset infer the parquet schema from the last file

Alenka Frim (Jira) Fri, 13 May 2022 00:13:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536474#comment-17536474
 ]


Alenka Frim commented on ARROW-16564:
-------------------------------------

You can pass a custom schema to `ds.dataset()`, which should solve your problem.

Using an example from the docs:
{code:python}
import tempfile
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Creating an example dataset with two parquet files
base = pathlib.Path(tempfile.mkdtemp(prefix="pyarrow-"))
(base / "parquet_dataset").mkdir(exist_ok=True)
table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [1, 2] * 5})
pq.write_table(table.slice(0, 5), base / "parquet_dataset/data1.parquet")
pq.write_table(table.slice(5), base / "parquet_dataset/data2.parquet")

# reading the data

import pyarrow.dataset as ds
# Define the schema that includes the newly added column.
# In this example neither parquet file contains the new column, to show that
# this also works.
schema = pa.schema([
    ('a', pa.int64()),
    ('b', pa.float64()),
    ('c', pa.int64()),
    ('new', pa.bool_())
])
# Read the data with the new schema and convert it into a table to check the
# result
ds.dataset(base / "parquet_dataset", format="parquet", schema=schema).to_table()
{code}
You should get:
{code:python}
>>> ds.dataset(base / "parquet_dataset", format="parquet", schema=schema).to_table()
pyarrow.Table
a: int64
b: double
c: int64
new: bool
----
a: [[0,1,2,3,4],[5,6,7,8,9]]
b: [[0.2154222083206493,1.4903968099339355,0.7135195619005714,0.10383436484447274,-1.7986589196024543],[0.7329661943015637,-0.025262270751709868,-1.5908999186628758,0.7745704844800078,-0.9614861072871166]]
c: [[1,2,1,2,1],[2,1,2,1,2]]
new: [[null,null,null,null,null],[null,null,null,null,null]]
{code}

> [Python] Add option to have dataset infer the parquet schema from the last 
> file
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-16564
>                 URL: https://issues.apache.org/jira/browse/ARROW-16564
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 7.0.0, 8.0.0
>            Reporter: Aaron Philip
>            Priority: Minor
>
> According to 
> [https://arrow.apache.org/docs/python/dataset.html#dataset-discovery], 
> dataset will infer the schema for parquet based on the first file in the path.
> I have a situation where a column was added to the schema after a certain 
> date. As a result, when I try to read the parquet in this path, the new 
> column is ignored because it is not part of the schema of the first file in 
> that path.
> I would like the option to infer the schema based on the last file in the 
> path to avoid this issue. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
