[ https://issues.apache.org/jira/browse/ARROW-16564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536474#comment-17536474 ]
Alenka Frim commented on ARROW-16564:
-------------------------------------
You can pass a custom schema to `ds.dataset()`, which should solve your problem.
Using an example from the docs:
{code:python}
import tempfile
import pathlib
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# Creating an example dataset with two parquet files
base = pathlib.Path(tempfile.mkdtemp(prefix="pyarrow-"))
(base / "parquet_dataset").mkdir(exist_ok=True)
table = pa.table({'a': range(10), 'b': np.random.randn(10), 'c': [1, 2] * 5})
pq.write_table(table.slice(0, 5), base / "parquet_dataset/data1.parquet")
# slice() takes (offset, length), so the last five rows are slice(5, 5)
pq.write_table(table.slice(5, 5), base / "parquet_dataset/data2.parquet")
# reading the data
import pyarrow.dataset as ds
# Define the schema that includes the newly added column
# In this example neither parquet file contains the new column, to show that
# this also works
schema = pa.schema([
    ('a', pa.int64()),
    ('b', pa.float64()),
    ('c', pa.int64()),
    ('new', pa.bool_())
])
# Read the data with the new schema and convert it into a table to check the
# result
ds.dataset(base / "parquet_dataset", format="parquet", schema=schema).to_table()
{code}
You should get:
{code:python}
>>> ds.dataset(base / "parquet_dataset", format="parquet",
...            schema=schema).to_table()
pyarrow.Table
a: int64
b: double
c: int64
new: bool
----
a: [[0,1,2,3,4],[5,6,7,8,9]]
b: [[0.2154222083206493,1.4903968099339355,0.7135195619005714,0.10383436484447274,-1.7986589196024543],[0.7329661943015637,-0.025262270751709868,-1.5908999186628758,0.7745704844800078,-0.9614861072871166]]
c: [[1,2,1,2,1],[2,1,2,1,2]]
new: [[null,null,null,null,null],[null,null,null,null,null]]
{code}
> [Python] Add option to have dataset infer the parquet schema from the last file
> -------------------------------------------------------------------------------
>
> Key: ARROW-16564
> URL: https://issues.apache.org/jira/browse/ARROW-16564
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0, 8.0.0
> Reporter: Aaron Philip
> Priority: Minor
>
> According to
> [https://arrow.apache.org/docs/python/dataset.html#dataset-discovery],
> dataset will infer the schema for parquet based on the first file in the path.
> I have a situation where a column was added to the schema after a certain
> date. As a result, when I try to read the parquet in this path, the new
> column is ignored because it is not part of the schema of the first file in
> that path.
> I would like the option to infer the schema based on the last file in the
> path to avoid this issue.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)