[ https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-10493:
---------------------------------------------
Assignee: Joris Van den Bossche
> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
> Key: ARROW-10493
> URL: https://issues.apache.org/jira/browse/ARROW-10493
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 2.0.0
> Environment: Python 3.6
> Reporter: Christian Lundgren
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.1
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> When I write a column of type `struct(string)` that has more elements
> than the write_batch_size, the output contains only the first batch,
> repeated. The data in batches after the first batch is not written to the
> output.
> I am only seeing this behaviour with Arrow 2.0.0; in 1.0.1 the output
> contains all the data as expected.
>
> This Python test case reproduces the problem. The last value in the output is
> "key-0" instead of the expected "key-1024":
>
> {code:python}
> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def test_struct_array():
>     # One more row than the default Parquet write batch size (1024).
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>
>     # Wrap the string column in a single-field struct.
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>
>     # Round-trip through an in-memory Parquet file.
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>
>     # The first batch is intact; the failure shows up in the row after it.
>     assert actual[:1024] == expected[:1024]
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])
> {code}
>
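
To narrow down where the read-back values start to diverge, a small helper along these lines could be run on `actual` and `expected` from the test case in the description. This is only a sketch: `first_mismatch` is a hypothetical name, not part of pyarrow, and the lists below are made-up data that mimic the reported symptom (the first batch repeated) rather than values read from a real file.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that a length difference counts as a mismatch


def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if equal."""
    pairs = zip_longest(actual, expected, fillvalue=_MISSING)
    for i, (a, e) in enumerate(pairs):
        if a != e:
            return i
    return None


# Made-up data mimicking the report: 1025 values with a batch size of 1024,
# where the start of the first batch is repeated in place of the second batch.
batch_size = 1024
expected = [f"key-{i}" for i in range(batch_size + 1)]
actual = expected[:batch_size] + expected[:1]

idx = first_mismatch(actual, expected)
print(idx, actual[idx], expected[idx])
```

If the first differing index equals the write batch size, that is consistent with the description above, where only rows after the first batch are affected.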
--
This message was sent by Atlassian Jira
(v8.3.4#803005)