Joris Van den Bossche created ARROW-10883:
---------------------------------------------
Summary: [C++][Dataset] Preserve order when writing dataset
Key: ARROW-10883
URL: https://issues.apache.org/jira/browse/ARROW-10883
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Currently, when writing a dataset, e.g. from a table consisting of a set of
record batches, there is no guarantee that the row order is preserved when
reading the dataset back.
Small code example:
{code}
In [1]: import pyarrow as pa; import pyarrow.dataset as ds
In [2]: table = pa.table({"a": range(10)})
In [3]: table.to_pandas()
Out[3]:
   a
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]: batches = table.to_batches(max_chunksize=2)
In [5]: ds.write_dataset(batches, "test_dataset_order", format="parquet")
In [6]: ds.dataset("test_dataset_order").to_table().to_pandas()
Out[6]:
   a
0  4
1  5
2  8
3  9
4  6
5  7
6  2
7  3
8  0
9  1
{code}
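Until this is addressed, one possible workaround, when the data contains a monotonically increasing column (like "a" above), is to restore the order after reading by sorting on that column. A minimal sketch (the temporary directory is just for illustration; {{sort_indices}} comes from {{pyarrow.compute}}):

{code}
import tempfile

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

table = pa.table({"a": range(10)})
batches = table.to_batches(max_chunksize=2)

with tempfile.TemporaryDirectory() as path:
    # The row order of the written dataset is not guaranteed.
    ds.write_dataset(batches, path, format="parquet")
    result = ds.dataset(path).to_table()

# Restore the original order by sorting on the monotonic column.
restored = result.take(pc.sort_indices(result["a"]))
assert restored["a"].to_pylist() == list(range(10))
{code}

This of course only works when such a sort key exists in the data, which is part of why an option on {{write_dataset}} itself would be preferable.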
Although this might seem normal in the SQL world, typical dataframe users (R,
pandas/dask, etc.) will expect the row order to be preserved.
Some applications might also rely on this. For example, with dask you can have
a sorted index column ("divisions" between the partitions) that would get lost
this way (note: the dask parquet writer itself doesn't use
{{pyarrow.dataset.write_dataset}}, so it isn't impacted by this).
Some discussion about this started in https://github.com/apache/arrow/pull/8305
(ARROW-9782), which changed the writer to write all fragments to a single file
instead of one file per fragment.
I am not fully sure what the best way to solve this is, but IMO at least having
the _option_ to preserve the order would be good.
cc [~bkietz]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)