[ https://issues.apache.org/jira/browse/ARROW-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534261#comment-17534261 ]
Daniel Friar commented on ARROW-16506:
--------------------------------------
Thanks [~westonpace], that makes sense. It may be worth adding a note to the
docs to make it clear that this is the case and that ordered writes should not
be expected or relied upon.
I do think there are cases where an ordered write is important, particularly
for e.g. time series data, where a particular ordering is necessary for
downstream tasks and additional in-memory sorts once the data is loaded may be
expensive.
More of an opinion, but I generally think it can be a little surprising to
write from a table/dataframe and read it back only to discover the order has
changed. Adding sequencing to achieve an ordered write would be an improvement
IMO!
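Until such sequencing exists, one workaround is to carry the intended order as an explicit column (like the {{id}} column in the reproduction below) and restore it with a sort after reading. A minimal sketch, using plain pandas/numpy rather than any particular pyarrow API:

```python
import numpy as np
import pandas as pd

# Build a small frame and record the intended row order as data.
df = pd.DataFrame({"value": np.random.randn(10)})
df["id"] = np.arange(len(df))  # explicit ordering column

# Simulate the reordering observed on read-back by shuffling the rows.
shuffled = df.sample(frac=1, random_state=0)

# Restore the original order with an in-memory sort on the order column.
restored = shuffled.sort_values("id").reset_index(drop=True)
assert restored["id"].is_monotonic_increasing
```

This is exactly the kind of extra in-memory sort the comment above calls expensive for large time series, which is why write-side sequencing would be preferable.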
> Pyarrow 8.0.0 write_dataset writes data in different order with
> use_threads=True
> --------------------------------------------------------------------------------
>
> Key: ARROW-16506
> URL: https://issues.apache.org/jira/browse/ARROW-16506
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Daniel Friar
> Priority: Major
> Labels: dataset, parquet, pyarrow
>
> In the latest (8.0.0) release, the following code snippet seems to write out
> data in a different order for each of the partitions when
> {{use_threads=True}} vs. when {{use_threads=False}}.
> Testing the same snippet with pyarrow 7.0.0 gives the same order regardless
> of whether {{use_threads}} is set to True when the data is written.
>
> {code:python}
> import itertools
>
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> n_rows, n_cols = 100_000, 20
>
> def create_dataframe(color, year):
>     arr = np.random.randn(n_rows, n_cols)
>     df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
>     df["color"] = color
>     df["year"] = year
>     df["id"] = np.arange(len(df))
>     return df
>
> partitions = ["red", "green", "blue"]
> years = [2011, 2012, 2013]
>
> dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
> df = pd.concat(dataframes)
> table = pa.Table.from_pandas(df=df)
>
> ds.write_dataset(
>     table,
>     "./test",
>     format="parquet",
>     max_rows_per_group=1_000_000,
>     min_rows_per_group=1_000_000,
>     existing_data_behavior="overwrite_or_ignore",
>     partitioning=ds.partitioning(pa.schema([
>         ("color", pa.string()),
>         ("year", pa.int64())
>     ]), flavor="hive"),
>     use_threads=True,
> )
>
> df_read = pd.read_parquet("./test/color=blue/year=2012")
> df_read.head()[["id"]]
> {code}
>
> Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)