Daniel Friar created ARROW-16506:
------------------------------------
Summary: Pyarrow 8.0.0 write_dataset writes data in different
order with {{use_threads=True}}
Key: ARROW-16506
URL: https://issues.apache.org/jira/browse/ARROW-16506
Project: Apache Arrow
Issue Type: Bug
Reporter: Daniel Friar
In the latest (8.0.0) release, the following code snippet appears to write rows
in a different order within each partition when {{use_threads=True}} than when
{{use_threads=False}}.
Running the same snippet with pyarrow 7.0.0 gives the same order regardless of
whether {{use_threads}} is set to True when the data is written.
{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

n_rows, n_cols = 100_000, 20


def create_dataframe(color, year):
    # Random payload columns plus the two partition keys and a per-frame row id.
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
df = pd.concat(dataframes)
table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

# Read one partition back and inspect the order of the "id" column.
df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]
{code}
Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
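Looking at {{head()}} only shows the first few rows, so a reordering further into the partition could be missed. The helper below is a minimal sketch of a fuller check, assuming that the {{id}} column of a single partition should come back in ascending order because it was created with {{np.arange}}. The name {{first_out_of_order}} is hypothetical and not part of pyarrow.

```python
from typing import Iterable, Optional


def first_out_of_order(ids: Iterable[int]) -> Optional[int]:
    """Return the position of the first id that is smaller than its
    predecessor, or None if the ids never decrease."""
    previous = None
    for position, current in enumerate(ids):
        if previous is not None and current < previous:
            return position
        previous = current
    return None


# Ids in their original order: nothing to report.
print(first_out_of_order([0, 1, 2, 3]))     # None
# The id 2 at position 3 follows the larger id 5.
print(first_out_of_order([0, 1, 5, 2, 3]))  # 3
```

With the snippet above, the check could be applied as {{first_out_of_order(df_read["id"])}}. A result of None would mean the partition was written in its original row order.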
--
This message was sent by Atlassian Jira
(v8.20.7#820007)