[jira] [Updated] (ARROW-16506) Pyarrow 8.0.0 write_dataset writes data in different order with {{use_threads=True}}

Daniel Friar (Jira) Mon, 09 May 2022 06:52:41 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Friar updated ARROW-16506:
---------------------------------
    Description: 
In the latest (8.0.0) release the following code snippet seems to write out 
data in a different order for each of the partitions when {{use_threads=True}} 
vs when {{use_threads=False}}.

Testing the same snippet with pyarrow 7.0.0 gives the same order regardless of 
whether {{use_threads}} is set to True when the data is written.

 
{code:python}
import itertools

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [
    create_dataframe(p, y) for p, y in itertools.product(partitions, years)
]
df = pd.concat(dataframes)

table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]

{code}
 

Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.
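
One way to make "different order" concrete is to look for the first place where the {{id}} values read back from a partition decrease, since each partition was written with ids in increasing order. The helper below is only a sketch and not part of the original report: the name {{first_out_of_order}} is hypothetical, and it works on a plain sequence of ids so it does not depend on any particular pyarrow or pandas call.

```python
from typing import Iterable, Optional


def first_out_of_order(ids: Iterable[int]) -> Optional[int]:
    """Return the position of the first id that is smaller than its
    predecessor, or None if the ids never decrease."""
    previous = None
    for position, current in enumerate(ids):
        if previous is not None and current < previous:
            return position
        previous = current
    return None


# Small self-contained examples; with the snippet above one could instead
# pass the ids read back from a partition, e.g. df_read["id"].tolist()
# (assumed usage, not taken from the report).
print(first_out_of_order([0, 1, 2, 3]))  # None
print(first_out_of_order([2, 3, 0, 1]))  # 2
```

A result of None for a partition would suggest the rows kept their written order; an integer gives the first row position where the order breaks.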

  was:
In the latest (8.0.0) release the following code snippet seems to write out 
data in a different order for each of the partitions when {{use_threads=True}} 
vs when {{use_threads=False}}.

Testing the same snippet with pyarrow gives the same order regardless of 
whether {{use_threads}} is set to True when the data is writen.

 
{code:java}
import itertools

import numpy as np
import pyarrow.dataset as ds
import pyarrow as pa

n_rows, n_cols = 100_000, 20

def create_dataframe(color, year):
    arr = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
    df["color"] = color
    df["year"] = year
    df["id"] = np.arange(len(df))
    return df


partitions = ["red", "green", "blue"]
years = [2011, 2012, 2013]
dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, 
years)]
df = pd.concat(dataframes)

table = pa.Table.from_pandas(df=df)

ds.write_dataset(
    table,
    "./test",
    format="parquet",
    max_rows_per_group=1_000_000,
    min_rows_per_group=1_000_000,
    existing_data_behavior="overwrite_or_ignore",
    partitioning=ds.partitioning(pa.schema([
        ("color", pa.string()),
        ("year", pa.int64())
    ]), flavor="hive"),
    use_threads=True,
)

df_read = pd.read_parquet("./test/color=blue/year=2012")
df_read.head()[["id"]]

{code}
 

Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.


> Pyarrow 8.0.0 write_dataset writes data in different order with 
> {{use_threads=True}}
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-16506
>                 URL: https://issues.apache.org/jira/browse/ARROW-16506
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Daniel Friar
>            Priority: Major
>              Labels: dataset, parquet, pyarrow
>
> In the latest (8.0.0) release the following code snippet seems to write out 
> data in a different order for each of the partitions when 
> {{use_threads=True}} vs when {{use_threads=False}}.
> Testing the same snippet with pyarrow 7.0.0 gives the same order regardless 
> of whether {{use_threads}} is set to True when the data is written.
>  
> {code:python}
> import itertools
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> n_rows, n_cols = 100_000, 20
> def create_dataframe(color, year):
>     arr = np.random.randn(n_rows, n_cols)
>     df = pd.DataFrame(data=arr, columns=[f"column_{i}" for i in range(n_cols)])
>     df["color"] = color
>     df["year"] = year
>     df["id"] = np.arange(len(df))
>     return df
> partitions = ["red", "green", "blue"]
> years = [2011, 2012, 2013]
> dataframes = [create_dataframe(p, y) for p, y in itertools.product(partitions, years)]
> df = pd.concat(dataframes)
> table = pa.Table.from_pandas(df=df)
> ds.write_dataset(
>     table,
>     "./test",
>     format="parquet",
>     max_rows_per_group=1_000_000,
>     min_rows_per_group=1_000_000,
>     existing_data_behavior="overwrite_or_ignore",
>     partitioning=ds.partitioning(pa.schema([
>         ("color", pa.string()),
>         ("year", pa.int64())
>     ]), flavor="hive"),
>     use_threads=True,
> )
> df_read = pd.read_parquet("./test/color=blue/year=2012")
> df_read.head()[["id"]]
> {code}
>  
> Tested on Ubuntu 20.04 with Python 3.8 and arrow versions 8.0.0 and 7.0.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
