Caleb Overman created ARROW-15265:
-------------------------------------
Summary: [C++][Python][Dataset] write_dataset with delete_matching
hangs when the number of partitions is too large
Key: ARROW-15265
URL: https://issues.apache.org/jira/browse/ARROW-15265
Project: Apache Arrow
Issue Type: Bug
Reporter: Caleb Overman
I'm attempting to use the {{existing_data_behavior="delete_matching"}}
option with {{ds.write_dataset}} to write a Hive-partitioned Parquet dataset
to S3. This works fine when the table being written creates 7 or fewer
partitions, but as soon as the partition column has an 8th unique value the
write hangs completely.
{code:python}
import numpy as np
import pyarrow as pa
from pyarrow import fs
import pyarrow.dataset as ds

bucket = "my-bucket"
s3 = fs.S3FileSystem()

cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
table_7 = pa.table(
    {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
)

# succeeds
ds.write_dataset(
    data=table_7,
    base_dir=f"{bucket}/test7.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)

cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
table_8 = pa.table(
    {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
)

# this hangs
ds.write_dataset(
    data=table_8,
    base_dir=f"{bucket}/test8.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)
{code}
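Possibly a red herring, but the 7-vs-8 cutoff happens to match pyarrow's default I/O thread pool size, so one diagnostic I can try is enlarging the pool before the write and seeing whether the threshold moves (a minimal sketch below; the link between the pool size and this hang is only a guess on my part):
{code:python}
import pyarrow as pa

# Diagnostic only: the default I/O thread pool size is 8, which matches the
# partition count at which the hang starts. The connection is just a guess.
print(pa.io_thread_count())  # 8 by default

# Enlarge the pool, then retry the 8-partition write above.
pa.set_io_thread_count(16)
{code}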
For the 8-partition dataset, the partition directory structure is created in S3
but no data files are written before the call hangs.
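Roughly how I checked what was created (a minimal sketch, using the same bucket and base_dir as the repro above):
{code:python}
from pyarrow import fs

bucket = "my-bucket"
s3 = fs.S3FileSystem()

# List everything under the 8-partition target; I only see the col1=...
# partition directories, no data files.
selector = fs.FileSelector(f"{bucket}/test8.parquet", recursive=True)
for info in s3.get_file_info(selector):
    print(info.type, info.path)
{code}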