Caleb Overman created ARROW-15265:
-------------------------------------

             Summary: [C++][Python][Dataset] write_dataset with delete_matching 
hangs when the number of partitions is too large
                 Key: ARROW-15265
                 URL: https://issues.apache.org/jira/browse/ARROW-15265
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Caleb Overman


I'm attempting to use the {{existing_data_behavior="delete_matching"}} 
option with {{ds.write_dataset}} to write a hive-partitioned parquet dataset 
to S3. This works fine when the table being written creates 7 or fewer 
partitions, but as soon as the partition column has an 8th unique value the 
write hangs completely.

 
{code:python}
import numpy as np
import pyarrow as pa
from pyarrow import fs
import pyarrow.dataset as ds

bucket = "my-bucket"
s3 = fs.S3FileSystem()

cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
table_7 = pa.table(
    {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
)
# succeeds
ds.write_dataset(
    data=table_7,
    base_dir=f"{bucket}/test7.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)

cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
table_8 = pa.table(
    {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
)
# this hangs
ds.write_dataset(
    data=table_8,
    base_dir=f"{bucket}/test8.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)
{code}
For the table with 8 partitions, the directory structure is created in S3 but no 
actual data files are written before the write hangs.
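
As a stopgap, one possible workaround is to delete the matching partition 
directories manually and then write with 
{{existing_data_behavior="overwrite_or_ignore"}}, which sidesteps the 
per-partition delete step entirely. The sketch below is untested against this 
exact setup and reuses the bucket/table names from the reproduction above.
{code:python}
import pyarrow.dataset as ds
from pyarrow import fs

bucket = "my-bucket"
base_dir = f"{bucket}/test8.parquet"  # same path as the failing write above
s3 = fs.S3FileSystem()

# Manually delete the partition directories the incoming table would replace.
for value in table_8.column("col1").unique().to_pylist():
    partition_dir = f"{base_dir}/col1={value}"
    # Skip partitions that don't exist yet so delete_dir doesn't error out.
    if s3.get_file_info(partition_dir).type == fs.FileType.Directory:
        s3.delete_dir(partition_dir)

# Write without delete_matching, so no deletes happen inside write_dataset.
ds.write_dataset(
    data=table_8,
    base_dir=base_dir,
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore",
)
{code}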

 


