[
https://issues.apache.org/jira/browse/ARROW-15265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470726#comment-17470726
]
David Li commented on ARROW-15265:
----------------------------------
Calling {{pa.set_io_thread_count(N)}} with {{N > (# partitions)}} unblocks it,
though I then see a different error: {{OSError: Path does not exist
'my-bucket/test8.parquet/col1=c'}} (the partition named in the error changes
on every run, i.e. it's not deterministic).
> [C++][Python][Dataset] write_dataset with delete_matching hangs when the
> number of partitions is too large
> ----------------------------------------------------------------------------------------------------------
>
> Key: ARROW-15265
> URL: https://issues.apache.org/jira/browse/ARROW-15265
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 6.0.1
> Reporter: Caleb Overman
> Priority: Major
>
> I'm attempting to use the {{existing_data_behavior="delete_matching"}}
> option with {{ds.write_dataset}} to write a Hive-partitioned Parquet
> dataset to S3. This works fine when the table being written produces
> 7 or fewer partitions, but as soon as the partition column has an 8th
> unique value, the write hangs completely.
>
> {code:java}
> import numpy as np
> import pyarrow as pa
> from pyarrow import fs
> import pyarrow.dataset as ds
> bucket = "my-bucket"
> s3 = fs.S3FileSystem()
> cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
> table_7 = pa.table(
>     {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
> )
> # succeeds
> ds.write_dataset(
>     data=table_7,
>     base_dir=f"{bucket}/test7.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
> table_8 = pa.table(
>     {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
> )
> # this hangs
> ds.write_dataset(
>     data=table_8,
>     base_dir=f"{bucket}/test8.parquet",
>     format="parquet",
>     partitioning=["col1"],
>     partitioning_flavor="hive",
>     filesystem=s3,
>     existing_data_behavior="delete_matching",
> )
> {code}
> For the dataset with 8 partitions, the directory structure is created in S3,
> but no actual files are written before the write hangs.
>