Chongkai Zhu created ARROW-7385:
-----------------------------------
Summary: ParquetDataset deadlock with different metadata_nthreads values
Key: ARROW-7385
URL: https://issues.apache.org/jira/browse/ARROW-7385
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1, 0.14.1, 0.12.1
Reporter: Chongkai Zhu
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

output_folder = r"C:\scr\tmp"  # raw string so the backslashes are not treated as escapes
weather_df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           "b": [1, 1, 1, 1, 5, 1, 1, 1, 1],
                           "c": ["c1", "c1", "c1", "c10", "c20", "c30", "c1", "c1", "c1"],
                           "d": [32, 32, 32, 32, 32, 32, 32, 32, 32]})
table = pa.Table.from_pandas(weather_df)
pq.write_to_dataset(table, root_path=output_folder, partition_cols=["a", "b", "c"])
h1. Works with 1 thread
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=1, validate_schema=False)
h1. Hangs with 2-6 threads (the exact range varies from run to run)
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=2, validate_schema=False)
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=6, validate_schema=False)
h1. Works with 60 threads
dataset = pq.ParquetDataset(output_folder, metadata_nthreads=60, validate_schema=False)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)