Kyle Barron created ARROW-16613:
-----------------------------------
Summary: [Python][Parquet] pyarrow.parquet.write_metadata with
metadata_collector appears to be O(n^2)
Key: ARROW-16613
URL: https://issues.apache.org/jira/browse/ARROW-16613
Project: Apache Arrow
Issue Type: Improvement
Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Kyle Barron
Hello!
I've noticed that when writing a `_metadata` file with
`pyarrow.parquet.write_metadata`, it is very slow with a large
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that
the concatenation inside `metadata.append_row_groups` is very slow. The writer
first [iterates over every item of the
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
and then [concatenates them on each
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].
Would it be possible to make a vectorized implementation of this, where
`append_row_groups` accepts a list of `FileMetaData` objects and
concatenation happens only once?
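To illustrate the suspected cost pattern, here is a toy model in pure Python. It is not pyarrow's implementation: the function names `append_one_at_a_time` and `concatenate_once` are hypothetical, and plain lists stand in for row-group metadata. It only assumes that each append rebuilds everything accumulated so far, and counts the resulting element copies.

```
def append_one_at_a_time(chunks):
    """Rebuild the accumulated list on every append and count element copies."""
    merged, copies = [], 0
    for chunk in chunks:
        # `merged + chunk` builds a new list, copying everything gathered so far.
        merged = merged + chunk
        copies += len(merged)
    return merged, copies


def concatenate_once(chunks):
    """Gather all chunks in a single pass, copying each element exactly once."""
    merged = [item for chunk in chunks for item in chunk]
    return merged, len(merged)


for n in (500, 1000, 2000):
    chunks = [[i] for i in range(n)]
    _, repeated = append_one_at_a_time(chunks)
    _, single = concatenate_once(chunks)
    print(f"n={n}: repeated append copies {repeated}, single pass copies {single}")
```

Under that assumption the repeated-append count is n(n+1)/2, so doubling n roughly quadruples the work, while the single pass grows linearly.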
Repro (in IPython, to use `%time`):
```
from io import BytesIO
import pyarrow as pa
import pyarrow.parquet as pq
def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]
schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
```
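As a quick sanity check on the O(n^2) reading, the wall times reported above can be turned into an empirical scaling exponent for each doubling of the collector length. This is only arithmetic on the four measurements from the repro; the helper name `scaling_exponents` is hypothetical.

```
import math

# Wall times in seconds from the repro above, keyed by metadata_collector length.
wall_times = {500: 0.234, 1000: 0.970, 2000: 4.3, 4000: 17.3}


def scaling_exponents(timings):
    """Return log(t_b / t_a) / log(b / a) for each pair of consecutive sizes."""
    sizes = sorted(timings)
    return [
        math.log(timings[b] / timings[a]) / math.log(b / a)
        for a, b in zip(sizes, sizes[1:])
    ]


exponents = scaling_exponents(wall_times)
for exponent in exponents:
    print(f"{exponent:.2f}")
```

An exponent near 1 would indicate linear growth; each of the three values here comes out close to 2, which is consistent with quadratic behavior over this range.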