Kyle Barron created ARROW-16613:
-----------------------------------

             Summary: [Python][Parquet] pyarrow.parquet.write_metadata with 
metadata_collector appears to be O(n^2)
                 Key: ARROW-16613
                 URL: https://issues.apache.org/jira/browse/ARROW-16613
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
    Affects Versions: 8.0.0
            Reporter: Kyle Barron


Hello!

I've noticed that writing a `_metadata` file with
`pyarrow.parquet.write_metadata` is very slow when the
`metadata_collector` is large, exhibiting O(n^2) behavior. Specifically,
the concatenation inside `metadata.append_row_groups` appears to be the
bottleneck: the writer first [iterates over every item of the
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
and then [concatenates them on each
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to provide a vectorized implementation, where
`append_row_groups` accepts a list of `FileMetaData` objects and the
concatenation happens only once?
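To make the quadratic cost concrete, here is a pure-Python analogy (not pyarrow code): model each `FileMetaData` as a list of row groups and each `append_row_groups` call as a copying concatenation. Accumulating one object at a time re-copies the growing accumulator on every iteration, while a single batched concatenation touches each row group only once.

```python
def append_one_at_a_time(metadatas):
    # Mimics the current loop: every "append" re-copies the accumulator,
    # so the total number of elements copied grows as O(n^2).
    merged = list(metadatas[0])
    copied = 0
    for md in metadatas[1:]:
        merged = merged + list(md)  # full copy of everything merged so far
        copied += len(merged)
    return merged, copied


def append_batched(metadatas):
    # A vectorized variant: one pass over all inputs, O(n) total work.
    merged = [rg for md in metadatas for rg in md]
    return merged, len(merged)


# 1000 metadata objects with one row group each
mds = [["rg"]] * 1000
merged_slow, copied_slow = append_one_at_a_time(mds)
merged_fast, copied_fast = append_batched(mds)
assert merged_slow == merged_fast
print(copied_slow, copied_fast)  # 500499 1000
```

The element-copy counts (~n^2/2 vs n) line up with the roughly 4x slowdown per doubling in the timings below.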

 

Repro (in IPython, to use `%time`):

```

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s

```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)