alexhudspith opened a new issue, #35498:
URL: https://github.com/apache/arrow/issues/35498
### Describe the bug, including details regarding any error messages, version, and platform.
On Linux, `pyarrow.parquet.write_to_dataset` shows a large performance
regression in Arrow 12.0 versus 11.0.
The following results were collected on Ubuntu 22.04.2 LTS (kernel
5.15.0-71-generic) with a 4-core Intel Haswell CPU @ 3.6 GHz, 16 GB RAM, and a
Samsung 840 Pro SSD. Each value is the elapsed time in seconds to write a single
int64 column containing the integers [0, ..., _length_-1], with no compression
and no multi-threading:
| Array length | Arrow 11 (s) | Arrow 12 (s) |
|-----------------:|--------:|--------:|
| 1,000,000 | 0.1 | 0.1 |
| 2,000,000 | 0.2 | 0.4 |
| 4,000,000 | 0.3 | 1.6 |
| 8,000,000 | 0.8 | 6.2 |
| 16,000,000 | 2.3 | 24.4 |
| 32,000,000 | 6.5 | 94.1 |
| 64,000,000 | 13.5 | 371.7 |
The output directory was deleted before each run.
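To make the scaling easier to see, here is a small sketch that computes, from the numbers in the table above, how much each timing grows when the array length doubles, plus the Arrow 12 / Arrow 11 slowdown per row. The `growth_factors` helper is only illustrative. On these numbers Arrow 12 grows by roughly 4x per doubling, which would be consistent with quadratic behaviour, but that is an inference from seven data points rather than a confirmed diagnosis.

```python
# Timings copied from the table above: length -> (Arrow 11 s, Arrow 12 s)
timings = {
    1_000_000: (0.1, 0.1),
    2_000_000: (0.2, 0.4),
    4_000_000: (0.3, 1.6),
    8_000_000: (0.8, 6.2),
    16_000_000: (2.3, 24.4),
    32_000_000: (6.5, 94.1),
    64_000_000: (13.5, 371.7),
}


def growth_factors(times):
    """Ratio of each timing to the previous one (length doubles each step)."""
    return [round(b / a, 1) for a, b in zip(times, times[1:])]


lengths = sorted(timings)
v11 = [timings[n][0] for n in lengths]
v12 = [timings[n][1] for n in lengths]

g11 = growth_factors(v11)  # growth per doubling, Arrow 11
g12 = growth_factors(v12)  # growth per doubling, Arrow 12
slowdown = [round(new / old, 1) for old, new in zip(v11, v12)]

print("Arrow 11 growth per doubling:", g11)
print("Arrow 12 growth per doubling:", g12)
print("Arrow 12 / Arrow 11 slowdown:", slowdown)
```

The smallest timings are only reported to one decimal place, so the first couple of ratios are coarse.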
```
"""check.py"""
import sys
import time
import gc

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def main():
    path = '/tmp/test.parquet'
    # Array length comes from the first CLI argument (default: 10 million).
    length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
    # Single int64 column 'A' holding 0..length-1.
    table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])

    # Time only the dataset write, not the table construction.
    t0 = time.perf_counter()
    pq.write_to_dataset(
        table, path, schema=table.schema, use_legacy_dataset=False,
        use_threads=False, compression=None
    )
    duration = time.perf_counter() - t0
    print(f'{duration:.2f}s')


if __name__ == '__main__':
    main()
```
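For anyone wanting to repeat the sweep, the following is a minimal sketch of a driver, assuming `check.py` is in the current directory and prints a single line such as `0.12s`. `run_sweep` is a hypothetical helper written for illustration; it is not the script used to produce the table. It deletes the output directory before each run, as described above, and runs each size in a fresh interpreter.

```python
import shutil
import subprocess
import sys


def run_sweep(lengths, script="check.py", out_dir="/tmp/test.parquet"):
    """Run `script` once per array length and return {length: seconds}.

    Assumes the script takes the length as its only argument and prints
    the elapsed time as '<seconds>s' on stdout.
    """
    results = {}
    for n in lengths:
        # Start every run from an empty output location.
        shutil.rmtree(out_dir, ignore_errors=True)
        proc = subprocess.run(
            [sys.executable, script, str(n)],
            capture_output=True, text=True, check=True,
        )
        # Strip the trailing 's' unit before converting to float.
        results[n] = float(proc.stdout.strip().rstrip("s"))
    return results
```

Calling `run_sweep([1_000_000 * 2**k for k in range(7)])` would cover the same lengths as the table, once per installed Arrow version.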
Running `git bisect` on local builds leads me to this commit:
660d259f525d301f7ff5b90416622698fa8a5e9c: [C++] Add ordered/segmented
aggregation Substrait extension (#34627).
Following that change, flame graphs show a lot of additional time spent in
`arrow::util::EnsureAlignment` calling glibc `memcpy`:
Before (ddd0a337174e57cdc80b1ee30dc7e787acfc09f6)

After (660d259f525d301f7ff5b90416622698fa8a5e9c)

### Component(s)
C++, Parquet