alexhudspith opened a new issue, #35498:
URL: https://github.com/apache/arrow/issues/35498

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   On Linux, `pyarrow.parquet.write_to_dataset` shows a large performance 
regression in Arrow 12.0 versus 11.0.
   
   The following results were collected on Ubuntu 22.04.2 LTS (kernel 
5.15.0-71-generic), Intel Haswell 4-core @ 3.6 GHz, 16 GB RAM, Samsung 840 Pro 
SSD. They are elapsed times in seconds to write a single int64 column of 
integers [0, ..., _length_-1] with no compression and no multi-threading:
   
   | Array length | Arrow 11 (s) | Arrow 12 (s) |
   |-------------:|-------------:|-------------:|
   | 1,000,000 | 0.1 | 0.1 |
   | 2,000,000 | 0.2 | 0.4 |
   | 4,000,000 | 0.3 | 1.6 |
   | 8,000,000 | 0.8 | 6.2 |
   | 16,000,000 | 2.3 | 24.4 |
   | 32,000,000 | 6.5 | 94.1 |
   | 64,000,000 | 13.5 | 371.7 |
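
To make the scaling in the table easier to read, here is a small sketch that computes how much each timing grows when the array length doubles. The numbers are copied from the table; the helper name `doubling_ratios` is only illustrative.

```python
# Elapsed times in seconds, copied from the table above.
# Each entry corresponds to a doubling of the array length (1M, 2M, ..., 64M).
arrow11 = [0.1, 0.2, 0.3, 0.8, 2.3, 6.5, 13.5]
arrow12 = [0.1, 0.4, 1.6, 6.2, 24.4, 94.1, 371.7]


def doubling_ratios(times):
    """Return the growth factor between each pair of consecutive timings."""
    return [later / earlier for earlier, later in zip(times, times[1:])]


if __name__ == '__main__':
    print('Arrow 11:', [f'{r:.2f}' for r in doubling_ratios(arrow11)])
    print('Arrow 12:', [f'{r:.2f}' for r in doubling_ratios(arrow12)])
```

Each Arrow 12 ratio comes out close to 4, while the Arrow 11 ratios stay below 3. A factor of about 4 per doubling would be consistent with roughly quadratic growth in the array length, although these timings alone do not prove that.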
   
   The output directory was deleted before each run.
   ```
   """check.py"""
   import sys
   import time
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def main():
       path = '/tmp/test.parquet'
       length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
       table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
       t0 = time.perf_counter()
       pq.write_to_dataset(
           table, path, schema=table.schema, use_legacy_dataset=False,
           use_threads=False, compression=None
       )
       duration = time.perf_counter() - t0
       print(f'{duration:.2f}s')
   
   if __name__ == '__main__':
       main()
   ```
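
For anyone who wants to reproduce the whole table in one go, the following is a minimal sketch of a timing loop that deletes the output directory before each run, as described above. The names `time_writes` and `write_int64_column` are hypothetical helpers, not part of pyarrow; the write call itself mirrors the one in `check.py`.

```python
import shutil
import time
from pathlib import Path


def write_int64_column(length, out_dir):
    """Write one int64 column [0, ..., length-1], mirroring check.py."""
    # Imported lazily so the timing helper below only needs the standard library.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
    pq.write_to_dataset(
        table, str(out_dir), schema=table.schema, use_legacy_dataset=False,
        use_threads=False, compression=None
    )


def time_writes(write_fn, lengths, out_dir):
    """Time write_fn(length, out_dir) for each length, starting from a clean directory."""
    out_dir = Path(out_dir)
    results = {}
    for length in lengths:
        # Remove any previous output so earlier runs cannot affect this one.
        shutil.rmtree(out_dir, ignore_errors=True)
        t0 = time.perf_counter()
        write_fn(length, out_dir)
        results[length] = time.perf_counter() - t0
    return results
```

A possible invocation would be `time_writes(write_int64_column, [1_000_000 * 2**k for k in range(7)], '/tmp/test.parquet')`, run once under each Arrow version being compared.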
   
   Running `git bisect` on local builds leads me to this commit: 
660d259f525d301f7ff5b90416622698fa8a5e9c: [C++] Add ordered/segmented 
aggregation Substrait extension (#34627).
   
   Following that change, flame graphs show a lot of additional time spent in 
`arrow::util::EnsureAlignment` calling glibc `memcpy`:
   
   Before (ddd0a337174e57cdc80b1ee30dc7e787acfc09f6)
   ![good-ddd0a33 perf](https://user-images.githubusercontent.com/13152260/236944113-e7b6abb3-9449-4ca6-8a4c-ab88c0f9ace9.svg)
   
   After (660d259f525d301f7ff5b90416622698fa8a5e9c)
   ![bad-660d259 perf](https://user-images.githubusercontent.com/13152260/236944165-fbad2d51-716d-4985-ac1e-a51b18bf76a8.svg)
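
Since the extra time shows up under `EnsureAlignment`, one thing that may be worth checking is whether the NumPy-backed data buffer in the reproducer starts on an aligned address. The sketch below is only a diagnostic idea: the helper names are hypothetical, and the 64-byte boundary is an assumption (it is Arrow's default allocation alignment, but the alignment actually requested on this code path is not stated in this report).

```python
def misalignment(address, alignment=64):
    """Return how many bytes `address` lies past the previous aligned boundary.

    A result of 0 means the address is aligned. The default of 64 is an
    assumption, not a value taken from the report.
    """
    return address % alignment


def data_buffer_address(length):
    """Return the address of the values buffer for the array built in check.py."""
    # Imported lazily so misalignment() can be used without pyarrow installed.
    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(length))
    # buffers()[0] is the validity bitmap (may be None); buffers()[1] holds the values.
    return arr.buffers()[1].address
```

Calling `misalignment(data_buffer_address(10_000_000))` on an affected machine would show whether the buffer is 64-byte aligned. A nonzero result would not by itself confirm the cause of the regression.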
   
   
   ### Component(s)
   
   C++, Parquet


