hamilton-earthscope opened a new pull request, #622: URL: https://github.com/apache/iceberg-go/pull/622
# Partitioned Write Optimizations

## Summary

This PR improves partitioned write throughput in the Iceberg table writer. Five incremental changes together cut time per operation by 72.1% and allocations per operation by 79.1% on the 2.5M-row benchmark.

## Performance Results

Note: these results use the latest commit on `main` of arrow-go (https://github.com/apache/arrow-go/commit/3160eef9c227d94db67bfaf5225a2d6c1f48bc76).

The benchmarks were run on an Apple M3 Max (darwin/arm64) using the new `BenchmarkPartitionedWriteThroughput` test with 2.5M rows per write operation.

### Incremental Improvements

Each Δ value is relative to the previous row.

| Change | Time/op | Δ Time | records/sec | Δ records/sec | allocs/op | Δ allocs/op |
|--------|---------|---------|-------------|---------------|-----------|-------------|
| Base | 2.35 s | - | 1,065,115 | - | 60,076,290 | - |
| Change 1 | 1.49 s | -36.5% | 1,677,802 | +57.5% | 35,076,376 | -41.6% |
| Change 2 | 1.21 s | -18.7% | 2,064,629 | +23.1% | 25,076,562 | -28.5% |
| Change 3 | 1.16 s | -4.5% | 2,161,545 | +4.7% | 22,576,588 | -10.0% |
| Change 4 | 1.07 s | -7.4% | 2,334,700 | +8.0% | 20,076,480 | -11.1% |
| Change 5 | 654.1 ms | -38.9% | 3,821,892 | +63.7% | 12,577,119 | -37.4% |

### Overall Improvement (Base → Change 5)

**2.5M Records per write:**

| Metric | Base | Change 5 | Improvement |
|--------|------|----------|-------------|
| **Time/op** | 2.35 s | 654.1 ms | **-72.1%** ⚡ |
| **records/sec** | 1,065,115 | 3,821,892 | **+258.9%** 🚀 |
| **allocs/op** | 60,076,290 | 12,577,119 | **-79.1%** 💪 |

**100K Records per write:**

| Metric | Base | Change 5 | Improvement |
|--------|------|----------|-------------|
| **Time/op** | 221.3 ms | 137.7 ms | **-37.8%** |
| **records/sec** | 451,853 | 726,085 | **+60.7%** |
| **allocs/op** | 2,472,792 | 573,012 | **-76.8%** |

## Impact

These optimizations make partitioned writes to Iceberg tables faster and less allocation-heavy, which matters most for high-throughput data ingestion. The gains grow with workload size: time per operation drops by 72.1% at 2.5M records per write, compared with 37.8% at 100K.

## Testing

All optimizations were validated by benchmarking on darwin/arm64 (Apple M3 Max). The improvements are consistent across the record volumes tested (100K, 500K, and 2.5M records).
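The PR description does not include the benchmark source, so the following is only a minimal sketch of how a Go benchmark can report a custom `records/sec` metric next to allocs/op, and how the percentage columns in the tables above can be derived. `partitionRows` is a hypothetical stand-in workload, not the iceberg-go writer, and `benchmarkPartition` is not `BenchmarkPartitionedWriteThroughput`. The sketch assumes Go 1.20 or later for `b.Elapsed`.

```go
package main

import (
	"fmt"
	"testing"
)

// partitionRows is a hypothetical stand-in for the real write path:
// it groups row indices by partition key.
func partitionRows(keys []int32) map[int32][]int {
	groups := make(map[int32][]int)
	for i, k := range keys {
		groups[k] = append(groups[k], i)
	}
	return groups
}

// benchmarkPartition reports allocs/op and a custom records/sec metric.
func benchmarkPartition(b *testing.B) {
	const rows = 100_000
	keys := make([]int32, rows)
	for i := range keys {
		keys[i] = int32(i % 16) // 16 partitions, chosen arbitrarily
	}

	b.ReportAllocs()
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		_ = partitionRows(keys)
	}
	// Total rows processed divided by measured wall time.
	b.ReportMetric(float64(rows)*float64(b.N)/b.Elapsed().Seconds(), "records/sec")
}

// pctChange returns the relative change from base to updated, in percent.
func pctChange(base, updated float64) float64 {
	return (updated - base) / base * 100
}

func main() {
	// testing.Benchmark runs a benchmark function outside of `go test`.
	res := testing.Benchmark(benchmarkPartition)
	fmt.Println(res.String(), res.MemString())

	// Reproduce the "Improvement" column of the 100K table from its raw values.
	fmt.Printf("time/op:     %+.1f%%\n", pctChange(221.3, 137.7))     // -37.8%
	fmt.Printf("records/sec: %+.1f%%\n", pctChange(451853, 726085))   // +60.7%
	fmt.Printf("allocs/op:   %+.1f%%\n", pctChange(2472792, 573012)) // -76.8%
}
```

The throughput line printed by `main` depends on the machine; only the three percentage lines are deterministic. In a regular `_test.go` file the same function would be named `BenchmarkXxx` and run with `go test -bench=. -benchmem`.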
