zeroshade opened a new pull request, #655: URL: https://github.com/apache/arrow-go/pull/655
### Rationale for this change Writing large byte array values (e.g., 50 values x 50MB each = 2.5GB total) caused a panic due to exceeding max page size. This happened because the writer accumulates the values in batches *before* checking the page size limits: 1. `WriteBatch()` calls `writeValues()` which adds ALL values to the encoder buffer 2. `commitWriteAndCheckPageLimit()` checks if the buffer exceeds the limit 3. **PROBLEM**: At this point, the buffer alraedy contains > 2GB of data if we hit the limit 4. `FlushCurrentPage()` attempts to do `int32(values.Len())` which overflows: `2,500,000,000 -> -1,794,967,296` 5. `bytes.Buffer.Grow(-1,794,967,296)` panics ### What changes are included in this PR? Modified `writeValues()` and `writeValuesSpaced()` for `ByteArray` and `FixedLenByteArray` types to check the buffer size *beore* adding the values and proactively flush when approaching the 2GB limit (parquet uses an int32 for page size). ### Are these changes tested? Yes, new tests are added, including some benchmarks to ensure that the new changes don't cause any performance impacts. ## Performance Impact **TL;DR: <1% overhead for typical workloads, 0% for fixed-size types** ### Benchmarks ``` Benchmark Time Data Throughput ───────────────────────────────────────────────────────────────────── WriteSmallByteArrayValues (100B) 2.19 ms 1 MB 457 MB/s WriteMediumByteArrayValues (10KB) 18.0 ms 10 MB 556 MB/s WriteLargeByteArrayValues (1MB) 137 ms 100 MB 730 MB/s WriteInt32Values (control) 0.15 ms 0.04 MB 267 MB/s (unchanged) ``` ### Impact by Data Type | Data Type | Overhead | Notes | |-----------|----------|-------| | Int32, Int64, Float, Boolean | **0%** | Unchanged code paths | | ByteArray (small, <1KB) | **<1%** | Batched processing | | ByteArray (large, >1MB) | **<0.01%** | I/O dominates, checking negligible | ### Per-Value Overhead | Value Size | Encoding Time | Added Overhead | % Impact | |------------|--------------|----------------|----------| | 100 bytes | 200 ns | ~10 ns | ~5% | | 1 KB | 2,000 ns | ~10 ns | ~0.5% | | 100 KB | 200,000 ns | ~10 ns | ~0.005% | | 1 MB+ | 2,000,000 ns | ~120 ns | ~0.006% | ### Are there any user-facing changes? Only the fix to the previous situation that would panic. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
