Hi Aaron,

Sorry for the late reply!
Below is a table that roughly estimates the memory consumption for different bit widths:

| bitWidth | arrays | per-array length | per-array bytes | total bytes    | total (binary) |
|---------:|-------:|-----------------:|----------------:|---------------:|---------------:|
|       32 |     16 |      268,435,520 |   1,073,742,080 | 17,179,873,280 |      16.00 GiB |
|       31 |      8 |      268,435,520 |   1,073,742,080 |  8,589,936,640 |       8.00 GiB |
|       30 |      4 |      268,435,520 |   1,073,742,080 |  4,294,968,320 |       4.00 GiB |
|       29 |      2 |      268,435,520 |   1,073,742,080 |  2,147,484,160 |       2.00 GiB |
|       28 |      1 |      268,435,520 |   1,073,742,080 |  1,073,742,080 |       1.00 GiB |
|       27 |      1 |      134,217,792 |     536,871,168 |    536,871,168 |     512.00 MiB |
|       26 |      1 |       67,108,928 |     268,435,712 |    268,435,712 |     256.00 MiB |
|       25 |      1 |       33,554,496 |     134,217,984 |    134,217,984 |     128.00 MiB |
|       24 |      1 |       16,777,280 |      67,109,120 |     67,109,120 |      64.00 MiB |
|       20 |      1 |        1,048,640 |       4,194,560 |      4,194,560 |       4.00 MiB |
|       10 |      1 |            1,088 |           4,352 |          4,352 |       4.25 KiB |
|        1 |      1 |               64 |             256 |            256 |       0.25 KiB |

I think your proposed change produces a slightly different pattern of generated numbers. If we want to keep the original pattern, a quick fix might be to generate the arrays lazily so that the peak memory footprint stays small (rough sketch below the quoted message). WDYT?

Best,
Gang

On Wed, Jan 7, 2026 at 6:23 AM Aaron Niskode-Dossett via dev <[email protected]> wrote:

> The test
> TestByteBitPacking512VectorLE.unpackValuesUsingVectorBitWidth(TestByteBitPacking512VectorLE
> is flaky in the Parquet GitHub PR testing environment [1].
>
> I gave the error to Codex (the OpenAI coding agent) and asked it to fix
> the test. However, since I don't have enough confidence in my own
> understanding of the problem or the fix, I have not opened a PR. The fix
> can be found on my fork here:
> <https://github.com/dossett/parquet-java/commit/7635c8599524aadee1164fc2168801c51390b118>.
>
> The Codex summary of the problem and the fix is this:
>
> We addressed CI OOMs in TestByteBitPacking512VectorLE
> (parquet-encoding-vector) by bounding the test input size while keeping
> the same correctness coverage. The original getRangeData could allocate
> arrays on the order of hundreds of millions of ints per bit width, which
> can consume tens of GB of heap and fail in constrained CI environments.
>
> The updated test generates a single bounded dataset (min 64, max 2^20
> values) and spans the full legal value range for each bit width (including
> the full signed int range for 32-bit). The vector and scalar pack/unpack
> paths are still compared for equality across bit widths, but without the
> unbounded memory stress that was causing flakiness.
>
> I would appreciate any feedback on that, or alternatively other ways to
> address the flaky test. I found it very frustrating recently when I was
> opening several PRs.
>
> Cheers,
> Aaron
>
> [1] Example failure:
> https://github.com/apache/parquet-java/actions/runs/20671204311/job/59352228516?pr=3385
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
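For concreteness, here is a rough sketch of what I mean by lazy generation. It is only an illustration, not the actual test code: the class name RangeChunks, the chunk size, and the value-filling pattern are placeholders, and a real fix would reproduce whatever pattern getRangeData uses today.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Illustration only: produce the test input one chunk at a time instead of
 * materializing every array up front, so at most one chunk is reachable
 * (and the previous one is eligible for GC) while the pack/unpack
 * comparison runs.
 */
final class RangeChunks implements Iterable<int[]> {
  private final int bitWidth;
  private final long totalValues; // same total count the current test generates
  private final int chunkSize;    // e.g. 1 << 20 values per chunk

  RangeChunks(int bitWidth, long totalValues, int chunkSize) {
    this.bitWidth = bitWidth;
    this.totalValues = totalValues;
    this.chunkSize = chunkSize;
  }

  @Override
  public Iterator<int[]> iterator() {
    return new Iterator<int[]>() {
      private long produced = 0;

      @Override
      public boolean hasNext() {
        return produced < totalValues;
      }

      @Override
      public int[] next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        int size = (int) Math.min(chunkSize, totalValues - produced);
        int[] chunk = new int[size];
        // Placeholder pattern: a sequential sweep masked to the bit width.
        // The real implementation would keep getRangeData's original pattern here.
        long mask = (bitWidth == 32) ? 0xFFFFFFFFL : ((1L << bitWidth) - 1);
        for (int i = 0; i < size; i++) {
          chunk[i] = (int) ((produced + i) & mask);
        }
        produced += size;
        return chunk;
      }
    };
  }
}
```

The test loop would then look roughly like `for (int[] chunk : new RangeChunks(bitWidth, totalValuesFor(bitWidth), 1 << 20)) { /* pack with the vector and scalar kernels, compare */ }` (totalValuesFor is a hypothetical helper), so the peak footprint is one chunk plus the packed buffers rather than the multi-GiB rows in the table above, while the generated values stay the same.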
