Hi Aaron and Gang Wu, I have opened a PR to fix this issue by lazily initializing data and reusing temporary arrays. Please take a look.
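For anyone curious, here is a minimal sketch of the lazy-initialize-and-reuse idea. This is illustrative only; `LazyChunks` and its members are hypothetical names, not the actual code in the PR:

```java
import java.util.function.LongToIntFunction;

// Hypothetical sketch (not the actual parquet-java test code): instead of
// materializing every chunk of test data up front, generate one chunk at a
// time into a single reused buffer, so peak memory stays at one chunk
// regardless of bit width.
public class LazyChunks {
    private final int[] buffer;           // reused across all chunks
    private final LongToIntFunction gen;  // value generator, e.g. i -> (int) (i & mask)

    public LazyChunks(int chunkLength, LongToIntFunction gen) {
        this.buffer = new int[chunkLength];
        this.gen = gen;
    }

    /** Fills the shared buffer with the values of chunk {@code chunkIndex} and returns it. */
    public int[] chunk(int chunkIndex) {
        long base = (long) chunkIndex * buffer.length;
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = gen.applyAsInt(base + i);
        }
        return buffer;
    }
}
```

The point is that the caller sees the same sequence of values as before, but only one array's worth of heap is ever live at a time.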
Issue link: https://github.com/apache/parquet-java/issues/3404
PR link: https://github.com/apache/parquet-java/pull/3405

On 2026/01/20 08:37:20 Gang Wu wrote:
> Hi Aaron,
>
> Sorry for the late reply!
>
> Below is the table to roughly estimate memory consumption with different
> bit widths:
>
> | bitWidth | arrays | per-array length | per-array bytes |    total bytes | total (binary) |
> |---------:|-------:|-----------------:|----------------:|---------------:|---------------:|
> |       32 |     16 |      268,435,520 |   1,073,742,080 | 17,179,873,280 |      16.00 GiB |
> |       31 |      8 |      268,435,520 |   1,073,742,080 |  8,589,936,640 |       8.00 GiB |
> |       30 |      4 |      268,435,520 |   1,073,742,080 |  4,294,968,320 |       4.00 GiB |
> |       29 |      2 |      268,435,520 |   1,073,742,080 |  2,147,484,160 |       2.00 GiB |
> |       28 |      1 |      268,435,520 |   1,073,742,080 |  1,073,742,080 |       1.00 GiB |
> |       27 |      1 |      134,217,792 |     536,871,168 |    536,871,168 |     512.00 MiB |
> |       26 |      1 |       67,108,928 |     268,435,712 |    268,435,712 |     256.00 MiB |
> |       25 |      1 |       33,554,496 |     134,217,984 |    134,217,984 |     128.00 MiB |
> |       24 |      1 |       16,777,280 |      67,108,480 |     67,108,480 |      64.00 MiB |
> |       20 |      1 |        1,048,640 |       4,194,560 |      4,194,560 |       4.00 MiB |
> |       10 |      1 |            1,088 |           4,352 |          4,352 |       4.25 KiB |
> |        1 |      1 |               64 |             256 |            256 |       0.25 KiB |
>
> I think your proposed change has a slightly different pattern of
> generated numbers. If we want to keep the original pattern, a quick fix
> might be lazily generating arrays to reduce the peak memory size. WDYT?
>
> Best,
> Gang
>
>
> On Wed, Jan 7, 2026 at 6:23 AM Aaron Niskode-Dossett via dev <
> [email protected]> wrote:
>
> > The test
> > TestByteBitPacking512VectorLE.unpackValuesUsingVectorBitWidth(TestByteBitPacking512VectorLE
> > is flaky in the Parquet GitHub PR testing environment [1].
> >
> > I gave the error to Codex (the OpenAI coding agent) and asked it to fix
> > the test.
> > However, since I don't have enough confidence in my own understanding
> > of the problem or the fix, I have not opened a PR. The fix can be found
> > on my fork here:
> > https://github.com/dossett/parquet-java/commit/7635c8599524aadee1164fc2168801c51390b118
> >
> > The Codex summary of the problem and the fix is this:
> >
> > We addressed CI OOMs in TestByteBitPacking512VectorLE
> > (parquet-encoding-vector) by bounding the test input size while keeping
> > the same correctness coverage. The original getRangeData could allocate
> > arrays on the order of hundreds of millions of ints per bit width, which
> > can consume tens of GB of heap and fail in constrained CI environments.
> >
> > The updated test generates a single bounded dataset (min 64, max 2^20
> > values) and spans the full legal value range for each bit width
> > (including the full signed int range for 32-bit). The vector and scalar
> > pack/unpack paths are still compared for equality across bit widths, but
> > without the unbounded memory stress that was causing flakiness.
> >
> > I would appreciate any feedback on that, or alternatively other ways to
> > address the flaky test; I found it very frustrating recently when I was
> > opening several PRs.
> >
> > Cheers,
> > Aaron
> >
> > [1] Example failure:
> > https://github.com/apache/parquet-java/actions/runs/20671204311/job/59352228516?pr=3385
> >
> > --
> > Aaron Niskode-Dossett, Data Engineering -- Etsy
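
As a footnote: most rows of Gang's table can be roughly reproduced with a back-of-the-envelope formula, namely max(1, 2^(bitWidth - 28)) arrays of 2^min(bitWidth, 28) + 64 four-byte ints each. This is my reading of the numbers, not code from the repository, and a couple of rows (bitWidth 24 and 1) differ slightly from it, so treat it as approximate:

```java
// Back-of-the-envelope estimate inferred from the table above (an
// assumption on my part, not actual parquet-java test code).
public class PackingMemoryEstimate {
    static long totalBytes(int bitWidth) {
        // One array up to bitWidth 28; doubles for each extra bit beyond that.
        long arrays = bitWidth > 28 ? 1L << (bitWidth - 28) : 1L;
        // Each array holds roughly 2^min(bitWidth, 28) + 64 four-byte ints.
        long perArrayLength = (1L << Math.min(bitWidth, 28)) + 64;
        return arrays * perArrayLength * 4L;
    }

    public static void main(String[] args) {
        System.out.println(totalBytes(32)); // 17179873280 (~16 GiB)
        System.out.println(totalBytes(20)); // 4194560 (~4 MiB)
    }
}
```

This makes it easy to see why only the highest bit widths are a problem: total memory is flat at roughly 4 MiB times 2^(bitWidth - 20) up to 28 bits, then doubles per bit because the array count grows while per-array size is capped.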
