Hi Aaron and Gang Wu,

I have opened a PR to fix this issue by lazily initializing data and reusing 
temporary arrays. Please take a look.

Issue link: https://github.com/apache/parquet-java/issues/3404

PR link: https://github.com/apache/parquet-java/pull/3405
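
For context, the shape of the change is roughly the following (a simplified
sketch of the idea, not the exact PR code):

    // Simplified sketch: build each per-bit-width chunk only when the test asks
    // for it, and reuse one scratch buffer instead of holding every array alive.
    private int[] scratch;

    private int[] rangeDataFor(int bitWidth, long offset, int length) {
      if (scratch == null || scratch.length < length) {
        scratch = new int[length];              // allocated lazily, grown on demand
      }
      long min = bitWidth == 32 ? Integer.MIN_VALUE : 0L;
      for (int i = 0; i < length; i++) {
        scratch[i] = (int) (min + offset + i);  // same sequential pattern, produced on the fly
      }
      return scratch;
    }

With that, the peak heap stays around one chunk's worth of ints rather than the
multi-GiB totals in the table below.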


On 2026/01/20 08:37:20 Gang Wu wrote:
> Hi Aaron,
> 
> Sorry for the late reply!
> 
> Below is the table to roughly estimate memory consumption with different
> bit widths:
> 
> | bitWidth | arrays | per-array length | per-array bytes | total bytes    | total (binary) |
> |---------:|-------:|-----------------:|----------------:|---------------:|---------------:|
> |       32 |     16 |      268,435,520 |   1,073,742,080 | 17,179,873,280 | 16.00 GiB      |
> |       31 |      8 |      268,435,520 |   1,073,742,080 |  8,589,936,640 | 8.00 GiB       |
> |       30 |      4 |      268,435,520 |   1,073,742,080 |  4,294,968,320 | 4.00 GiB       |
> |       29 |      2 |      268,435,520 |   1,073,742,080 |  2,147,484,160 | 2.00 GiB       |
> |       28 |      1 |      268,435,520 |   1,073,742,080 |  1,073,742,080 | 1.00 GiB       |
> |       27 |      1 |      134,217,792 |     536,871,168 |    536,871,168 | 512.00 MiB     |
> |       26 |      1 |       67,108,928 |     268,435,712 |    268,435,712 | 256.00 MiB     |
> |       25 |      1 |       33,554,496 |     134,217,984 |    134,217,984 | 128.00 MiB     |
> |       24 |      1 |       16,777,280 |      67,108,480 |     67,108,480 | 64.00 MiB      |
> |       20 |      1 |        1,048,640 |       4,194,560 |      4,194,560 | 4.00 MiB       |
> |       10 |      1 |            1,088 |           4,352 |          4,352 | 4.25 KiB       |
> |        1 |      1 |               64 |             256 |            256 | 0.25 KiB       |
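> 
> To read a row, take bitWidth = 32 as an example: 268,435,520 ints per array
> * 4 bytes per int = 1,073,742,080 bytes per array, and 16 such arrays come
> to 17,179,873,280 bytes, i.e. roughly 16 GiB of int[] data for that single
> bit width.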
> 
> I think your proposed change has a slightly different pattern of generated
> numbers. If we want to keep the original pattern, a quick fix might be
> lazily generating arrays to reduce the peak memory size. WDYT?
> 
> Best,
> Gang
> 
> 
> On Wed, Jan 7, 2026 at 6:23 AM Aaron Niskode-Dossett via dev <
> [email protected]> wrote:
> 
> > The test
> > TestByteBitPacking512VectorLE.unpackValuesUsingVectorBitWidth(TestByteBitPacking512VectorLE
> > is flaky in the Parquet GitHub PR testing environment [1].
> >
> > I gave the error to Codex (the OpenAI coding agent) and asked it to fix the
> > test. However, since I don't have enough confidence in my own
> > understanding of the problem or the fix, I have not opened a PR. The fix
> > can be found on my fork here:
> > https://github.com/dossett/parquet-java/commit/7635c8599524aadee1164fc2168801c51390b118
> >
> > The Codex summary of the problem and the fix is this:
> >
> > We addressed CI OOMs in TestByteBitPacking512VectorLE
> > (parquet-encoding-vector) by bounding the test input size while keeping the
> > same correctness coverage. The original getRangeData could allocate arrays
> > on the order of hundreds of millions of ints per bit width, which can
> > consume tens of GB of heap and fail in constrained CI environments.
> >
> > The updated test generates a single bounded dataset (min 64, max 2^20
> > values) and spans the full legal value range for each bit width (including
> > the full signed int range for 32-bit).  The vector and scalar pack/unpack
> > paths are still compared for equality across bit widths, but without the
> > unbounded memory stress that was causing flakiness.
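> >
> > A rough idea of what such a bounded generator could look like (a
> > hypothetical sketch; the actual code is in the commit linked above):
> >
> >     // Hypothetical sketch: clamp the value count to [64, 2^20] and stride
> >     // across the legal value range for the given bit width.
> >     static int[] boundedRangeData(int bitWidth) {
> >       long range = bitWidth == 32 ? (1L << 32) : (1L << bitWidth);
> >       int count = (int) Math.min(Math.max(range, 64L), 1L << 20);
> >       long min = bitWidth == 32 ? Integer.MIN_VALUE : 0L;
> >       long step = Math.max(1L, range / count);
> >       int[] data = new int[count];
> >       for (int i = 0; i < count; i++) {
> >         data[i] = (int) (min + (i * step) % range); // stays within the legal values
> >       }
> >       return data;
> >     }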
> >
> > I would appreciate any feedback on that, or alternatively other ways to
> > address the flaky test; I found it very frustrating recently when I was
> > opening several PRs.
> >
> > Cheers, Aaron
> >
> > [1] Example failure:
> >
> > https://github.com/apache/parquet-java/actions/runs/20671204311/job/59352228516?pr=3385
> >
> > --
> > Aaron Niskode-Dossett, Data Engineering -- Etsy
> >
> 
