Oh great, thank you for picking this up!

On Thu, Feb 26, 2026 at 9:47 PM Zehua Zou <[email protected]> wrote:

> Hi Aaron and Gang Wu,
>
> I have opened a PR to fix this issue by lazily initializing data and
> reusing temporary arrays. Please take a look.
>
> Issue link: https://github.com/apache/parquet-java/issues/3404
>
> PR link: https://github.com/apache/parquet-java/pull/3405
>
>
> On 2026/01/20 08:37:20 Gang Wu wrote:
> > Hi Aaron,
> >
> > Sorry for the late reply!
> >
> > Below is the table to roughly estimate memory consumption with different
> > bit widths:
> >
> > | bitWidth | arrays | per-array length | per-array bytes | total bytes    | total (binary) |
> > |---------:|-------:|-----------------:|----------------:|---------------:|---------------:|
> > | 32       | 16     | 268,435,520      | 1,073,742,080   | 17,179,873,280 | 16.00 GiB      |
> > | 31       | 8      | 268,435,520      | 1,073,742,080   | 8,589,936,640  | 8.00 GiB       |
> > | 30       | 4      | 268,435,520      | 1,073,742,080   | 4,294,968,320  | 4.00 GiB       |
> > | 29       | 2      | 268,435,520      | 1,073,742,080   | 2,147,484,160  | 2.00 GiB       |
> > | 28       | 1      | 268,435,520      | 1,073,742,080   | 1,073,742,080  | 1.00 GiB       |
> > | 27       | 1      | 134,217,792      | 536,871,168     | 536,871,168    | 512.00 MiB     |
> > | 26       | 1      | 67,108,928       | 268,435,712     | 268,435,712    | 256.00 MiB     |
> > | 25       | 1      | 33,554,496      | 134,217,984     | 134,217,984    | 128.00 MiB     |
> > | 24       | 1      | 16,777,280      | 67,108,480      | 67,108,480     | 64.00 MiB      |
> > | 20       | 1      | 1,048,640       | 4,194,560       | 4,194,560      | 4.00 MiB       |
> > | 10       | 1      | 1,088           | 4,352           | 4,352          | 4.25 KiB       |
> > | 1        | 1      | 64              | 256             | 256            | 0.25 KiB       |
> >
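The arithmetic behind the table's rows can be checked directly: each value is a 4-byte Java `int`, so per-array bytes is the array length times 4, and the total multiplies that by the number of arrays. A minimal check for the 32-bit row (the constants are taken from the table; object-header overhead is ignored):

```java
public class Main {
    public static void main(String[] args) {
        // 32-bit row: 16 arrays of 268,435,520 ints, 4 bytes per int.
        long perArrayBytes = 268_435_520L * 4;  // 1,073,742,080
        long totalBytes = 16 * perArrayBytes;   // 17,179,873,280
        System.out.println(perArrayBytes);
        System.out.println(totalBytes);
        // Convert to GiB (2^30 bytes) as in the table's last column.
        System.out.printf("%.2f GiB%n", totalBytes / Math.pow(1024, 3));
    }
}
```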
> > I think your proposed change has a slightly different pattern of
> > generated numbers. If we want to keep the original pattern, a quick
> > fix might be to lazily generate the arrays, reducing the peak memory
> > size. WDYT?
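The lazy-generation idea above could be sketched roughly as follows. This is only an illustration, not the actual parquet-java code: `lazyRangeData` and its parameters are hypothetical, but it shows the key property, namely that only one array is materialized at a time instead of all of them up front.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class Main {
    // Produce test arrays on demand; peak heap holds one array at a time
    // rather than `arrayCount` arrays simultaneously.
    static Iterator<int[]> lazyRangeData(int bitWidth, int arrayLength, int arrayCount) {
        return new Iterator<int[]>() {
            int produced = 0;
            long next = 0;  // next value to emit, walking the legal range

            @Override
            public boolean hasNext() {
                return produced < arrayCount;
            }

            @Override
            public int[] next() {
                if (!hasNext()) throw new NoSuchElementException();
                int[] data = new int[arrayLength];
                // Mask each value down to the legal range for this bit width.
                long mask = bitWidth == 32 ? 0xFFFFFFFFL : (1L << bitWidth) - 1;
                for (int i = 0; i < arrayLength; i++) {
                    data[i] = (int) (next & mask);
                    next++;
                }
                produced++;
                return data;
            }
        };
    }

    public static void main(String[] args) {
        // Example: bitWidth 3 -> values cycle through 0..7.
        Iterator<int[]> it = lazyRangeData(3, 8, 2);
        while (it.hasNext()) {
            int[] a = it.next();
            StringBuilder sb = new StringBuilder();
            for (int v : a) sb.append(v).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}
```

A consumer that packs and unpacks one array per iteration would keep the same sequence of generated numbers while bounding peak memory to a single array.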
> >
> > Best,
> > Gang
> >
> >
> > On Wed, Jan 7, 2026 at 6:23 AM Aaron Niskode-Dossett via dev <
> > [email protected]> wrote:
> >
> > > The test
> > > TestByteBitPacking512VectorLE.unpackValuesUsingVectorBitWidth
> > > is flaky in the Parquet GitHub PR testing environment [1].
> > >
> > > I gave the error to Codex (the OpenAI coding agent) and asked it to
> > > fix the test. However, since I don't have enough confidence in my own
> > > understanding of the problem or the fix, I have not opened a PR. The
> > > fix can be found on my fork here
> > > <https://github.com/dossett/parquet-java/commit/7635c8599524aadee1164fc2168801c51390b118>.
> > >
> > > The Codex summary of the problem and the fix is this:
> > >
> > > We addressed CI OOMs in TestByteBitPacking512VectorLE
> > > (parquet-encoding-vector) by bounding the test input size while
> > > keeping the same correctness coverage. The original getRangeData
> > > could allocate arrays on the order of hundreds of millions of ints
> > > per bit width, which can consume tens of GB of heap and fail in
> > > constrained CI environments.
> > >
> > > The updated test generates a single bounded dataset (min 64, max 2^20
> > > values) and spans the full legal value range for each bit width
> > > (including the full signed int range for 32-bit). The vector and
> > > scalar pack/unpack paths are still compared for equality across bit
> > > widths, but without the unbounded memory stress that was causing
> > > flakiness.
> > >
> > > I would appreciate any feedback on this, or alternatively other ways
> > > to address the flaky test; I found it very frustrating recently when
> > > I was opening several PRs.
> > >
> > > Cheers, Aaron
> > >
> > > [1] Example failure:
> > > https://github.com/apache/parquet-java/actions/runs/20671204311/job/59352228516?pr=3385
> > >
> > > --
> > > Aaron Niskode-Dossett, Data Engineering -- Etsy
> > >
> >



-- 
Aaron Niskode-Dossett, Data Engineering -- Etsy
