Hi Aaron,

Sorry for the late reply!

Below is a table that roughly estimates memory consumption at different
bit widths:

| bitWidth | arrays | per-array length | per-array bytes | total bytes    | total (binary) |
|---------:|-------:|-----------------:|----------------:|---------------:|---------------:|
|       32 |     16 |      268,435,520 |   1,073,742,080 | 17,179,873,280 |      16.00 GiB |
|       31 |      8 |      268,435,520 |   1,073,742,080 |  8,589,936,640 |       8.00 GiB |
|       30 |      4 |      268,435,520 |   1,073,742,080 |  4,294,968,320 |       4.00 GiB |
|       29 |      2 |      268,435,520 |   1,073,742,080 |  2,147,484,160 |       2.00 GiB |
|       28 |      1 |      268,435,520 |   1,073,742,080 |  1,073,742,080 |       1.00 GiB |
|       27 |      1 |      134,217,792 |     536,871,168 |    536,871,168 |     512.00 MiB |
|       26 |      1 |       67,108,928 |     268,435,712 |    268,435,712 |     256.00 MiB |
|       25 |      1 |       33,554,496 |     134,217,984 |    134,217,984 |     128.00 MiB |
|       24 |      1 |       16,777,280 |      67,109,120 |     67,109,120 |      64.00 MiB |
|       20 |      1 |        1,048,640 |       4,194,560 |      4,194,560 |       4.00 MiB |
|       10 |      1 |            1,088 |           4,352 |          4,352 |       4.25 KiB |
|        1 |      1 |               64 |             256 |            256 |       0.25 KiB |
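
For reference, a minimal sketch that reproduces the numbers above. The
block-of-64 rounding and the 28-bit per-array cap are my reading of how
getRangeData sizes its output, so treat the formulas as assumptions
rather than the exact implementation:

```java
// Rough estimate of the peak allocation in the original test data generator.
public class BitPackingMemoryEstimate {
  public static void main(String[] args) {
    for (int bitWidth : new int[] {32, 31, 30, 29, 28, 27, 26, 25, 24, 20, 10, 1}) {
      // Values come in 64-element blocks: (2^w / 64 + 1) * 64 per array,
      // with each array capped at the 28-bit size (assumed formula).
      long blocks = (1L << bitWidth) / 64 + 1;
      long cap = (1L << 28) / 64 + 1;
      long perArrayLength = Math.min(blocks, cap) * 64;
      // Widths above 28 spill into multiple arrays.
      long arrays = Math.max(1, (1L << bitWidth) >> 28);
      long perArrayBytes = perArrayLength * Integer.BYTES; // int = 4 bytes
      long totalBytes = arrays * perArrayBytes;
      System.out.printf("bitWidth=%2d arrays=%2d perArrayLength=%,d total=%,d bytes%n",
          bitWidth, arrays, perArrayLength, totalBytes);
    }
  }
}
```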

I think your proposed change produces a slightly different pattern of
generated numbers. If we want to keep the original pattern, a quick fix
might be to generate the arrays lazily, reducing the peak memory
footprint without changing the values.
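A rough sketch of that, where generateArray is a hypothetical stand-in
for the per-array slice of the current getRangeData logic:

```java
import java.util.function.IntFunction;

// Sketch: build one input array at a time instead of materializing all
// of them up front, so peak heap is a single array rather than
// arrays * perArrayBytes.
public class LazyRangeData {
  static void verifyPackUnpack(int arrays, IntFunction<int[]> generateArray) {
    for (int i = 0; i < arrays; i++) {
      int[] input = generateArray.apply(i); // allocated only when needed
      // ... run the vector and scalar pack/unpack on `input` and compare ...
      // `input` becomes unreachable before the next iteration allocates.
    }
  }
}
```

The generated values stay identical; only their lifetime changes. WDYT?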

Best,
Gang


On Wed, Jan 7, 2026 at 6:23 AM Aaron Niskode-Dossett via dev <
[email protected]> wrote:

> The test
> TestByteBitPacking512VectorLE.unpackValuesUsingVectorBitWidth(TestByteBitPacking512VectorLE
> is flaky in the Parquet GitHub PR testing environment [1].
>
> I gave the error to Codex (the OpenAI coding agent) and asked it to fix the
> test.  However, since I don't have enough confidence in my own
> understanding of the problem or the fix, I have not opened a PR.  The fix
> can be found on my fork here:
> <https://github.com/dossett/parquet-java/commit/7635c8599524aadee1164fc2168801c51390b118>.
>
> The Codex summary of the problem and the fix is this:
>
> We addressed CI OOMs in TestByteBitPacking512VectorLE
> (parquet-encoding-vector) by bounding the test input size while keeping the
> same correctness coverage. The original getRangeData could allocate arrays
> on the order of hundreds of millions of ints per bit width, which can
> consume tens of GB of heap and fail in constrained CI environments.
>
> The updated test generates a single bounded dataset (min 64, max 2^20
> values) and spans the full legal value range for each bit width (including
> the full signed int range for 32‑bit).  The vector and scalar pack/unpack
> paths are still compared for equality across bit widths, but without the
> unbounded memory stress that was causing flakiness.
>
> I would appreciate any feedback on that, or other ways to address the
> flaky test; I found it very frustrating recently when I was opening
> several PRs.
>
> Cheers, Aaron
>
> [1] Example failure:
>
> https://github.com/apache/parquet-java/actions/runs/20671204311/job/59352228516?pr=3385
>
> --
> Aaron Niskode-Dossett, Data Engineering -- Etsy
>
