masumi-ryugo commented on PR #9884: URL: https://github.com/apache/arrow-rs/pull/9884#issuecomment-4378746608
Thanks @etseidl, and apologies for the noise on this one; you read it correctly.

**Disclosure**: yes, this PR was drafted with AI assistance, and on a re-read against [the AI-generated code submission guidelines](https://arrow.apache.org/docs/dev/developers/overview.html#ai-generated-code) it hits all three of the pitfalls flagged there: overly verbose comments, unnecessary test cases, and a fix whose framing didn't match the actual bug. I should have caught that before sending. Going forward I'll do the same self-review pass against that guideline before opening any AI-assisted PR.

**Substance check on your review**: you're right that the original framing was wrong. Walking through it on the actual code:

- `Some(-1) as usize` → `usize::MAX` → `Vec::with_capacity(usize::MAX)` does panic with `capacity overflow` from `alloc::raw_vec::handle_error`. **This is the real bug**, and `usize::try_from(n)?` is the right fix.
- `Some(i32::MAX) as usize` → `Vec::with_capacity(i32::MAX)` for `SchemaElement` (≈96 bytes) requests ~200 GB. On 64-bit Linux that's well under `isize::MAX`, so the `alloc_guard` check passes; with default overcommit the allocation succeeds lazily and **never panics on main**. So the `i32::MAX` test wasn't exercising a real codepath, and the remaining-elements bound was protecting against a panic the platform doesn't actually produce.
- The 22-byte fuzzer repro: as you noted, it doesn't error on main. It was tagged as a regression test because libfuzzer/ASan reported it as an OOM, but those report any large allocation as OOM regardless of whether the native allocator would have succeeded. It isn't a regression test for the codebase's behavior; my mistake to label it that way.

**Force-pushed `dc1793f`** (was `5847354`). Diff is now `parquet/src/schema/types.rs +19/-1`:

- Applied your suggested edit verbatim: a single-line change to use `Vec::with_capacity(usize::try_from(n)?)`.
- Removed the 10-line comment block.
- Removed the remaining-elements bound check.
- Removed the `i32::MAX` overflow test and the 22-byte fuzzer-repro test (neither tested what it claimed).
- Kept `test_parquet_schema_from_array_rejects_negative_num_children`; this one does fail on `5847354`'s parent (panics with `capacity overflow`) and passes here.
- PR title and commit message rewritten to match the actual fix.

`cargo test -p parquet --release --lib schema::`: 105 passed. `cargo fmt --check` and `cargo clippy --no-deps`: clean.

If you'd like the remaining-elements consistency check (catching a `num_children` that is positive but exceeds the schema-array tail), I'm happy to send it as a follow-up PR with a real failure case, i.e. a multi-element schema where the inner recursion currently produces a less clear error than an upfront bound would. Otherwise, glad to drop it entirely.
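For anyone skimming the thread, the negative-`num_children` failure mode and the `try_from` fix can be sketched in isolation. This is a minimal illustration, not the actual `parquet` internals; the function names and the `u64` element type are made up for the example:

```rust
use std::num::TryFromIntError;

// Buggy shape: a Thrift i32 count cast straight to usize.
// -1 as usize wraps to usize::MAX, and Vec::with_capacity(usize::MAX)
// panics with "capacity overflow" inside alloc::raw_vec.
fn alloc_children_unchecked(num_children: i32) -> Vec<u64> {
    Vec::with_capacity(num_children as usize)
}

// Fixed shape: usize::try_from rejects negative values with an Err,
// which the caller can surface as a parse error instead of a panic.
fn alloc_children_checked(num_children: i32) -> Result<Vec<u64>, TryFromIntError> {
    Ok(Vec::with_capacity(usize::try_from(num_children)?))
}

fn main() {
    // The negative case now returns Err instead of panicking.
    assert!(alloc_children_checked(-1).is_err());
    assert_eq!(alloc_children_checked(3).unwrap().capacity(), 3);

    // Demonstrate the original panic without aborting the program.
    let panicked =
        std::panic::catch_unwind(|| alloc_children_unchecked(-1)).is_err();
    assert!(panicked);
}
```

In the real code the `?` converts the `TryFromIntError` into the crate's error type rather than returning it directly, but the shape of the fix is the same.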
