Re: [PR] fix(parquet): bound data page byte size for large variable-width values [arrow-rs]

via GitHub Thu, 04 Jun 2026 06:24:56 -0700


adriangb commented on PR #9972:
URL: https://github.com/apache/arrow-rs/pull/9972#issuecomment-4622552203


   @alamb I'm pretty convinced the benchmark regressions are just noise.
   
   For example: `main` vs. `main` in 
https://github.com/apache/arrow-rs/pull/9972#issuecomment-4618263660 
`string/parquet_2` has an 18% swing. I ran a lot of testing on the exact same 
hardware as these bench runners and found that running any benchmark in 
isolation produces a <1% change on main vs. this branch. I measured instruction 
count and there was only a ~0.3% increase in instructions, not enough to cause 
a 70% performance impact. Running the entire suite I *was* able to reproduce 
the regressions, although not consistently. This led me to investigate 
allocator impact.
   
   The conclusion was that this is caused by rolling the dice on allocations, 
see https://github.com/apache/arrow-rs/pull/10068. These benches are all 
allocation heavy so any "bad luck" in ordering / timing of allocations (re-use 
vs. page fault) can result in big swings in results. With 
https://github.com/apache/arrow-rs/pull/10068 applied I see no regressions here 
even running the full suite.
   
   My suggestion is that we don't over-index on this and move forward with this 
change if everything else looks good. If you are happy with 
https://github.com/apache/arrow-rs/pull/10068 we can also merge that and re-run 
the benches, but I'm afraid that would delay the release and I do think it'd be 
nice to include this change in the release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(parquet): bound data page byte size for large variable-width values [arrow-rs]

Reply via email to