Brilliant, thanks for the reminder. I just merged the PR.

Kind regards,
Fokko

On 2026/05/11 18:43:00 Steve Loughran wrote:
> Fokko, could you merge my benchmark test PR? You've already approved it.
> It's benchmark-only, as Neelesh picked up the only performance change I'd done in
> my PR.
> 
> https://github.com/apache/parquet-java/pull/3452
> 
> thanks
> 
> On Mon, 11 May 2026 at 17:12, Fokko Driesprong <[email protected]> wrote:
> 
> > Thanks Ismaël for working on this. I did a first round of reviews with
> > great interest, and I'll do another one soon.
> >
> > I noticed that there is some overlap with the work by André
> > (https://github.com/apache/parquet-java/issues?q=is%3Apr+is%3Aopen+author%3Aarouel);
> > maybe it would be good to align the efforts.
> >
> > Thanks!
> >
> > Kind regards,
> > Fokko
> >
> > On 2026/04/29 10:13:43 Steve Loughran wrote:
> > > There's a JMH comparer tool at https://github.com/JohnTortugo/jmh-tabulate
> > > ...
> > >
> > > Even though it comes from an AWS engineer, I did review that code for
> > > security, and even got Claude to (dynamically) generate the config file
> > > needed to run the project in a chroot-style sandbox on macOS. The only tangible
> > > risk is the chart.js file, and that's now cryptographically locked down.
> > >
> > > https://github.com/steveloughran/jmh-tabulate/tree/hardened
> > >
> > > Nobody should be pulling head dependencies from NPM repos; hard-coded
> > > version numbers can be subverted by new tags. Hash codes are the only thing
> > > to trust for something you run on file://
> > > Even if you bypass the sandbox, the .html file generated does enforce
> > > chart.js version integrity. So all should be good.
> > >
> > > Given all that, what do your numbers look like?
> > >
> > >
> > >
> > >
> > > On Wed, 29 Apr 2026 at 08:28, Ismaël Mejía <[email protected]> wrote:
> > >
> > > > Hi dev@,
> > > >
> > > > I’ve been working on performance improvements across the main
> > > > encoding/decoding hot paths of Apache Parquet Java. I presented this
> > > > work during last week’s Parquet community sync and I am sharing a
> > > > summary here for broader visibility, in line with Apache best
> > > > practices.
> > > >
> > > > > Using AI-assisted tools and JMH, I expanded the existing microbenchmark
> > > > > coverage of critical hot paths. I then iterated on a
> > > > series of optimizations, validated for correctness, and reviewed with
> > > > other AI tools. The results are promising.
> > > >
> > > > The improvements focus on eliminating per-value overhead in the hot
> > > > loops without changing the file format or public API. Key changes:
> > > >
> > > > - Plain INT32/LONG: bulk System.arraycopy instead of per-value
> > > > ByteBuffer.putInt (~4x encode, ~3x decode)
> > > > > - ByteStreamSplit: zero-allocation batch scatter/gather (3-5x encode, 2x
> > > > > decode)
> > > > - Dictionary encoding: custom open-addressing hash map replacing
> > > > java.util.HashMap (up to 80x for low-cardinality string columns)
> > > > - RLE dictionary index decoder: direct ByteBuffer access bypassing
> > > > InputStream
> > > > > - New batch read APIs: readIntegers()/readLongs() for vectorized consumers
> > > >
> > > > End-to-end file read/write throughput improves by ~13–14% on average
> > > > across codecs in my test suite (Java 11, AMD EPYC). Full JMH results
> > > > (303 benchmarks) and a more detailed write-up will follow.
> > > >
> > > > Most changes have been grouped and tracked under the following issue,
> > > > > which provides background and links to the related pull requests:
> > > > https://github.com/apache/parquet-java/issues/3530
> > > >
> > > > The first set of pull requests is ready for review. Feedback and
> > > > comments from Java committers would be greatly appreciated.
> > > >
> > > > Thanks,
> > > > Ismaël
> > > >
> > > > > PS: Kudos to Fokko Driesprong, who already started reviewing some of them.
> > > >
> > >
> >
> 
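
For readers following the thread, here is a rough sketch of the "bulk instead of per-value" idea from the first bullet in Ismaël's list. It is not the code from the pull requests. The class and method names are hypothetical, and it uses a bulk IntBuffer.put(int[]) through a view buffer as a stand-in for the System.arraycopy approach mentioned in the mail. It assumes little-endian 4-byte output, as Parquet's PLAIN encoding uses for INT32, and it only checks that both paths produce identical bytes. It makes no claim about speed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; not taken from parquet-java.
public class PlainIntEncodeSketch {

    // Baseline: one ByteBuffer.putInt call per value.
    static byte[] encodePerValue(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Bulk variant: a single put(int[]) through an IntBuffer view.
    // The view inherits the byte order set on the ByteBuffer.
    static byte[] encodeBulk(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.asIntBuffer().put(values);
        return buf.array();
    }

    public static void main(String[] args) {
        int[] values = {1, -2, Integer.MAX_VALUE, 0};
        // Both encoders should produce the same little-endian bytes.
        System.out.println(Arrays.equals(encodePerValue(values), encodeBulk(values)));
    }
}
```

Any real comparison of the two paths would need a JMH benchmark like the ones discussed above, since the JIT may treat these loops very differently from what the source suggests.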
