Fokko, could you merge my benchmark test PR? You've already approved it.
It is benchmark-only now, as Neelesh picked up the only performance change
I'd done in my PR.

https://github.com/apache/parquet-java/pull/3452

thanks

On Mon, 11 May 2026 at 17:12, Fokko Driesprong <[email protected]> wrote:

> Thanks Ismaël for working on this. I did a first round of reviews with
> great interest, and I'll do another one soon.
>
> I noticed that there is some overlap with the work by André (
> https://github.com/apache/parquet-java/issues?q=is%3Apr+is%3Aopen+author%3Aarouel)
> maybe it would be good to align the effort.
>
> Thanks!
>
> Kind regards,
> Fokko
>
> On 2026/04/29 10:13:43 Steve Loughran wrote:
> > there's a JMH comparer tool at
> > https://github.com/JohnTortugo/jmh-tabulate
> > ...
> >
> > Even though it comes from an AWS engineer, I did review that code for
> > security, and even got Claude to (dynamically) generate the config file
> > needed to run the project in a chroot-style sandbox on macOS. The only
> > tangible risk is the chart.js file, and now that's cryptographically
> > locked down.
> >
> > https://github.com/steveloughran/jmh-tabulate/tree/hardened
> >
> > Nobody should be pulling head dependencies from NPM repos, and hard-coded
> > version numbers can be subverted by new tags. Hash codes are the only
> > thing to trust for something you run on file://
> > Even if you bypass the sandbox, the generated .html file does enforce
> > chart.js version integrity. So all should be good.
> >
> > Given all that, what do your numbers look like?
> >
> >
> >
> >
> > On Wed, 29 Apr 2026 at 08:28, Ismaël Mejía <[email protected]> wrote:
> >
> > > Hi dev@,
> > >
> > > I’ve been working on performance improvements across the main
> > > encoding/decoding hot paths of Apache Parquet Java. I presented this
> > > work during last week’s Parquet community sync and I am sharing a
> > > summary here for broader visibility, in line with Apache best
> > > practices.
> > >
> > > Using AI-assisted tools and JMH, I expanded the existing microbenchmark
> > > coverage of critical hot paths. I then iterated on a series of
> > > optimizations, validated them for correctness, and reviewed them with
> > > other AI tools. The results are promising.
> > >
> > > The improvements focus on eliminating per-value overhead in the hot
> > > loops without changing the file format or public API. Key changes:
> > >
> > > - Plain INT32/LONG: bulk System.arraycopy instead of per-value
> > > ByteBuffer.putInt (~4x encode, ~3x decode)
> > > - ByteStreamSplit: zero-allocation batch scatter/gather (3-5x encode,
> > > 2x decode)
> > > - Dictionary encoding: custom open-addressing hash map replacing
> > > java.util.HashMap (up to 80x for low-cardinality string columns)
> > > - RLE dictionary index decoder: direct ByteBuffer access bypassing
> > > InputStream
> > > - New batch read APIs: readIntegers()/readLongs() for vectorized
> > > consumers
> > >
> > > End-to-end file read/write throughput improves by ~13–14% on average
> > > across codecs in my test suite (Java 11, AMD EPYC). Full JMH results
> > > (303 benchmarks) and a more detailed write-up will follow.
> > >
> > > Most changes have been grouped and tracked under the following issue,
> > > which provides background and links to the related pull requests
> > > https://github.com/apache/parquet-java/issues/3530
> > >
> > > The first set of pull requests is ready for review. Feedback and
> > > comments from Java committers would be greatly appreciated.
> > >
> > > Thanks,
> > > Ismaël
> > >
> > > PS. Kudos to Fokko Driesprong, who has already started reviewing some
> > > of them.
> > >
> >
>
