Brilliant, thanks for the reminder. I just merged the PR.

Kind regards,
Fokko

On 2026/05/11 18:43:00 Steve Loughran wrote:
> Fokko, could you merge my benchmark test PR, you've already approved it.
> Benchmark only as Neelesh picked up the only performance change I'd done in
> my PR
>
> https://github.com/apache/parquet-java/pull/3452
>
> thanks
>
> On Mon, 11 May 2026 at 17:12, Fokko Driesprong <[email protected]> wrote:
>
> > Thanks Ismaël for working on this. I did a first round of reviews with
> > great interest, and I'll do another one soon.
> >
> > I noticed that there is some overlap with the work by André (
> > https://github.com/apache/parquet-java/issues?q=is%3Apr+is%3Aopen+author%3Aarouel)
> > maybe it would be good to align the effort.
> >
> > Thanks!
> >
> > Kind regards,
> > Fokko
> >
> > On 2026/04/29 10:13:43 Steve Loughran wrote:
> > > there's a JMH comparer tool at
> > > https://github.com/JohnTortugo/jmh-tabulate
> > > ...
> > >
> > > Even though it comes from an AWS engineer I did review that code for
> > > security, and even got Claude to (dynamically) generate the config file
> > > needed to run the project in a chroot-style sandbox on macOS. Only
> > > tangible risk is the chart.js file, and now that's cryptographically
> > > locked down.
> > >
> > > https://github.com/steveloughran/jmh-tabulate/tree/hardened
> > >
> > > Nobody should be pulling head dependencies from NPM repos; hard-coded
> > > version numbers can be subverted by new tags. Hash codes are the only
> > > thing to trust for something you run on file://
> > > Even if you bypass the sandbox, the .html file generated does enforce
> > > chart.js version integrity. So all should be good.
> > >
> > > Given all that, what do your numbers look like?
> > >
> > > On Wed, 29 Apr 2026 at 08:28, Ismaël Mejía <[email protected]> wrote:
> > >
> > > > Hi dev@,
> > > >
> > > > I’ve been working on performance improvements across the main
> > > > encoding/decoding hot paths of Apache Parquet Java. I presented this
> > > > work during last week’s Parquet community sync and I am sharing a
> > > > summary here for broader visibility, in line with Apache best
> > > > practices.
> > > >
> > > > Using AI-assisted tools and JMH, I expanded the existing coverage of
> > > > microbenchmarks covering critical hot paths. I then iterated on a
> > > > series of optimizations, validated for correctness, and reviewed with
> > > > other AI tools. The results are promising.
> > > >
> > > > The improvements focus on eliminating per-value overhead in the hot
> > > > loops without changing the file format or public API. Key changes:
> > > >
> > > > - Plain INT32/LONG: bulk System.arraycopy instead of per-value
> > > >   ByteBuffer.putInt (~4x encode, ~3x decode)
> > > > - ByteStreamSplit: zero-allocation batch scatter/gather (3-5x encode,
> > > >   2x decode)
> > > > - Dictionary encoding: custom open-addressing hash map replacing
> > > >   java.util.HashMap (up to 80x for low-cardinality string columns)
> > > > - RLE dictionary index decoder: direct ByteBuffer access bypassing
> > > >   InputStream
> > > > - New batch read APIs: readIntegers()/readLongs() for vectorized
> > > >   consumers
> > > >
> > > > End-to-end file read/write throughput improves by ~13–14% on average
> > > > across codecs in my test suite (Java 11, AMD EPYC). Full JMH results
> > > > (303 benchmarks) and a more detailed write-up will follow.
> > > >
> > > > Most changes have been grouped and tracked under the following issue,
> > > > which provides background and links to the related pull requests
> > > > https://github.com/apache/parquet-java/issues/3530
> > > >
> > > > The first set of pull requests is ready for review. Feedback and
> > > > comments from Java committers would be greatly appreciated.
> > > >
> > > > Thanks,
> > > > Ismaël
> > > >
> > > > PS. Kudos to Fokko Driesprong who already started reviewing some of
> > > > them.
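
For anyone following the thread who has not looked at the ByteStreamSplit item in Ismaël's list, here is a minimal sketch of what a batch scatter/gather over a caller-supplied buffer could look like for FLOAT values. It is not the code from the pull requests: the class name ByteStreamSplitSketch and the methods encode/decode are hypothetical, and the only thing taken as given is the layout of the encoding itself (byte k of every value goes to stream k, least significant byte first). The caller owns both arrays, so nothing is allocated per value.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the parquet-java implementation.
public class ByteStreamSplitSketch {

    // Scatter: byte k of value i is written to out[k * n + i].
    // 'out' must hold at least values.length * 4 bytes.
    static void encode(float[] values, byte[] out) {
        int n = values.length;
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            out[i]         = (byte) bits;          // stream 0: least significant byte
            out[n + i]     = (byte) (bits >>> 8);  // stream 1
            out[2 * n + i] = (byte) (bits >>> 16); // stream 2
            out[3 * n + i] = (byte) (bits >>> 24); // stream 3: most significant byte
        }
    }

    // Gather: rebuild value i from one byte of each of the four streams.
    // 'in' must hold at least out.length * 4 bytes.
    static void decode(byte[] in, float[] out) {
        int n = out.length;
        for (int i = 0; i < n; i++) {
            int bits = (in[i] & 0xFF)
                    | (in[n + i] & 0xFF) << 8
                    | (in[2 * n + i] & 0xFF) << 16
                    | (in[3 * n + i] & 0xFF) << 24;
            out[i] = Float.intBitsToFloat(bits);
        }
    }

    public static void main(String[] args) {
        float[] values = {1.0f, -2.5f, 3.25f};
        byte[] encoded = new byte[values.length * Float.BYTES];
        float[] decoded = new float[values.length];

        encode(values, encoded);
        decode(encoded, decoded);

        System.out.println("round trip ok: " + Arrays.equals(values, decoded));
    }
}
```

Both loops touch only primitive arrays, which is the general shape the "eliminating per-value overhead" remark points at; whether the real change works on arrays, ByteBuffers, or something else is for the PRs themselves to say.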
