Re: [PR] AVRO-4069: Remove Reader String Cache from Generic Datum Reader [avro]

via GitHub Sat, 01 Feb 2025 20:52:01 -0800


belugabehr commented on PR #3194:
URL: https://github.com/apache/avro/pull/3194#issuecomment-2629242178


   I am taking another pass at this one. The performance improvements are just 
too significant to pass up (11%+ in my local benchmarhks). Here are my 
benchmarks:
   
   ```
   # JMH version: 1.35
   # VM version: JDK 17.0.13, OpenJDK 64-Bit Server VM, 
17.0.13+11-Ubuntu-2ubuntu124.04
   # VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
   
   # This is from AVRO-4069 (the worst result I measured for this branch)
   Result 
"com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
     889717.049 ±(99.9%) 9103.527 ops/s [Average]
     (min, avg, max) = (860194.914, 889717.049, 909429.461), stdev = 13625.731
     CI (99.9%): [880613.522, 898820.576] (assumes normal distribution)
     
     
   # This is from master (the best result I measured for this branch)
   Result 
"com.szymon_kaluza.protobuf_avro.benchmark.AvroBench.serializeAndDeserializeSmall":
     796084.662 ±(99.9%) 12190.142 ops/s [Average]
     (min, avg, max) = (764448.008, 796084.662, 850984.160), stdev = 18245.632
     CI (99.9%): [783894.521, 808274.804] (assumes normal distribution)
   ```
   
   The reason I even investigated this is because Visual VM profiling was 
lighting up on `IdentityHashMap` usage. By migrating to HashMap, I was able to 
achieve most of the performance gains. Also, another thing that I hadn't 
considered before is that `IdentityHashMap` uses object reference-equality for 
insertions and lookups. This means that the Map can look different for each 
instantiation of Avro because obviously the reference values will be different 
every time it runs. This made benchmarking kind of tricky and unreliable as 
well. If the keys just happened to have a lot of collisions, then performance 
would suffer, alternatively, benchmarking improved if the keys happen to be 
laid out more evenly in the bucket space. Yet another reason to simply get off 
this implementation.
   
   There is one caveat in the JavaDocs:
   
   > For many JRE implementations and operation mixes, this class will yield 
better performance than 
[HashMap](https://docs.oracle.com/javase/8/docs/api/java/util/HashMap.html) 
(which uses chaining rather than linear-probing).
   
   * https://docs.oracle.com/javase/8/docs/api/java/util/IdentityHashMap.html
   
   I get around this by exploiting the fact that the number of items in the map 
is fixed as instantiation. Therefore, I use a Map size sufficiently large such 
that there are no key collisions and therefore no chaining involved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] AVRO-4069: Remove Reader String Cache from Generic Datum Reader [avro]

Reply via email to