[ 
https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682085#comment-16682085
 ] 

ASF GitHub Bot commented on AVRO-2247:
--------------------------------------

unchuckable commented on issue #354: AVRO-2247 - improved java reading 
performance with new reader
URL: https://github.com/apache/avro/pull/354#issuecomment-437531886
 
 
   Thanks for your feedback, raymie. Hadn't expected to receive that much 
input, but I'll try to address your points:
   
   * I moved the feature switch from `GenericData.createDatumReader` to 
`GenericDatumReader.read`. (even tho I don't quite get why the factory method 
isn't the default way to instantiate a new reader, can someone help me 
understand?). This way, the newly written code is subject to ALL unit tests of 
the project. (Admittedly, I had to disable the `ReusingArrayReader` due to some 
problems there, will address that later on if desired). With little adjustments 
(most of which having been fixing the Exception types and messages), all tests 
seem to pass. If there's further test frameworks to use, please do let me know.
   
   * I am well aware of the deviations in performance measurements. That's why 
everyone should take `Perf.java` results with a grain of salt. Most tests there 
are VERY sensitive on even small changes and have a too large spread. Actually, 
for performance measurement, a module that makes use of JMH or similar would be 
preferable (ideally one that not only measures time, but also allocation 
activity). Also, structures of different depth and diversity should be checked.
   
   * It's nice to see you having pursued similar ideas with your branch. I 
think the main difference is that I am trying to do all verifications ONCE at 
Reader creation, while your take still makes use of the existing `DatumReader`, 
which requires some avoidable lookups at read time (things like 
`SpecificData.getClass()`, etc ) and `ResolvingDecoder`, which still relies on 
parser information for nearly every read operation. My approach is 
**replacing** usage of `ResolvingDecoder` and `DatumReader`, using a different 
approach that makes all necessary decisions only once where possible, storing 
results in instance variables instead of maps (hoping to affect performance 
positively that way, see note above). Downside of my approach in its current 
form is that it only works for Generic and Specific records. Have not looked 
into what needs to be changed in order to use the other kinds of data 
(reflective, protobuf, thrift), but I consider generic and specific records to 
be the most important use case.
   
   * Sadly, the mechanism I present does not take advantage of the generated 
reader code of AVRO-2090, but offers performance benefits in a similar extent 
and works for all kinds of generic and specific records
   
   * The big benefit of the current approach is the strong speedup when dealing 
with default values. Maybe a huge part of the gain could be achieved with a 
smaller change, I do agree.
   
   I would be very grateful for some feedback on whether you consider the 
current approach I present worth spending more time on or whether there are 
more/other things that would keep it from being considered beneficial for the 
project.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Improve Java reading performance with a new reader
> --------------------------------------------------
>
>                 Key: AVRO-2247
>                 URL: https://issues.apache.org/jira/browse/AVRO-2247
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Martin Jubelgas
>            Priority: Major
>             Fix For: 1.9.0
>
>         Attachments: Perf-Comparison.md
>
>
> Complementary to AVRO-2090, I have been working on decoding of Avro objects 
> in Java and am suggesting a new implementation of a DatumReader that improves 
> read performance for both generic and specific records by approximately 20% 
> (and even more in cases of nested objects with defaults, a case I encounter a 
> lot in practical use).
> Key concept is to create a detailed execution plan once at DatumReader. This 
> execution plan contains all required defaulting/lookup values so they need 
> not be looked up during object traversal while reading.
> The reader implementation can be enabled and disabled per GenericData 
> instance. The system default is set via the system variable 
> "org.apache.avro.fastread" (defaults to "false").
> Attached a performance comparison of the existing implementation with the 
> proposed one. Will open a pull request with respective code in a bit (not 
> including interoperability with the optimizations of AVRO-2090 yet). Please 
> let me know your opinion of whether this is worth pursuing further.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to