[jira] [Created] (AVRO-2274) Improve resolving performance when schemas don't change

Raymie Stata (JIRA) Sun, 25 Nov 2018 16:46:04 -0800

Raymie Stata created AVRO-2274:
----------------------------------

             Summary: Improve resolving performance when schemas don't change
                 Key: AVRO-2274
                 URL: https://issues.apache.org/jira/browse/AVRO-2274
             Project: Apache Avro
          Issue Type: Improvement
          Components: java
            Reporter: Raymie Stata
            Assignee: Raymie Stata



This change adds decoding optimizations based on the observation that schemas 
don't change very much.  We add special-case paths to optimize the case where a 
_sub_schema of the reader is the same as the corresponding _sub_schema of the 
writer.  The specific cases are:

* In the case of an enumeration, if the reader and writer schemas are the same, 
then we can simply return the tag written by the writer rather than "adjust" it 
as if it might have been re-ordered.  In fact, we can do this (directly return 
the tag written by the writer) as long as the reader schema is an "extension" 
of the writer's, in that it may have added new symbols but hasn't renumbered 
any of the writer's symbols.  In practice, enumerations usually either don't 
change at all or are "extended" as defined here, so this covers the common 
cases.  (Our tests show this optimization improves performance by about 3%.)

* When the reader and writer subschemas are both unions, resolution is 
expensive: we have an outer union preceded by a "writer-union action", but each 
branch of this outer union consists of union-adjust actions, which are 
heavyweight.  We optimize this case when the reader and writer unions are the same: 
we fall back on the standard grammar used for a union, avoiding all these 
adjustments.  Since unions are commonly used to encode "nullable" fields in 
Avro, and nullability rarely changes as a schema evolves, this optimization 
should help many users.  (Our tests show this optimization improves performance 
by 25-30%, a significant win.)

* The "custom code" generated for reading records has to read fields in a loop 
that uses a switch statement to deal with writers that may have re-ordered 
fields.  In most cases, however, fields have not been reordered (especially in 
more complex records with many record sub-schemas).  So we've added a new method to 
ResolvingDecoder called readFieldOrderIfDiff, which is a variant of the 
existing readFieldOrder.  If the field order has indeed changed, then 
readFieldOrderIfDiff returns the new field order, just like readFieldOrder 
does.  However, if the field order hasn't changed, then readFieldOrderIfDiff 
returns null.  We then modified the generation of custom-decoders for records 
to add a special-case path that simply reads the record's fields in order, 
without incurring the overhead of the loop or the switch statement.  (Our tests 
show this optimization improves performance by 8-9%, on top of the 35-40% 
produced by the original custom-coder optimization.)
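
The "extension" condition for enumerations can be illustrated with a small, 
self-contained sketch.  This is not the Avro implementation: the class name 
EnumExtensionCheck and the methods isExtension and resolveTag are hypothetical, 
and symbols are modeled as plain string lists.  It assumes "hasn't renumbered" 
means the writer's symbol list is a prefix of the reader's.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Avro.
class EnumExtensionCheck {

    // True when every writer symbol keeps its ordinal in the reader,
    // i.e. the writer's symbol list is a prefix of the reader's.
    static boolean isExtension(List<String> writerSymbols, List<String> readerSymbols) {
        if (readerSymbols.size() < writerSymbols.size()) {
            return false;
        }
        return readerSymbols.subList(0, writerSymbols.size()).equals(writerSymbols);
    }

    // Maps a writer tag to a reader tag, skipping the lookup when it is safe.
    static int resolveTag(int writerTag, List<String> writerSymbols, List<String> readerSymbols) {
        if (isExtension(writerSymbols, readerSymbols)) {
            return writerTag; // fast path: the tag needs no adjustment
        }
        // Slow path: look the writer's symbol up by name in the reader.
        int readerTag = readerSymbols.indexOf(writerSymbols.get(writerTag));
        if (readerTag < 0) {
            throw new IllegalArgumentException("symbol not known to reader");
        }
        return readerTag;
    }

    public static void main(String[] args) {
        List<String> writer = Arrays.asList("A", "B");
        System.out.println(resolveTag(1, writer, Arrays.asList("A", "B", "C")));
        System.out.println(resolveTag(0, writer, Arrays.asList("B", "A")));
    }
}
```

A real resolver would presumably evaluate the extension check once, when the 
reader and writer schemas are resolved, rather than once per decoded value as 
this sketch does.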

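The union case can be pictured with a toy model.  The sketch below is an 
assumption-laden simplification, not Avro's resolving grammar: branches are 
modeled as type-name strings, matching is by exact name only (no promotion), 
and the names UnionFastPath, branchAdjustments and readerBranch are 
hypothetical.  It only shows the shape of the idea: when the two unions are 
identical, no per-branch adjustment table is needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Avro.
class UnionFastPath {

    // Returns null when the unions are identical (no adjustment needed);
    // otherwise returns, for each writer branch, the matching reader branch
    // index, or -1 when the reader has no such branch.
    static int[] branchAdjustments(List<String> writerBranches, List<String> readerBranches) {
        if (writerBranches.equals(readerBranches)) {
            return null;
        }
        int[] adjust = new int[writerBranches.size()];
        for (int i = 0; i < adjust.length; i++) {
            adjust[i] = readerBranches.indexOf(writerBranches.get(i));
        }
        return adjust;
    }

    // Translates the branch index written by the writer into the reader's index.
    static int readerBranch(int writerIndex, int[] adjust) {
        if (adjust == null) {
            return writerIndex; // identical unions: use the index as written
        }
        int readerIndex = adjust[writerIndex];
        if (readerIndex < 0) {
            throw new IllegalArgumentException("branch not known to reader");
        }
        return readerIndex;
    }

    public static void main(String[] args) {
        List<String> nullableString = Arrays.asList("null", "string");
        int[] same = branchAdjustments(nullableString, nullableString);
        System.out.println(readerBranch(1, same));
        int[] swapped = branchAdjustments(nullableString, Arrays.asList("string", "null"));
        System.out.println(readerBranch(1, swapped));
    }
}
```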

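The record fast path can be sketched in the same spirit.  The code below does 
not use Avro classes: the decoder is stood in for by a queue of already-decoded 
values, the record type User and the helper fieldOrderIfDiff are hypothetical, 
and field positions are plain ints.  The helper mirrors the contract described 
above for readFieldOrderIfDiff: null when the order is unchanged, otherwise the 
writer's order.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical illustration only; not generated Avro code.
class FieldOrderFastPath {

    // Stand-in for a generated record class with two fields.
    static final class User {
        String name; // reader position 0
        int age;     // reader position 1
    }

    // Returns null when the writer's order equals the reader's order
    // (0, 1, 2, ...); otherwise returns the writer's order.
    static int[] fieldOrderIfDiff(int[] writerOrder) {
        for (int i = 0; i < writerOrder.length; i++) {
            if (writerOrder[i] != i) {
                return writerOrder;
            }
        }
        return null;
    }

    // 'in' stands in for the decoder: values arrive in the writer's order.
    static User decode(Deque<Object> in, int[] writerOrder) {
        User u = new User();
        int[] order = fieldOrderIfDiff(writerOrder);
        if (order == null) {
            // Fast path: read the fields straight through, no loop or switch.
            u.name = (String) in.remove();
            u.age = (Integer) in.remove();
        } else {
            // General path: dispatch on each field's reader position.
            for (int pos : order) {
                switch (pos) {
                    case 0: u.name = (String) in.remove(); break;
                    case 1: u.age = (Integer) in.remove(); break;
                    default: throw new IllegalStateException("bad field position " + pos);
                }
            }
        }
        return u;
    }

    public static void main(String[] args) {
        Deque<Object> in = new ArrayDeque<Object>(Arrays.<Object>asList(30, "ann"));
        User u = decode(in, new int[] {1, 0});
        System.out.println(u.name + " " + u.age);
    }
}
```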

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
