GenericData.resolveUnion() performance improvement
--------------------------------------------------

                 Key: AVRO-946
                 URL: https://issues.apache.org/jira/browse/AVRO-946
             Project: Avro
          Issue Type: Improvement
          Components: java
    Affects Versions: 1.6.0
            Reporter: Hernan Otero


Due to the sequential nature of today's implementation of 
GenericData.resolveUnion() (used when serializing an object):

{code}
  public int resolveUnion(Schema union, Object datum) {
    int i = 0;
    for (Schema type : union.getTypes()) {
      if (instanceOf(type, datum))
        return i;
      i++;
    }
    throw new UnresolvedUnionException(union, datum);
  }
{code}

it showed up as a hotspot when we were doing some serialization performance 
analysis.  A simple optimization is to keep a map from member schema to index 
within the UnionSchema object (in fact, this could be a perfect hash map, since 
the potential values in the map are known in advance).  The optimization is 
most noticeable when a union within the schema contains many types (in our 
particular use case, more than 40 in some cases).  In this scenario, we 
observed a 25% improvement by using an identity hash map.
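
As a rough illustration of the idea (not the proposed patch), the sketch below 
contrasts a sequential scan with a precomputed identity-map lookup.  TypeTag 
and UnionIndexSketch are hypothetical stand-ins, not Avro classes, and the 
sketch assumes the caller already holds the exact member instance, which is 
what an identity map requires.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a union member schema; not an Avro class.
final class TypeTag {
    final String name;
    TypeTag(String name) { this.name = name; }
}

// Hypothetical sketch: index lookup over a fixed list of union members.
final class UnionIndexSketch {
    private final List<TypeTag> types;
    private final Map<TypeTag, Integer> indexByIdentity =
        new IdentityHashMap<TypeTag, Integer>();

    UnionIndexSketch(List<TypeTag> types) {
        this.types = types;
        // Built once, when the union is created.
        for (int i = 0; i < types.size(); i++) {
            indexByIdentity.put(types.get(i), i);
        }
    }

    // Sequential scan: cost grows with the position of the match.
    int linearIndexOf(TypeTag wanted) {
        int i = 0;
        for (TypeTag type : types) {
            if (type == wanted) return i;
            i++;
        }
        return -1;
    }

    // Identity-map lookup: expected constant time regardless of union size.
    int mappedIndexOf(TypeTag wanted) {
        Integer index = indexByIdentity.get(wanted);
        return index == null ? -1 : index;
    }
}
```

Both methods return the same index for a member of the union; an instance that 
is not the same object as any member yields -1 from either method, even if it 
carries the same name.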

Even though an identity map provides a significant boost, we observed a 
further improvement (an extra 15% on top of that in some cases) by using a 
perfect hash map on the schema names, which also removes some of the 
restrictions of relying on object identity.  Unfortunately, we cannot 
contribute that implementation at this point, but we thought it would be a good 
idea to let users provide alternative implementations of the indexing behavior, 
for example by adding the following static method to Schema:

{code}
public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
{
  unionIndexCacheFactory = factory;
}
{code}

This is what the interface and identity hash map-based implementation would 
look like:

{code}
  /**
   * A factory interface for creating UnionIndexCache instances.
   */
  public static interface UnionIndexCacheFactory
  {
      UnionIndexCache createUnionIndexCache(List<Schema> types);

      /**
       * Used for caching schema indices within a union.
       */
      public static interface UnionIndexCache
      {
          void setTypeIndex(Schema schema, int index);

          int getTypeIndex(Schema schema);
      }

  }

  private static class IdentityMapUnionIndexCacheFactory
      implements UnionIndexCacheFactory
  {
      @Override
      public UnionIndexCache createUnionIndexCache(List<Schema> types)
      {
          return new UnionIndexCache()
          {
              private final IdentityHashMap<Schema, Integer> schemaToIndex =
                  new IdentityHashMap<Schema, Integer>();

              @Override
              public void setTypeIndex(Schema schema, int index)
              {
                  schemaToIndex.put(schema, index);
              }

              @Override
              public int getTypeIndex(Schema schema)
              {
                  Integer index = schemaToIndex.get(schema);
                  return index == null ? -1 : index;
              }
          };
      }
  }
{code}
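
To illustrate how an alternative implementation could plug into this 
interface, here is a minimal sketch of a cache keyed on schema full names.  It 
uses a plain HashMap, not the perfect hash map mentioned above.  NamedSchema is 
a hypothetical stand-in for Schema so that the example compiles on its own, the 
interfaces are simplified top-level copies of the ones above, and 
UnionCacheWiring is only an assumption about how a union might populate its 
cache.  The sketch also assumes full names are unique within a union.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Schema; only the full name matters here.
final class NamedSchema {
    private final String fullName;
    NamedSchema(String fullName) { this.fullName = fullName; }
    String getFullName() { return fullName; }
}

// Simplified copies of the proposed interfaces, using NamedSchema.
interface UnionIndexCache {
    void setTypeIndex(NamedSchema schema, int index);
    int getTypeIndex(NamedSchema schema);
}

interface UnionIndexCacheFactory {
    UnionIndexCache createUnionIndexCache(List<NamedSchema> types);
}

// Assumed alternative: key the cache on the full name rather than identity.
final class NameMapUnionIndexCacheFactory implements UnionIndexCacheFactory {
    @Override
    public UnionIndexCache createUnionIndexCache(List<NamedSchema> types) {
        final Map<String, Integer> nameToIndex = new HashMap<String, Integer>();
        return new UnionIndexCache() {
            @Override
            public void setTypeIndex(NamedSchema schema, int index) {
                nameToIndex.put(schema.getFullName(), index);
            }

            @Override
            public int getTypeIndex(NamedSchema schema) {
                Integer index = nameToIndex.get(schema.getFullName());
                return index == null ? -1 : index;
            }
        };
    }
}

// Hypothetical wiring: populate the cache once, when the union is built.
final class UnionCacheWiring {
    static UnionIndexCache build(UnionIndexCacheFactory factory,
                                 List<NamedSchema> types) {
        UnionIndexCache cache = factory.createUnionIndexCache(types);
        for (int i = 0; i < types.size(); i++) {
            cache.setTypeIndex(types.get(i), i);
        }
        return cache;
    }
}
```

Unlike the identity-based cache, this one returns the right index for a schema 
instance that has the same full name but is a different object, for example 
one that was parsed separately.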

I will attach a patch later today or early tomorrow.

Thanks in advance,

Hernan Otero
