[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Doug Cutting (Commented) (JIRA) Fri, 28 Oct 2011 09:43:58 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13138511#comment-13138511
 ]


Doug Cutting commented on AVRO-946:
-----------------------------------

I'd prefer this not be a global setting for all union schemas in the JVM.

A longer-term approach might be to make UnionSchema a public extensible class, 
and provide a visitor/copier API for schemas so that one can easily create a 
version of a schema replacing the implementations of some elements, like unions.

A good near-term approach might be to add this functionality to 
GenericDatumWriter.  A MultiKeyMap should provide good performance 
(http://s.apache.org/c1J).  Note that one can override the hash function used 
by MultiKeyMap to make it identity or even a perfect hash.  (For a given 
GenericDatumWriter the schema is fixed so all unions contained in it can be 
enumerated.)

To be more concrete, instead of calling GenericData.resolveUnion(), 
GenericDatumWriter could have its own version of resolveUnion() that uses a 
MultiKeyMap cache indexed by the union Schema and the value's type.  On misses, 
the cache can be populated by calling GenericData.resolveUnion().  The hash 
function can be overridden to be identity for both keys.
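
The cache described above can be sketched roughly as follows.  This is only an 
illustration under stated assumptions, not Avro code: it swaps MultiKeyMap for 
nested JDK IdentityHashMaps (which gives identity hashing on both keys without 
an extra dependency), the type parameter U stands in for the union Schema, and 
the names WriterUnionCache, typeKey and slowResolver are hypothetical.  The 
slowResolver is where a call to GenericData.resolveUnion() would go.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToIntBiFunction;

/**
 * Hypothetical per-writer cache of union branch indices.
 * U stands in for the union Schema; both keys are compared by identity.
 */
final class WriterUnionCache<U> {
  // union -> (type key of the datum -> branch index within that union)
  private final Map<U, Map<Object, Integer>> cache = new IdentityHashMap<>();
  // Assumption: maps a datum to the key that determines its branch.
  private final Function<Object, Object> typeKey;
  // Assumption: the sequential lookup used on a miss.
  private final ToIntBiFunction<U, Object> slowResolver;

  WriterUnionCache(Function<Object, Object> typeKey,
                   ToIntBiFunction<U, Object> slowResolver) {
    this.typeKey = typeKey;
    this.slowResolver = slowResolver;
  }

  int resolveUnion(U union, Object datum) {
    Map<Object, Integer> byType =
        cache.computeIfAbsent(union, u -> new IdentityHashMap<>());
    Object key = typeKey.apply(datum);
    Integer index = byType.get(key);
    if (index == null) {
      // Miss: fall back to the sequential scan and remember the answer.
      index = slowResolver.applyAsInt(union, datum);
      byType.put(key, index);
    }
    return index;
  }
}
```

The choice of typeKey matters: the datum's Java class is only a safe key when 
the class alone determines the branch.  For generic records, which can share 
one Java class across different schemas, the key would presumably need to be 
the record's schema instead.  The sketch is also not thread-safe, so it 
assumes one cache per writer used from a single thread.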

Might something like this work for you?  I think it should perform similarly to 
directly storing the cache in the union Schema.
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of 
> GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A 
> simple optimization can be implemented by keeping a map within the 
> UnionSchema object (in fact, this could actually be a perfect hash map given 
> the potential values in the map are known in advance).  The optimization is 
> obviously most notable when a Union within the schema contains many types (in 
> our particular use case, more than 40 in some cases).  In this scenario, we 
> observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have 
> observed an even further improvement (and removed some of the restrictions of 
> relying on object identity) by using a perfect hash map on the schema names 
> (an extra 15% on top of that in some cases).  This implementation, 
> unfortunately, is not something we could contribute at this point, but we 
> thought it'd be a good idea to allow users to provide alternative 
> implementations of the indexing behavior, such as adding the following static 
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would 
> look like:
> {code}
>   /**
>    * A factory interface for creating UnionIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex =
>                   new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero
