[ 
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Cutting updated AVRO-946:
------------------------------

    Attachment: AVRO-946.patch

It would certainly be nice to cache things directly in the union schema.  I 
don't think we ought to have representation-specific stuff in Schema, but 
perhaps we can add a representation-independent random-access table to each 
union schema.

The value of Schema.getFullName() for each branch in a union is unique within 
that union.  A Map<String,Schema> in each union might be useful.  Here's a 
patch that adds such a thing and uses it to implement resolveUnion().  Might 
this work?
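
As a rough illustration of the idea (this is not the attached patch), a 
name-keyed index for a union could look like the sketch below.  The class 
UnionNameIndex and its methods are hypothetical, and branch full names are 
passed in as plain strings instead of Avro Schema objects so that the example 
stands alone.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for the per-union lookup table; not Avro's Schema API.
 * Maps each branch's full name to its position in the union.
 */
final class UnionNameIndex {
    private final Map<String, Integer> indexByName = new HashMap<String, Integer>();

    UnionNameIndex(List<String> branchFullNames) {
        for (int i = 0; i < branchFullNames.size(); i++) {
            String name = branchFullNames.get(i);
            // Full names are assumed unique within a union; reject duplicates.
            if (indexByName.put(name, i) != null) {
                throw new IllegalArgumentException("Duplicate branch name: " + name);
            }
        }
    }

    /** Returns the branch position for the given full name, or -1 if absent. */
    int indexOf(String fullName) {
        Integer index = indexByName.get(fullName);
        return index == null ? -1 : index;
    }
}
```

A resolver built on this would still need a way to derive a full name from a 
datum, which depends on the data representation and is left out here.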
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch, AVRO-946.patch
>
>
> GenericData.resolveUnion(), which is used when serializing an object, checks 
> each branch of the union sequentially:
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> This showed up when we were doing some serialization performance analysis.  A 
> simple optimization can be implemented by keeping a map within the 
> UnionSchema object (in fact, this could actually be a perfect hash map given 
> the potential values in the map are known in advance).  The optimization is 
> obviously most notable when a Union within the schema contains many types (in 
> our particular use case, more than 40 in some cases).  In this scenario, we 
> observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have 
> observed an even further improvement (and removed some of the restrictions of 
> relying on object identity) by using a perfect hash map on the schema names 
> (an extra 15% on top of that in some cases).  This implementation, 
> unfortunately, is not something we could contribute at this point, but we 
> thought it'd be a good idea to allow users to provide alternative 
> implementations of the indexing behavior, such as adding the following static 
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would 
> look like:
> {code}
>   /**
>    * A factory interface for creating UnionIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero
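
For illustration only, here is one way an identity-keyed cache like the one 
quoted above might be consulted during union resolution, falling back to the 
sequential scan on a miss.  IdentityUnionResolver, its type parameter S (a 
stand-in for a branch schema) and the Predicate-based fallback are all 
assumptions made for this sketch and are not part of Avro.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch; S stands in for a branch schema type. */
final class IdentityUnionResolver<S> {
    private final List<S> branches;
    private final Map<S, Integer> indexBySchema = new IdentityHashMap<S, Integer>();

    IdentityUnionResolver(List<S> branches) {
        this.branches = branches;
        for (int i = 0; i < branches.size(); i++) {
            indexBySchema.put(branches.get(i), i);
        }
    }

    /**
     * @param datumSchema the schema instance the datum carries, or null if unknown
     * @param matches     fallback test applied to each branch in order
     * @return the branch position, or -1 if no branch matches
     */
    int resolve(S datumSchema, Predicate<S> matches) {
        if (datumSchema != null) {
            // Fast path: hits only when the very same object is a branch.
            Integer cached = indexBySchema.get(datumSchema);
            if (cached != null) {
                return cached;
            }
        }
        // Slow path: the original sequential scan.
        for (int i = 0; i < branches.size(); i++) {
            if (matches.test(branches.get(i))) {
                return i;
            }
        }
        return -1;
    }
}
```

The fast path only helps when the datum carries the exact schema instance 
stored in the union; an equal but distinct instance misses the cache, which is 
the identity restriction the description mentions.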

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
