[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Hernan Otero (Commented) (JIRA) Mon, 31 Oct 2011 12:41:55 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140484#comment-13140484
 ]


Hernan Otero commented on AVRO-946:
-----------------------------------

I've implemented your proposed solution using the following data structure:

{code}IdentityHashMap<Schema, WeakHashMap<Schema, Integer>>{code}

The first (identity) map's key being the Union itself, the second's (weak) 
being the datum's schema.

As a side note, I did try implementing this using MultiKeyMap delegating to a 
ReferenceMap, but was unable to override equalsKey() on the MultiKeyMap due to 
package accessibility constraints for AbstractHashedMap.HashEntry (it may well 
be I misunderstood MultiKeyMap's extensibility features as I don't have much 
experience with it, please feel free to suggest an alternative data structure).

I implemented the feature by adding 3 protected methods (for creating, caching 
and looking up the entries within the cache), so subclasses of 
GenericDatumWriter should be able to provide alternative implementations if 
required:

{code}
protected void createUnionIndexCacheMap();
protected int getCachedUnionIndex(Schema union, Schema datumSchema);
protected void putCachedUnionIndex(Schema union, Schema datumSchema, int index);
{code}

Here are some stats I collected.  The test uses a single record with 3 fields 
of the same union type.  This union type has 40 different types to choose from. 
 "best/worst" describe scenarios from the point of view of today's (sequential) 
implementation:

* best.  Fields are set with the first type within the union.
* worst.  Fields are set with the last type within the union.

These are the results I obtained:

{code}
-----------------------------------------------------
| Scenario |   Time    |       Rate         | Cache |
-----------------------------------------------------
| best     | 11.1 secs |  180,685 loops/sec | true  |
| best     | 12.1 secs |  164,935 loops/sec | false |
| worst    | 11.2 secs |  178,364 loops/sec | true  |
| worst    | 19.4 secs |  102,912 loops/sec | false |
-----------------------------------------------------
{code}

As expected, using the cache shows little difference between the two scenarios 
(but manages to be faster even in the best-case scenario!), whereas today's 
(sequential) implementation is significantly slower in the worst case scenario.

I will submit the patch in a few minutes.

Thanks,

Hernan
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of 
> GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A 
> simple optimization can be implemented by keeping a map within the 
> UnionSchema object (in fact, this could actually be a perfect hash map given 
> the potential values in the map are known in advance).  The optimization is 
> obviously most notable when a Union within the schema contains many types (in 
> our particular use case, more than 40 in some cases).  In this scenario, we 
> observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have 
> observed an even further improvement (and removed some of the restrictions of 
> relying on object identity) by using a perfect hash map on the schema names 
> (an extra 15% on top of that in some cases).  This implementation, 
> unfortunately, is not something we could contribute at this point, but we 
> thought it'd be a good idea to allow users to provide alternative 
> implementations of the indexing behavior, such as adding the following static 
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory 
> factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would 
> look like:
> {code}
>   /**
>    * A factory interface for creating UnionTypeIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements 
> UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex = 
> new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

Reply via email to