[
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140484#comment-13140484
]
Hernan Otero commented on AVRO-946:
-----------------------------------
I've implemented your proposed solution using the following data structure:
{code}IdentityHashMap<Schema, WeakHashMap<Schema, Integer>>{code}
The first (identity) map's key being the Union itself, the second's (weak)
being the datum's schema.
As a side note, I did try implementing this using MultiKeyMap delegating to a
ReferenceMap, but was unable to override equalsKey() on the MultiKeyMap due to
package accessibility constraints for AbstractHashedMap.HashEntry (it may well
be I misunderstood MultiKeyMap's extensibility features as I don't have much
experience with it, please feel free to suggest an alternative data structure).
I implemented the feature by adding 3 protected methods (for creating, caching
and looking up the entries within the cache), so subclasses of
GenericDatumWriter should be able to provide alternative implementations if
required:
{code}
protected void createUnionIndexCacheMap();
protected int getCachedUnionIndex(Schema union, Schema datumSchema);
protected void putCachedUnionIndex(Schema union, Schema datumSchema, int index);
{code}
Here are some stats I collected. The test uses a single record with 3 fields
of the same union type. This union type has 40 different types to choose from.
"best/worst" describe scenarios from the point of view of today's (sequential)
implementation:
* best. Fields are set with the first type within the union.
* worst. Fields are set with the last type within the union.
These are the results I obtained:
{code}
-----------------------------------------------------
| Scenario | Time | Rate | Cache |
-----------------------------------------------------
| best | 11.1 secs | 180,685 loops/sec | true |
| best | 12.1 secs | 164,935 loops/sec | false |
| worst | 11.2 secs | 178,364 loops/sec | true |
| worst | 19.4 secs | 102,912 loops/sec | false |
-----------------------------------------------------
{code}
As expected, using the cache shows little difference between the two scenarios
(but manages to be faster even in the best-case scenario!), whereas today's
(sequential) implementation is significantly slower in the worst case scenario.
I will submit the patch in a few minutes.
Thanks,
Hernan
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
> Key: AVRO-946
> URL: https://issues.apache.org/jira/browse/AVRO-946
> Project: Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.6.0
> Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of
> GenericData.resolveUnion() (used when serializing an object):
> {code}
> public int resolveUnion(Schema union, Object datum) {
> int i = 0;
> for (Schema type : union.getTypes()) {
> if (instanceOf(type, datum))
> return i;
> i++;
> }
> throw new UnresolvedUnionException(union, datum);
> }
> {code}
> it showed up when we were doing some serialization performance analysis. A
> simple optimization can be implemented by keeping a map within the
> UnionSchema object (in fact, this could actually be a perfect hash map given
> the potential values in the map are known in advance). The optimization is
> obviously most notable when a Union within the schema contains many types (in
> our particular use case, more than 40 in some cases). In this scenario, we
> observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have
> observed an even further improvement (and removed some of the restrictions of
> relying on object identity) by using a perfect hash map on the schema names
> (an extra 15% on top of that in some cases). This implementation,
> unfortunately, is not something we could contribute at this point, but we
> thought it'd be a good idea to allow users to provide alternative
> implementations of the indexing behavior, such as adding the following static
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory
> factory)
> {
> unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would
> look like:
> {code}
> /**
> * A factory interface for creating UnionTypeIndexCache instances.
> */
> public static interface UnionIndexCacheFactory
> {
> UnionIndexCache createUnionIndexCache(List<Schema> types);
> /**
> * Used for caching schema indices within a union.
> */
> public static interface UnionIndexCache
> {
> void setTypeIndex(Schema schema, int index);
> int getTypeIndex(Schema schema);
> }
> }
> private static class IdentityMapUnionIndexCacheFactory implements
> UnionIndexCacheFactory
> {
> @Override
> public UnionIndexCache createUnionIndexCache(List<Schema> types)
> {
> return new UnionIndexCache()
> {
> private final IdentityHashMap<Schema, Integer> schemaToIndex =
> new IdentityHashMap<Schema, Integer>();
> @Override
> public void setTypeIndex(Schema schema, int index)
> {
> schemaToIndex.put(schema, index);
> }
> @Override
> public int getTypeIndex(Schema schema)
> {
> Integer index = schemaToIndex.get(schema);
> return index == null ? -1 : index;
> }
> };
> }
> }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira