[
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hernan Otero updated AVRO-946:
------------------------------
Attachment: AVRO-946.patch
Patch with suggested implementation.
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
> Key: AVRO-946
> URL: https://issues.apache.org/jira/browse/AVRO-946
> Project: Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.6.0
> Reporter: Hernan Otero
> Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of
> GenericData.resolveUnion() (used when serializing an object):
> {code}
> public int resolveUnion(Schema union, Object datum) {
> int i = 0;
> for (Schema type : union.getTypes()) {
> if (instanceOf(type, datum))
> return i;
> i++;
> }
> throw new UnresolvedUnionException(union, datum);
> }
> {code}
> it showed up when we were doing some serialization performance analysis. A
> simple optimization can be implemented by keeping a map within the
> UnionSchema object (in fact, this could actually be a perfect hash map given
> the potential values in the map are known in advance). The optimization is
> obviously most notable when a Union within the schema contains many types (in
> our particular use case, more than 40 in some cases). In this scenario, we
> observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have
> observed an even further improvement (and removed some of the restrictions of
> relying on object identity) by using a perfect hash map on the schema names
> (an extra 15% on top of that in some cases). This implementation,
> unfortunately, is not something we could contribute at this point, but we
> thought it'd be a good idea to allow users to provide alternative
> implementations of the indexing behavior, such as adding the following static
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory
> factory)
> {
> unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would
> look like:
> {code}
> /**
> * A factory interface for creating UnionTypeIndexCache instances.
> */
> public static interface UnionIndexCacheFactory
> {
> UnionIndexCache createUnionIndexCache(List<Schema> types);
> /**
> * Used for caching schema indices within a union.
> */
> public static interface UnionIndexCache
> {
> void setTypeIndex(Schema schema, int index);
> int getTypeIndex(Schema schema);
> }
> }
> private static class IdentityMapUnionIndexCacheFactory implements
> UnionIndexCacheFactory
> {
> @Override
> public UnionIndexCache createUnionIndexCache(List<Schema> types)
> {
> return new UnionIndexCache()
> {
> private final IdentityHashMap<Schema, Integer> schemaToIndex =
> new IdentityHashMap<Schema, Integer>();
> @Override
> public void setTypeIndex(Schema schema, int index)
> {
> schemaToIndex.put(schema, index);
> }
> @Override
> public int getTypeIndex(Schema schema)
> {
> Integer index = schemaToIndex.get(schema);
> return index == null ? -1 : index;
> }
> };
> }
> }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira