GenericData.resolveUnion() performance improvement
--------------------------------------------------
Key: AVRO-946
URL: https://issues.apache.org/jira/browse/AVRO-946
Project: Avro
Issue Type: Improvement
Components: java
Affects Versions: 1.6.0
Reporter: Hernan Otero
GenericData.resolveUnion() (used when serializing an object) checks the union's
types sequentially, so its cost grows with the number of types in the union:
{code}
public int resolveUnion(Schema union, Object datum) {
  int i = 0;
  for (Schema type : union.getTypes()) {
    if (instanceOf(type, datum))
      return i;
    i++;
  }
  throw new UnresolvedUnionException(union, datum);
}
{code}
This showed up when we were doing some serialization performance analysis. A
simple optimization is to keep a map from type to index within the UnionSchema
object (in fact, this could be a perfect hash map, given that the potential
values in the map are known in advance). The optimization is most noticeable
when a union within the schema contains many types (in our particular use case,
more than 40 in some cases). In this scenario, we observed a 25% improvement by
using an identity hash map.
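To illustrate the difference between the two lookup strategies, here is a minimal, self-contained sketch. It does not use Avro: TypeStub is a hypothetical stand-in for a schema object, and the class and method names (UnionIndexSketch, linearIndexOf, buildIndex, indexedIndexOf) are made up for this example. It only shows the idea of building an identity-keyed index once and querying it afterwards, and is not the patch itself.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class UnionIndexSketch {
    /** Hypothetical stand-in for a schema object; this is NOT Avro's Schema class. */
    public static final class TypeStub {
        public final String name;
        public TypeStub(String name) { this.name = name; }
    }

    /** Sequential lookup: walks the branches until it finds the wanted one. */
    public static int linearIndexOf(List<TypeStub> types, TypeStub wanted) {
        int i = 0;
        for (TypeStub type : types) {
            if (type == wanted)
                return i;
            i++;
        }
        return -1;
    }

    /** Builds the identity-keyed index once, e.g. when the union is created. */
    public static Map<TypeStub, Integer> buildIndex(List<TypeStub> types) {
        Map<TypeStub, Integer> index = new IdentityHashMap<>();
        for (int i = 0; i < types.size(); i++)
            index.put(types.get(i), i);
        return index;
    }

    /** Indexed lookup: a single map access regardless of the number of branches. */
    public static int indexedIndexOf(Map<TypeStub, Integer> index, TypeStub wanted) {
        Integer i = index.get(wanted);
        return i == null ? -1 : i;
    }

    public static void main(String[] args) {
        // A union with 40 branches, similar in size to the use case described above.
        List<TypeStub> types = new ArrayList<>();
        for (int i = 0; i < 40; i++)
            types.add(new TypeStub("T" + i));
        Map<TypeStub, Integer> index = buildIndex(types);

        TypeStub last = types.get(39);
        System.out.println(linearIndexOf(types, last));
        System.out.println(indexedIndexOf(index, last));
    }
}
```

Note that an identity-keyed index only finds the exact instances it was built from; an equal-looking but distinct object is not found, which is the restriction on relying on object identity.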
Even though using an identity map provides a significant boost, we have
observed a further improvement (and removed some of the restrictions of
relying on object identity) by using a perfect hash map on the schema names (an
extra 15% on top of that in some cases). This implementation, unfortunately,
is not something we can contribute at this point, but we thought it would be a
good idea to allow users to provide alternative implementations of the indexing
behavior, for example by adding the following static method to Schema:
{code}
public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
{
  unionIndexCacheFactory = factory;
}
{code}
This is what the interface and identity hash map-based implementation would
look like:
{code}
/**
 * A factory interface for creating UnionIndexCache instances.
 */
public static interface UnionIndexCacheFactory
{
  UnionIndexCache createUnionIndexCache(List<Schema> types);

  /**
   * Used for caching schema indices within a union.
   */
  public static interface UnionIndexCache
  {
    void setTypeIndex(Schema schema, int index);
    int getTypeIndex(Schema schema);
  }
}

private static class IdentityMapUnionIndexCacheFactory implements
    UnionIndexCacheFactory
{
  @Override
  public UnionIndexCache createUnionIndexCache(List<Schema> types)
  {
    return new UnionIndexCache()
    {
      // Keyed on object identity, so only the exact Schema instances
      // registered via setTypeIndex() are found.
      private final IdentityHashMap<Schema, Integer> schemaToIndex =
          new IdentityHashMap<Schema, Integer>();

      @Override
      public void setTypeIndex(Schema schema, int index)
      {
        schemaToIndex.put(schema, index);
      }

      @Override
      public int getTypeIndex(Schema schema)
      {
        // -1 signals that the schema is not a known branch of the union.
        Integer index = schemaToIndex.get(schema);
        return index == null ? -1 : index;
      }
    };
  }
}
{code}
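As a rough illustration of why a pluggable factory is useful, the sketch below shows an alternative cache keyed on type names instead of object identity. It is self-contained and does not depend on Avro: NamedType is a hypothetical stand-in for a named schema, the two interfaces are simplified copies of the ones proposed above, and NameKeyedFactory and indexUnion are names invented for this example. An ordinary HashMap is used here; it is not the perfect hash map implementation mentioned earlier, which is not part of this ticket.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameKeyedUnionIndexSketch {
    /** Hypothetical stand-in for a named schema; this is NOT Avro's Schema class. */
    public static final class NamedType {
        private final String fullName;
        public NamedType(String fullName) { this.fullName = fullName; }
        public String getFullName() { return fullName; }
    }

    /** Simplified copy of the proposed cache interface, over the stand-in type. */
    public interface UnionIndexCache {
        void setTypeIndex(NamedType type, int index);
        int getTypeIndex(NamedType type);
    }

    /** Simplified copy of the proposed factory interface. */
    public interface UnionIndexCacheFactory {
        UnionIndexCache createUnionIndexCache(List<NamedType> types);
    }

    /** Keys on the full name, so distinct instances with the same name resolve alike. */
    public static final class NameKeyedFactory implements UnionIndexCacheFactory {
        @Override
        public UnionIndexCache createUnionIndexCache(List<NamedType> types) {
            final Map<String, Integer> nameToIndex = new HashMap<>();
            return new UnionIndexCache() {
                @Override
                public void setTypeIndex(NamedType type, int index) {
                    nameToIndex.put(type.getFullName(), index);
                }

                @Override
                public int getTypeIndex(NamedType type) {
                    Integer index = nameToIndex.get(type.getFullName());
                    return index == null ? -1 : index;
                }
            };
        }
    }

    /** Mimics what a union might do when it is constructed: register each branch once. */
    public static UnionIndexCache indexUnion(UnionIndexCacheFactory factory,
                                             List<NamedType> types) {
        UnionIndexCache cache = factory.createUnionIndexCache(types);
        for (int i = 0; i < types.size(); i++)
            cache.setTypeIndex(types.get(i), i);
        return cache;
    }

    public static void main(String[] args) {
        List<NamedType> types = Arrays.asList(
            new NamedType("example.Order"), new NamedType("example.Trade"));
        UnionIndexCache cache = indexUnion(new NameKeyedFactory(), types);

        // A fresh instance with a registered name is found; an unknown name is not.
        System.out.println(cache.getTypeIndex(new NamedType("example.Trade")));
        System.out.println(cache.getTypeIndex(new NamedType("example.Quote")));
    }
}
```

Because the lookup goes through the interface, the identity-based and name-based strategies can be swapped without touching the calling code, which is the point of exposing the factory.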
I will attach a patch later today or early tomorrow.
Thanks in advance,
Hernan Otero