[
https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060937#comment-13060937
]
James Baldassari commented on AVRO-853:
---------------------------------------
I've recently run into this hash code performance issue as well. Scott, there
may be a way to work around the issue of propagating changes up the reference
graph. We would have to calculate hash codes by computing the hash code for
any "local" state, which could be cached, and then adding to that local hash
code the hash codes of all child objects, which could also be cached
independently. For example, a RecordSchema has 'aliases', 'doc', 'name',
'props', and 'isError', for its local state, and the hash code for those values
could be cached. Then to calculate the rest of the RecordSchema hash code we
invoke hashCode() on all Field instances in the RecordSchema. Each Field
instance knows whether it has been modified or not, and so it can either return
a cached hash code or recalculate the hash code for its local state, and so on.
So we would still need to traverse the whole schema graph every time the hash
code is requested, but we might achieve some performance gains by caching
values at each node in the graph. Does that make sense?
Also, I took a look at the patch, and I have a couple of comments in addition
to Doug's and Scott's:
Is it safe to assume that a calculated hash code will never be 0? Maybe null
would be a safer choice for the default/invalidated value.
Thread-safety is actually an issue here. Since addProp() and getProp() are
synchronized we have to assume that Schema is intended to be used by multiple
threads. The worst-case scenario for the unsynchronized hash code cache is
worse than just having 2 threads calculate the hash code at once. If one
thread is modifying the schema while another is calling hashCode(), it could
result in a temporary inconsistency between hashCode() and equals(). This
could lead to some confusing problems with hash maps/sets. If the caching is
abstracted up into the Schema base class as Doug suggests, it would be fairly
simple to synchronize access to the cache. This could be done with a
synchronized method/block or by using something like a ReadWriteLock, which
would probably have better performance characteristics for a
read-frequently-write-infrequently use case such as this. In fact, there may
already be a synchronization issue with Schema because the properties map is
not accessed in a synchronized way in equals() or hashCode()...
> Cache hash codes in Schema and Field
> ------------------------------------
>
> Key: AVRO-853
> URL: https://issues.apache.org/jira/browse/AVRO-853
> Project: Avro
> Issue Type: Improvement
> Components: java
> Affects Versions: 1.5.1
> Reporter: Douglas Kaminsky
> Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to
> store/retrieve fields and schemas in hash-based data structures (eg.
> HashMap). Since all fields and schemas are immutable (with the exception of
> RecordSchema allowing deferred setting of Fields) it makes sense to cache the
> hash code on the object instead of recalculating every time the hashCode
> method gets called.
> (Are there other mutable Schema sub-types that I'm not thinking about?)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira