[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

James Baldassari (JIRA) Wed, 06 Jul 2011 17:07:43 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060937#comment-13060937
 ]


James Baldassari commented on AVRO-853:
---------------------------------------

I've recently run into this hash code performance issue as well.  Scott, there 
may be a way to work around the issue of propagating changes up the reference 
graph.  We would have to calculate hash codes by computing the hash code for 
any "local" state, which could be cached, and then adding to that local hash 
code the hash codes of all child objects, which could also be cached 
independently.  For example, a RecordSchema has 'aliases', 'doc', 'name', 
'props', and 'isError', for its local state, and the hash code for those values 
could be cached.  Then to calculate the rest of the RecordSchema hash code we 
invoke hashCode() on all Field instances in the RecordSchema.  Each Field 
instance knows whether it has been modified or not, and so it can either return 
a cached hash code or recalculate the hash code for its local state, and so on. 
 So we would still need to traverse the whole schema graph every time the hash 
code is requested, but we might achieve some performance gains by caching 
values at each node in the graph.  Does that make sense?
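
A minimal sketch of the per-node caching described above. SketchField and 
SketchRecord are hypothetical stand-ins, not Avro's actual Field and 
RecordSchema classes, and only a couple of the "local" attributes are 
modelled.  The sketch combines child hashes with the usual 31 * h + child 
pattern rather than plain addition, and it ignores recursive schemas and 
thread-safety entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for Avro's Field; not the real class.
class SketchField {
    private final String name;
    private String doc;
    private int localHash;         // cached hash of this node's own state
    private boolean dirty = true;  // true until the cache is (re)computed

    SketchField(String name, String doc) {
        this.name = name;
        this.doc = doc;
    }

    void setDoc(String doc) {
        this.doc = doc;
        dirty = true;              // invalidate this node only; parents are untouched
    }

    @Override
    public int hashCode() {
        if (dirty) {
            localHash = Objects.hash(name, doc);
            dirty = false;
        }
        return localHash;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchField)) return false;
        SketchField other = (SketchField) o;
        return name.equals(other.name) && Objects.equals(doc, other.doc);
    }
}

// Hypothetical stand-in for RecordSchema.
class SketchRecord {
    private final String name;
    private final boolean isError;
    private final List<SketchField> fields = new ArrayList<>();
    private int localHash;
    private boolean localCached = false;

    SketchRecord(String name, boolean isError) {
        this.name = name;
        this.isError = isError;
    }

    void addField(SketchField f) {
        fields.add(f);             // children are re-read on every hashCode() call
    }

    @Override
    public int hashCode() {
        if (!localCached) {        // local state is hashed once
            localHash = Objects.hash(name, isError);
            localCached = true;
        }
        int h = localHash;
        for (SketchField f : fields) {
            h = 31 * h + f.hashCode();  // each child answers from its own cache
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchRecord)) return false;
        SketchRecord other = (SketchRecord) o;
        return name.equals(other.name) && isError == other.isError
                && fields.equals(other.fields);
    }
}
```

The record never needs to be told that a field changed: the traversal still 
visits every child, but an unmodified child returns in constant time.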

Also, I took a look at the patch, and I have a couple of comments in addition 
to Doug's and Scott's:

Is it safe to assume that a calculated hash code will never be 0?  Maybe null 
(with the cache held in an Integer rather than an int) would be a safer choice 
for the default/invalidated value.
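
To illustrate the concern, here is a small sketch assuming the cache lives in 
a base class.  CachedHash and EmptyNameNode are hypothetical names, not part 
of Avro or of the attached patch.  The cache is a boxed Integer so that null, 
rather than 0, marks "not computed".  The demo subclass hashes an empty 
string, whose hash code really is 0, and counts how often the computation 
runs.  The sketch is not thread-safe.

```java
// Hypothetical helper; not part of Avro.
abstract class CachedHash {
    private Integer cached;        // null means "not computed" or "invalidated"

    protected abstract int computeHash();

    protected void invalidate() {
        cached = null;
    }

    @Override
    public int hashCode() {
        Integer h = cached;
        if (h == null) {           // a legitimate hash of 0 is still a cache hit
            h = computeHash();
            cached = h;
        }
        return h;
    }
}

// Demo node whose hash is 0, because "".hashCode() == 0.
class EmptyNameNode extends CachedHash {
    int computations = 0;          // how many times the hash was recomputed

    @Override
    protected int computeHash() {
        computations++;
        return "".hashCode();
    }
}
```

With an int field and 0 as the sentinel, EmptyNameNode would recompute its 
hash on every call; with the null sentinel it computes once until 
invalidate() is called.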

Thread-safety is actually an issue here.  Since addProp() and getProp() are 
synchronized we have to assume that Schema is intended to be used by multiple 
threads.  The worst-case scenario for the unsynchronized hash code cache is 
worse than just having 2 threads calculate the hash code at once.  If one 
thread is modifying the schema while another is calling hashCode(), it could 
result in a temporary inconsistency between hashCode() and equals().  This 
could lead to some confusing problems with hash maps/sets.  If the caching is 
abstracted up into the Schema base class as Doug suggests, it would be fairly 
simple to synchronize access to the cache.  This could be done with a 
synchronized method/block or by using something like a ReadWriteLock, which 
would probably have better performance characteristics for a 
read-frequently-write-infrequently use case such as this.  In fact, there may 
already be a synchronization issue with Schema because the properties map is 
not accessed in a synchronized way in equals() or hashCode()...
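
A rough sketch of the ReadWriteLock variant, assuming the properties map and 
the cached hash are guarded by the same lock.  LockedProps is a hypothetical 
class modelling only the property map, not Avro's Schema.  Because 
ReentrantReadWriteLock does not support upgrading a read lock to a write 
lock, the cache-miss path releases the read lock and re-checks under the 
write lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class; models only the props map of a schema.
class LockedProps {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> props = new HashMap<>();
    private Integer cachedHash;    // guarded by lock; null means invalid

    void addProp(String key, String value) {
        lock.writeLock().lock();
        try {
            props.put(key, value);
            cachedHash = null;     // mutation and invalidation happen atomically
        } finally {
            lock.writeLock().unlock();
        }
    }

    String getProp(String key) {
        lock.readLock().lock();
        try {
            return props.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Copy taken under the read lock so equals() never holds two locks at once.
    private Map<String, String> snapshot() {
        lock.readLock().lock();
        try {
            return new HashMap<>(props);
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public int hashCode() {
        lock.readLock().lock();    // fast path: many readers may share this lock
        try {
            if (cachedHash != null) return cachedHash;
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();   // slow path: recompute under the write lock
        try {
            if (cachedHash == null) cachedHash = props.hashCode();
            return cachedHash;
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LockedProps)) return false;
        return snapshot().equals(((LockedProps) o).snapshot());
    }
}
```

Whether this beats a plain synchronized block depends on contention; for a 
value as cheap to read as a cached int, measuring both would be worthwhile.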


> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to 
> store/retrieve fields and schemas in hash-based data structures (e.g. 
> HashMap). Since all fields and schemas are immutable (with the exception of 
> RecordSchema allowing deferred setting of Fields) it makes sense to cache the 
> hash code on the object instead of recalculating every time the hashCode 
> method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
