[jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field

Scott Carey (JIRA) Thu, 07 Jul 2011 09:55:42 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061438#comment-13061438
 ]


Scott Carey commented on AVRO-853:
----------------------------------

@Douglas

Good point. We can simplify the hash code functions on complicated members like 
Props and Aliases. We can either ignore props, or use only a portion that is 
cheap to compute: the size.
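
For illustration, here is a minimal sketch of a cached hash code in which props contribute only their size. The class CachedHashNode and its fields are hypothetical stand-ins, not Avro's actual Schema or Field classes, and equals() here still compares props in full, which is an assumption made only to show that the cheaper hash stays consistent with it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an immutable schema node; not Avro's Schema class.
final class CachedHashNode {
    private final String name;
    private final String type;
    private final Map<String, String> props;
    private int hash; // 0 means "not computed yet"

    CachedHashNode(String name, String type, Map<String, String> props) {
        this.name = name;
        this.type = type;
        this.props = new LinkedHashMap<>(props); // defensive copy keeps the node immutable
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashNode)) return false;
        CachedHashNode that = (CachedHashNode) o;
        return name.equals(that.name) && type.equals(that.type) && props.equals(that.props);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            // Only props.size() is mixed in: cheap to compute, and consistent
            // with equals() because equal maps always have equal sizes.
            h = 31 * (31 * name.hashCode() + type.hashCode()) + props.size();
            hash = h; // a computed value of 0 is simply recomputed on the next call
        }
        return h;
    }
}
```

Two nodes whose props differ only in values then share a hash code but remain unequal, which the hashCode contract allows.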

How should equals with Aliases work?

Are the three schemas below equivalent?

{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], 
"aliases":["foo2"]}
C: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}], 
"aliases":["foo"]}
{code}

Keep in mind that equals must be transitive (A == B and B == C implies A == C) 
and symmetric (C.equals(A) must be true if A.equals(C) is true).
In the above, aliases allow A == B == C.

But this represents a problem for other cases:
{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], 
"aliases":["foo2"]}
C2: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}]}
{code}

Aliases allow A == B and B == C2, but A != C2, which breaks transitivity.  
Therefore, we can use aliases in equality in only two ways:
# not at all
# exact match only

This means we must choose one of:
* Ignore aliases: A == B, B != C, A != C
* Exact match only: A != B, B != C, A != C
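
The A, B, C2 example can be checked with a small sketch. NamedRecord and its three comparison methods are hypothetical and model only the name and aliases, since the fields are identical in every schema above; aliasMatch is one assumed reading of "aliases allow X == Y", not Avro's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a record schema reduced to its name and aliases.
final class NamedRecord {
    final String name;
    final Set<String> aliases;

    NamedRecord(String name, String... aliases) {
        this.name = name;
        this.aliases = new HashSet<>(Arrays.asList(aliases));
    }

    // Assumed alias-aware match: same name, or either side lists the
    // other's name as an alias. Symmetric, but not transitive.
    static boolean aliasMatch(NamedRecord x, NamedRecord y) {
        return x.name.equals(y.name)
            || x.aliases.contains(y.name)
            || y.aliases.contains(x.name);
    }

    // Option 1: aliases play no part in equality.
    static boolean ignoreAliases(NamedRecord x, NamedRecord y) {
        return x.name.equals(y.name);
    }

    // Option 2: names and alias sets must match exactly.
    static boolean exactMatch(NamedRecord x, NamedRecord y) {
        return x.name.equals(y.name) && x.aliases.equals(y.aliases);
    }
}
```

With A = foo, B = foo aliased to foo2, and C2 = foo2, aliasMatch accepts (A, B) and (B, C2) but rejects (A, C2), whereas ignoreAliases and exactMatch are both transitive by construction.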

I vote for ignoring Aliases in equality checks, as we currently do, and having 
a separate method, distinct from equals, that checks whether one schema can be 
transformed to another using aliases ("alias promotion").

This is an asymmetric process that does not have the transitive property:  A 
promotesTo B, B promotesTo C2, A !promotesTo C2.
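
A directed check of this kind might be sketched as below. PromotableRecord and promotesTo are hypothetical names, and the direction of the rule (the source's aliases must contain the target's name) is an assumption chosen only because it reproduces the three outcomes stated above; it is not taken from Avro's schema resolution code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a record schema reduced to its name and aliases.
final class PromotableRecord {
    final String name;
    final Set<String> aliases;

    PromotableRecord(String name, String... aliases) {
        this.name = name;
        this.aliases = new HashSet<>(Arrays.asList(aliases));
    }

    // Assumed rule: this schema promotes to the target when the names match
    // or when this schema lists the target's name among its aliases.
    boolean promotesTo(PromotableRecord target) {
        return name.equals(target.name) || aliases.contains(target.name);
    }
}
```

Under this rule B promotesTo C2 but C2 does not promote to B, so the relation cannot serve as equals().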


I also suspect that we should remove props from equals().  I think those behave 
similarly to aliases.

Are the three schemas below different?  Should they differ across languages or 
do they represent different data?
{code}
A: {"type":"array", "items":"int"}
B: {"type":"array", "items":"int", "java.typehint":"java.util.List"}
C: {"type":"array", "items":"int", "java.typehint":"intarray"}
{code}

One could argue that these are equal (the serialized form is the same, and the 
extra properties are only specific to one language implementation).
The props here are just specialized documentation.

I think we have two consistent choices:
* Schemas are equal only if all aliases, props, and doc fields match exactly -- 
in other words if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone.
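
The two choices can be compared on the array example above with a short sketch. ArraySchemaSketch, strictEquals, and structuralEquals are hypothetical names used only for illustration; the model covers an array's item type and props and nothing else.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of an array schema: item type plus free-form props.
final class ArraySchemaSketch {
    final String items;
    final Map<String, String> props;

    ArraySchemaSketch(String items, Map<String, String> props) {
        this.items = items;
        this.props = Collections.unmodifiableMap(new TreeMap<>(props));
    }

    // Choice 1: everything must match exactly, props included.
    static boolean strictEquals(ArraySchemaSketch x, ArraySchemaSketch y) {
        return x.items.equals(y.items) && x.props.equals(y.props);
    }

    // Choice 2: only type and structure count; props are treated as documentation.
    static boolean structuralEquals(ArraySchemaSketch x, ArraySchemaSketch y) {
        return x.items.equals(y.items);
    }
}
```

Under strictEquals the three array schemas A, B, and C are all different; under structuralEquals they are all equal. Both definitions are symmetric and transitive, which is what makes either one a consistent choice.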

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to 
> store/retrieve fields and schemas in hash-based data structures (eg. 
> HashMap). Since all fields and schemas are immutable (with the exception of 
> RecordSchema allowing deferred setting of Fields) it makes sense to cache the 
> hash code on the object instead of recalculating every time the hashCode 
> method gets called. 
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
