Tie Liu created AVRO-1428:
-----------------------------

             Summary: Schema.computeHash() to add if check to avoid unnecessary 
hashcode computation
                 Key: AVRO-1428
                 URL: https://issues.apache.org/jira/browse/AVRO-1428
             Project: Avro
          Issue Type: Improvement
          Components: java
            Reporter: Tie Liu


In current Schma.java we have following implementation:
  public final int hashCode() {
    if (hashCode == NO_HASHCODE)
      hashCode = computeHash();
    return hashCode;
  }

  int computeHash() { return getType().hashCode() + props.hashCode(); }

  While hashCode is doing the checking of "if (hashCode == NO_HASHCODE)", the 
computeHash method is not. But the computeHash method is being called from 
Schema$Field.hashCode and the subclasses hashCode implementation like following:
public int hashCode() { return name.hashCode() + schema.computeHash(); }  
//this is from Schema$Field class

  This is causing the the calculation of hashCode getting called unnecessarily 
extensively. The proposed changed is to add the "if" check inside the 
computeHash method instead:

    int computeHash()
    {
        if (hashCode == NO_HASHCODE)
        {
            hashCode = getType().hashCode() + props.hashCode();
        }
        return hashCode;
}


We did a simple test to compare the performance difference, below is a summary 
of the heap snapshot of comparing the difference:

As a test I wrote a small program that creates a HashMap<Schema.Field, 
Integer>() and enters a loop simply identifying whether various Schema.Field 
instances are keys in the map. Obviously this is a pathological test case, but 
when running with the current implementation of Schema.Field it has (in about 
30 seconds) used up nearly 8 GBytes of heap in instantiating intermediate 
objects associated with calling Schema.computeHash():

Heap
PSYoungGen      total 17432576K, used 8666481K [0x0000000340000000, 
0x0000000800000000, 0x0000000800000000)
  eden space 14942208K, 58% used 
[0x0000000340000000,0x0000000550f5c650,0x00000006d0000000)
  from space 2490368K, 0% used 
[0x0000000768000000,0x0000000768000000,0x0000000800000000)
  to   space 2490368K, 0% used 
[0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
ParOldGen       total 1048576K, used 0K [0x0000000300000000, 
0x0000000340000000, 0x0000000340000000)
  object space 1048576K, 0% used 
[0x0000000300000000,0x0000000300000000,0x0000000340000000)
PSPermGen       total 21504K, used 5782K [0x00000002fae00000, 
0x00000002fc300000, 0x0000000300000000)
  object space 21504K, 26% used 
[0x00000002fae00000,0x00000002fb3a5818,0x00000002fc300000)

When running with the modified implementation (and no other change) all the 
object allocation vanishes:

Heap
PSYoungGen      total 17432576K, used 896532K [0x0000000340000000, 
0x0000000800000000, 0x0000000800000000)
  eden space 14942208K, 6% used 
[0x0000000340000000,0x0000000376b852d0,0x00000006d0000000)
  from space 2490368K, 0% used 
[0x0000000768000000,0x0000000768000000,0x0000000800000000)
  to   space 2490368K, 0% used 
[0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
ParOldGen       total 1048576K, used 0K [0x0000000300000000, 
0x0000000340000000, 0x0000000340000000)
  object space 1048576K, 0% used 
[0x0000000300000000,0x0000000300000000,0x0000000340000000)
PSPermGen       total 21504K, used 5768K [0x00000002fae00000, 
0x00000002fc300000, 0x0000000300000000)
  object space 21504K, 26% used 
[0x00000002fae00000,0x00000002fb3a2240,0x00000002fc300000)

As a side-effect the test runs x3 faster with the modified hashCode() 
implementation.




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to