Re: [PR] [CELEBORN-1121] Improve WorkerInfo#hashCode method [incubator-celeborn]

via GitHub Mon, 13 Nov 2023 23:05:46 -0800


onebox-li commented on code in PR #2086:
URL: 
https://github.com/apache/incubator-celeborn/pull/2086#discussion_r1392079035



##########
common/src/main/scala/org/apache/celeborn/common/meta/WorkerInfo.scala:
##########
@@ -232,8 +235,21 @@ class WorkerInfo(
   }
 
   override def hashCode(): Int = {
-    val state = Seq(host, rpcPort, pushPort, fetchPort, replicatePort)
-    state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
+    var h = hash
+    if (h == 0 || isZeroHash) {
+      val state = Array(host, rpcPort, pushPort, fetchPort, replicatePort)
+      var i = 0
+      while (i < state.length) {
+        h = 31 * h + state(i).hashCode()
+        i = i + 1
+      }
+      if (h == 0) {
+        isZeroHash = true
+      } else {
+        hash = h
+      }
+    }
+    h

Review Comment:
   Thanks @mridulm. I actually thought about that before and have a test shown 
as below.
   ```
       val host = generateRandomIPv4Address
       val rpcPort: Integer = Random.nextInt(65535)
       val pushPort: Integer = Random.nextInt(65535)
       val fetchPort: Integer = Random.nextInt(65535)
       val replicatePort: Integer = Random.nextInt(65535)
       val infoSeq = Seq(host, rpcPort, pushPort, fetchPort, replicatePort)
       val infoArray = Array(host, rpcPort, pushPort, fetchPort, replicatePort)
   
       // origin
       var startTime = System.nanoTime()
       infoSeq.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
       var endTime = System.nanoTime()
       val originTime = endTime - startTime
       println("origin costs %d ns, %.2fx".format(originTime, 
originTime.toDouble / originTime))
   
       System.gc()
       // for seq
       startTime = System.nanoTime()
       var forHash1 = 0
       for (element <- infoSeq) {
         forHash1 = 31 * forHash1 + element.hashCode()
       }
       endTime = System.nanoTime()
       val forSeqTime = endTime - startTime
       println("for seq costs %d ns, %.2fx".format(forSeqTime, 
originTime.toDouble / forSeqTime))
   
       System.gc()
       // for array
       startTime = System.nanoTime()
       var forHash2 = 0
       for (element <- infoArray) {
         forHash2 = 31 * forHash2 + element.hashCode()
       }
       endTime = System.nanoTime()
       val forArrayTime = endTime - startTime
       println("for array costs %d ns, %.2fx".format(forArrayTime, 
originTime.toDouble / forArrayTime))
   
       System.gc()
       // while Seq
       startTime = System.nanoTime()
       var whileHash1 = 0
       var i = 0
       while (i < infoSeq.size) {
         whileHash1 = 31 * whileHash1 + infoSeq(i).hashCode()
         i = i + 1
       }
       endTime = System.nanoTime()
       val whileSeqTime = endTime - startTime
       println("while seq costs %d ns, %.2fx".format(whileSeqTime, 
originTime.toDouble / whileSeqTime))
   
       System.gc()
       // while array
       startTime = System.nanoTime()
       var whileHash2 = 0
       i = 0
       while (i < infoArray.length) {
         whileHash2 = 31 * whileHash2 + infoArray(i).hashCode()
         i = i + 1
       }
       endTime = System.nanoTime()
       val whileArrayTime = endTime - startTime
       println("while array costs %d ns, %.2fx".format(whileArrayTime, 
originTime.toDouble / whileArrayTime))
   
       System.gc()
       // objects hash
       startTime = System.nanoTime()
       Objects.hash(host, rpcPort, pushPort, fetchPort, replicatePort)
       endTime = System.nanoTime()
       val objectHashTime = endTime - startTime
       println("object hash costs %d ns, %.2fx".format(objectHashTime, 
originTime.toDouble / objectHashTime))
   
       System.gc()
       // direct hash
       startTime = System.nanoTime()
       var result = host.hashCode()
       result = 31 * result + rpcPort.hashCode()
       result = 31 * result + pushPort.hashCode()
       result = 31 * result + fetchPort.hashCode()
       result = 31 * result + replicatePort.hashCode()
       endTime = System.nanoTime()
       val directHashTime = endTime - startTime
       println("directly hash costs %d ns, %.2fx".format(directHashTime, 
originTime.toDouble / directHashTime))
   
       System.gc()
       // direct hash2
       startTime = System.nanoTime()
       var result2 = 31 * (31 * (31 * (31 * host.hashCode() + 
rpcPort.hashCode()) + pushPort.hashCode()) + fetchPort.hashCode()) + 
replicatePort.hashCode()
       endTime = System.nanoTime()
       val directHashTime2 = endTime - startTime
       println("directly hash2 costs %d ns, %.2fx".format(directHashTime2, 
originTime.toDouble / directHashTime2))
   ```
   And I get the following result:
   ```
   origin costs 1762813 ns, 1.00x
   for seq costs 1158450 ns, 1.52x
   for array costs 5844185 ns, 0.30x
   while seq costs 95599 ns, 18.44x
   while array costs 8210 ns, 214.72x
   object hash costs 37889 ns, 46.53x
   directly hash costs 40092 ns, 43.97x
   directly hash2 costs 6498 ns, 271.29x
   ```
   The way mentioned here is called `directly hash`, seems it is worse than 
that of while array. For the best performance `directly hash2` which expand all 
calculations,  it is a little better than while array. I am also a little 
confused about the reason here. I think it may be some compiler optimization. 
And between `directly hash2` and `while array` , I prefer while array because 
it's more readable.
   
   > we dont need to cache WorkerInfo's hashCode (it is just a few cheap 
arithmetic computations)
   
   I also think the cache can be removed. The calculation here is really cheap. 
Introducing the if branch may be worse here.
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CELEBORN-1121] Improve WorkerInfo#hashCode method [incubator-celeborn]

Reply via email to