onebox-li commented on code in PR #2086:
URL:
https://github.com/apache/incubator-celeborn/pull/2086#discussion_r1392079035
##########
common/src/main/scala/org/apache/celeborn/common/meta/WorkerInfo.scala:
##########
@@ -232,8 +235,21 @@ class WorkerInfo(
}
override def hashCode(): Int = {
- val state = Seq(host, rpcPort, pushPort, fetchPort, replicatePort)
- state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
+ var h = hash
+ if (h == 0 || isZeroHash) {
+ val state = Array(host, rpcPort, pushPort, fetchPort, replicatePort)
+ var i = 0
+ while (i < state.length) {
+ h = 31 * h + state(i).hashCode()
+ i = i + 1
+ }
+ if (h == 0) {
+ isZeroHash = true
+ } else {
+ hash = h
+ }
+ }
+ h
Review Comment:
Thanks @mridulm. I actually thought about that before and have a test shown
as below.
```
val host = generateRandomIPv4Address
val rpcPort: Integer = Random.nextInt(65535)
val pushPort: Integer = Random.nextInt(65535)
val fetchPort: Integer = Random.nextInt(65535)
val replicatePort: Integer = Random.nextInt(65535)
val infoSeq = Seq(host, rpcPort, pushPort, fetchPort, replicatePort)
val infoArray = Array(host, rpcPort, pushPort, fetchPort, replicatePort)
// origin
var startTime = System.nanoTime()
infoSeq.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
var endTime = System.nanoTime()
val originTime = endTime - startTime
println("origin costs %d ns, %.2fx".format(originTime,
originTime.toDouble / originTime))
System.gc()
// for seq
startTime = System.nanoTime()
var forHash1 = 0
for (element <- infoSeq) {
forHash1 = 31 * forHash1 + element.hashCode()
}
endTime = System.nanoTime()
val forSeqTime = endTime - startTime
println("for seq costs %d ns, %.2fx".format(forSeqTime,
originTime.toDouble / forSeqTime))
System.gc()
// for array
startTime = System.nanoTime()
var forHash2 = 0
for (element <- infoArray) {
forHash2 = 31 * forHash2 + element.hashCode()
}
endTime = System.nanoTime()
val forArrayTime = endTime - startTime
println("for array costs %d ns, %.2fx".format(forArrayTime,
originTime.toDouble / forArrayTime))
System.gc()
// while Seq
startTime = System.nanoTime()
var whileHash1 = 0
var i = 0
while (i < infoSeq.size) {
whileHash1 = 31 * whileHash1 + infoSeq(i).hashCode()
i = i + 1
}
endTime = System.nanoTime()
val whileSeqTime = endTime - startTime
println("while seq costs %d ns, %.2fx".format(whileSeqTime,
originTime.toDouble / whileSeqTime))
System.gc()
// while array
startTime = System.nanoTime()
var whileHash2 = 0
i = 0
while (i < infoArray.length) {
whileHash2 = 31 * whileHash2 + infoArray(i).hashCode()
i = i + 1
}
endTime = System.nanoTime()
val whileArrayTime = endTime - startTime
println("while array costs %d ns, %.2fx".format(whileArrayTime,
originTime.toDouble / whileArrayTime))
System.gc()
// objects hash
startTime = System.nanoTime()
Objects.hash(host, rpcPort, pushPort, fetchPort, replicatePort)
endTime = System.nanoTime()
val objectHashTime = endTime - startTime
println("object hash costs %d ns, %.2fx".format(objectHashTime,
originTime.toDouble / objectHashTime))
System.gc()
// direct hash
startTime = System.nanoTime()
var result = host.hashCode()
result = 31 * result + rpcPort.hashCode()
result = 31 * result + pushPort.hashCode()
result = 31 * result + fetchPort.hashCode()
result = 31 * result + replicatePort.hashCode()
endTime = System.nanoTime()
val directHashTime = endTime - startTime
println("directly hash costs %d ns, %.2fx".format(directHashTime,
originTime.toDouble / directHashTime))
System.gc()
// direct hash2
startTime = System.nanoTime()
var result2 = 31 * (31 * (31 * (31 * host.hashCode() +
rpcPort.hashCode()) + pushPort.hashCode()) + fetchPort.hashCode()) +
replicatePort.hashCode()
endTime = System.nanoTime()
val directHashTime2 = endTime - startTime
println("directly hash2 costs %d ns, %.2fx".format(directHashTime2,
originTime.toDouble / directHashTime2))
```
And I get the following result:
```
origin costs 1762813 ns, 1.00x
for seq costs 1158450 ns, 1.52x
for array costs 5844185 ns, 0.30x
while seq costs 95599 ns, 18.44x
while array costs 8210 ns, 214.72x
object hash costs 37889 ns, 46.53x
directly hash costs 40092 ns, 43.97x
directly hash2 costs 6498 ns, 271.29x
```
The way mentioned here is called `directly hash`, seems it is worse than
that of while array. For the best performance `directly hash2` which expand all
calculations, it is a little better than while array. I am also a little
confused about the reason here. I think it may be some compiler optimization.
And between `directly hash2` and `while array` , I prefer while array because
it's more readable.
> we dont need to cache WorkerInfo's hashCode (it is just a few cheap
arithmetic computations)
I also think the cache can be removed. The calculation here is really cheap.
Introducing the if branch may be worse here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]