Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
> Yes, if you have evidence this is a hotspot, then this does look like a valid fix.
I don't think it's a hotspot (otherwise it would probably have been reported long ago). I do think, however, that there is no reason to stick to the old implementation and produce garbage on each `hashCode` call.
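For context, the general pattern is sketched below. The class and field names are made up for illustration and are not Spark's actual types; the point is only to contrast a `hashCode` that rebuilds a derived `String` (and its backing `char[]`) on every call with one computed from the fields directly.
```
// Illustrative sketch only; the names are hypothetical, not Spark's classes.
final case class ShuffleKey(shuffleId: Int, mapId: Long, reduceId: Int) {
  // A derived name, rebuilt on every access.
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"

  // Garbage-producing variant: every call allocates a fresh String
  // (plus its char[]) just to hash it.
  // override def hashCode: Int = name.hashCode

  // Allocation-free variant: hash the fields themselves.
  override def hashCode: Int =
    31 * (31 * shuffleId + (mapId ^ (mapId >>> 32)).toInt) + reduceId
}
```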
On our side, the issue popped up while profiling GC pauses on the driver during a shuffle. The pauses turned out to be caused by an unrelated issue, but we did notice a surprisingly large number of `char[]` and `String` instances (~10% of all objects on the heap) for a job that primarily did arithmetic on doubles. Note that the percentage is in terms of the number of objects, not their size.

You can reproduce the issue with a toy example that doesn't explicitly allocate any `String`s.
```
import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val sc = SparkSession.builder()
  .appName("demo")
  .master("local[*]")
  .getOrCreate()
  .sparkContext

// Random (key, value) pairs; the job itself never allocates a String.
// Note: 1L << 63 overflows to Long.MinValue, so use Long.MaxValue instead.
val rdd = sc.range(0, Long.MaxValue, numSlices = 1000)
  .map(_ => ThreadLocalRandom.current.nextLong())
  .map(n => n % 1000 -> n)
  .persist(StorageLevel.DISK_ONLY)

rdd.count()

while (true) {
  rdd.reduceByKey(_ + _).collect()
}
```
Run it and monitor the heap with `jmap`. Here's what I get after a couple of minutes:
```
jmap -histo $PID | head -n 10

 num     #instances         #bytes  class name
----------------------------------------------
   1:       1895743      137954064  [C
   2:        111418       65757664  [B
   3:        944978       45248496  [Ljava.lang.Object;
   4:           831       27243504  [Lorg.apache.spark.unsafe.memory.MemoryBlock;
   5:         99471       22199824  [I
   6:        865400       20769600  java.lang.String
   7:        761323       18271752  scala.Tuple2
```