Github user superbobry commented on the issue:

    https://github.com/apache/spark/pull/19369
  
    > Yes, if you have evidence this is a hotspot, then this does look like a 
valid fix. 
    
    I don't think it's a hotspot (otherwise it would probably have been 
reported long ago). I do think, however, that there is no reason to stick to 
the old implementation and produce garbage on every `hashCode` call. 
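    To illustrate the kind of allocation pattern at stake, here is a hedged 
sketch (with hypothetical class names, not the actual code this PR touches): 
a `hashCode` that rebuilds a `String` on every call versus one that combines 
the fields directly.
    
    ```scala
    // Hypothetical illustration: the id is logically (rddId, splitIndex).
    final class StringlyKeyed(val rddId: Int, val splitIndex: Int) {
      // Allocates a fresh String (and its backing char[]) per call.
      def name: String = "rdd_" + rddId + "_" + splitIndex
      override def hashCode: Int = name.hashCode
      override def equals(other: Any): Boolean = other match {
        case that: StringlyKeyed => name == that.name
        case _ => false
      }
    }
    
    // Allocation-free alternative: hash the fields themselves.
    final class FieldKeyed(val rddId: Int, val splitIndex: Int) {
      override def hashCode: Int = 31 * rddId + splitIndex
      override def equals(other: Any): Boolean = other match {
        case that: FieldKeyed => rddId == that.rddId && splitIndex == that.splitIndex
        case _ => false
      }
    }
    ```
    
    Both versions satisfy the `hashCode`/`equals` contract; the difference is 
only that the second one creates no temporary objects when used as a hash-map 
key in a hot loop.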
    
    On our side, the issue popped up while profiling GC pauses on the driver 
during a shuffle. The pauses were caused by an unrelated issue, but we did 
notice a surprisingly large number of `char[]` and `String` instances (~10% 
of all objects on the heap) for a job which primarily did double operations. 
Note that the percentage is in terms of the number of objects, not their size.
    
    You can reproduce the issue with a toy example which doesn't do any 
`String` allocations.
    
    ```scala
    import java.util.concurrent.ThreadLocalRandom
    
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel
    
    val sc = SparkSession.builder()
        .appName("demo")
        .master("local[*]")
        .getOrCreate()
        .sparkContext
    
    // Note: 1L << 63 overflows to Long.MinValue, hence Long.MaxValue here.
    val rdd = sc.range(0, Long.MaxValue, numSlices = 1000)
        .map(_ => ThreadLocalRandom.current.nextLong())
        .map(n => n % 1000 -> n)
        .persist(StorageLevel.DISK_ONLY)
    rdd.count()
    while (true) {
      rdd.reduceByKey(_ + _).collect()
    }
    ```
    
    Run it and monitor the heap with `jmap`. Here's what I get after a couple 
of minutes:
    
    ```
    jmap -histo $PID | head -n 10
    
     num     #instances         #bytes  class name
    ----------------------------------------------
       1:       1895743      137954064  [C
       2:        111418       65757664  [B
       3:        944978       45248496  [Ljava.lang.Object;
       4:           831       27243504  [Lorg.apache.spark.unsafe.memory.MemoryBlock;
       5:         99471       22199824  [I
       6:        865400       20769600  java.lang.String
       7:        761323       18271752  scala.Tuple2
    ```
