Github user advancedxy commented on the pull request:

    https://github.com/apache/spark/pull/4783#issuecomment-96685117
  
    Ping @shivaram, @rxin. I finally got some time to look at the code.
    The ExternalSorterSuite kept failing because there was no spilling when we 
just insert 100000 elements. 
    I don't know how 100000 was calculated, maybe you guys can give some 
related info. Before this pr, The size of  an integer instance is 24 bytes on a 
64-bit JVM with UseCompressedOops on. Now, It's a correct 16 bytes. The 
ExternalSorterSuit inserts integers into buffers, so a reduced size of integer 
class results a smaller total estimated size, hence no spilling. 
    
    Another Note: When I was digging the ExternalSorterSuite and 
ExternalSorter, I noticed something wrong. The ExternalSorter uses the 
SizeTrackingPairBuffer, SizeTrackingPairBuffer[(Int, Int), Int] to be 
specifically in the test suite. The SizeTrackingPairBuffer[(Int, Int), Int] 
increases only 40 bytes per insertion(estimated by SizeEstimator). The 40 bytes 
 consisted of 24 bytes of Tuple2 and 16 bytes Integer. I changes the insertion 
number based on this fact (24 + 24 per insertion to 24 + 16 per insertion, 48 * 
100000 = 40 * 120000). However 24 bytes of Tuple2 is not right, It should be 
the specialized version of Tuple2[Int, Int] which is 32 bytes. It's not 
SizeEstimator's fault as It can only get the Tuple2 class rather than 
scala.Tuple2$mcII$sp. I doubt it may be something related to the scala 
compiler, I will look into this problem or someone more familiar with scala 
compiler should.
    
    As for this pr, I think it's good to go. Hope we can get it merged before 
1.4


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to