Github user f7753 commented on the issue:

    https://github.com/apache/spark/pull/14239
  
    @tgravescs  Thank you.
    
    Currently, I do not load all of the data into memory. I use the parameter `spark.shuffle.prepare.open` to switch this mechanism on/off and `spark.shuffle.prepare.count` to control the number of blocks to cache, so the user has control over the memory used for pre-fetched blocks based on their machine's resources. A minimal sketch of how these settings would be applied is shown below.
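    
    For reference, a minimal sketch of setting these parameters through `SparkConf` (the `spark.shuffle.prepare.*` keys are the ones proposed in this PR, not upstream Spark configs; the values here are just placeholders):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    // Sketch only: spark.shuffle.prepare.* are the configs proposed in this PR,
    // not part of upstream Spark; the values below are illustrative placeholders.
    val conf = new SparkConf()
      .setAppName("shuffle-prefetch-demo")
      .set("spark.shuffle.prepare.open", "true")   // enable the pre-fetch path
      .set("spark.shuffle.prepare.count", "4")     // number of shuffle blocks to cache ahead
    
    val sc = new SparkContext(conf)
    ```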
    
    The OS cache may not have much impact here (if my understanding is wrong, please correct me), since a shuffle block produced by the map side is normally read only once. Once a shuffle block has been consumed by the reduce side it is of no further use, so it may still be sitting in the write buffer. If there is enough memory this does not slow down reading; if not, we can use the limited memory to pre-load the data. Once a transfer succeeds, the memory buffer is released and reused to load the data for the next `FetchRequest`, until all the data has been sent to the reduce side. A rough sketch of this loop follows.
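    
    To make the idea concrete, here is a rough sketch of the bounded buffer-recycling loop (my own simplification, not the actual code in this PR; `readBlockFromDisk` and `sendToReducer` stand in for the real shuffle block reader and the network transfer layer):
    
    ```scala
    import java.nio.ByteBuffer
    import scala.collection.mutable
    
    // Rough sketch of the bounded pre-fetch idea (not the actual PR code).
    class BlockPrefetcher(prepareCount: Int,
                          readBlockFromDisk: String => ByteBuffer,
                          sendToReducer: (String, ByteBuffer) => Unit) {
    
      private val prefetched = mutable.Queue.empty[(String, ByteBuffer)]
    
      def serve(blockIds: Seq[String]): Unit = {
        val pending = mutable.Queue(blockIds: _*)
    
        // Warm up: read at most `prepareCount` blocks into memory up front.
        while (prefetched.size < prepareCount && pending.nonEmpty) {
          val id = pending.dequeue()
          prefetched.enqueue(id -> readBlockFromDisk(id))
        }
    
        // Send one block, then reuse the freed slot to pre-read the next one,
        // until every block has been shipped to the reduce side.
        while (prefetched.nonEmpty) {
          val (id, buf) = prefetched.dequeue()
          sendToReducer(id, buf) // the buffer slot is free once this returns
          if (pending.nonEmpty) {
            val next = pending.dequeue()
            prefetched.enqueue(next -> readBlockFromDisk(next))
          }
        }
      }
    }
    ```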
    
    I have implemented and tested this on branches 1.4 and 1.6 using Intel HiBench 4.0 TeraSort with a 1 TB data size, and I got about a 30% performance improvement on a 5-node cluster, where each node has 96 GB of memory, Xeon E5 v3 CPUs, and 7200 RPM disks.
    
    We could also survey some papers and refer to them to make this more complete, e.g. "HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment".
    
    Thanks for your feedback. I would be glad to cooperate on any work you want me to do; I love Spark.

