CodingCat opened a new pull request, #2358:
URL: https://github.com/apache/incubator-celeborn/pull/2358

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     - Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX] 
Your PR title ...'.
     - Be sure to keep the PR description updated to reflect all changes.
     - Please write your PR title to summarize what this PR proposes.
     - If possible, provide a concise example to reproduce the issue for a 
faster review.
   -->
   
   ### What changes were proposed in this pull request?
   
   while SortBasedWriter has less memory footprint than HashBasedWriter, it 
suffers from performance issue when we have  many partitions and the write 
buffer is filled with small chunks of data quickly 
   
    for example, if sort buffer size is 32K, you have 4 partitions and 128K 
data in total, the data distribution is like partition A, B, C, D, each time it 
comes with 8K per partition.... in this case, you need to compress and send 
small 8K chunk 4 times per partition for , the cost would become very high. If 
you use hashbasedwriter, it doesn't have this problem since the push only 
happens when the per-partition buffer is full. Of course , larger sort buffer 
size can mitigate the issue, but tuning sort buffer size per job is a tedious 
work 
   
   this PR introduces a new feature that we measure total size of pushed bytes 
and pushed count as well as the "should-pushed" bytes and counts (should-push 
means that , the data we pushed is larger than CLIENT_PUSH_BUFFER_MAX_SIZE (in 
another word, we will trigger a push regardless of writer type))
   
   when actualPushedBytes/actualPushedCounts > (1 + Threshold) * 
(ShouldPushBytes/ShouldPushCounts), we will enlarge the sort buffer size by 1X 
to try to buffer more data before pushing  (the max size of sortBuffer would be 
capped at # of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE)
    
   ### Why are the changes needed?
   
   to reduce perf cost in sortbased writer
   
   ### Does this PR introduce _any_ user-facing change?
   
   no, but have 2 extra configurations
   
   ### How was this patch tested?
   
   in prod of our company and also unit test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to