CodingCat opened a new pull request, #2358:
URL: https://github.com/apache/incubator-celeborn/pull/2358
<!--
Thanks for sending a pull request! Here are some tips for you:
- Make sure the PR title start w/ a JIRA ticket, e.g. '[CELEBORN-XXXX]
Your PR title ...'.
- Be sure to keep the PR description updated to reflect all changes.
- Please write your PR title to summarize what this PR proposes.
- If possible, provide a concise example to reproduce the issue for a
faster review.
-->
### What changes were proposed in this pull request?
while SortBasedWriter has less memory footprint than HashBasedWriter, it
suffers from performance issue when we have many partitions and the write
buffer is filled with small chunks of data quickly
for example, if sort buffer size is 32K, you have 4 partitions and 128K
data in total, the data distribution is like partition A, B, C, D, each time it
comes with 8K per partition.... in this case, you need to compress and send
small 8K chunk 4 times per partition for , the cost would become very high. If
you use hashbasedwriter, it doesn't have this problem since the push only
happens when the per-partition buffer is full. Of course , larger sort buffer
size can mitigate the issue, but tuning sort buffer size per job is a tedious
work
this PR introduces a new feature that we measure total size of pushed bytes
and pushed count as well as the "should-pushed" bytes and counts (should-push
means that , the data we pushed is larger than CLIENT_PUSH_BUFFER_MAX_SIZE (in
another word, we will trigger a push regardless of writer type))
when actualPushedBytes/actualPushedCounts > (1 + Threshold) *
(ShouldPushBytes/ShouldPushCounts), we will enlarge the sort buffer size by 1X
to try to buffer more data before pushing (the max size of sortBuffer would be
capped at # of partitions * CLIENT_PUSH_BUFFER_MAX_SIZE)
### Why are the changes needed?
to reduce perf cost in sortbased writer
### Does this PR introduce _any_ user-facing change?
no, but have 2 extra configurations
### How was this patch tested?
in prod of our company and also unit test
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]