[
https://issues.apache.org/jira/browse/FLINK-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728278#comment-16728278
]
zhijiang commented on FLINK-10661:
----------------------------------
[~NicoK], thanks for your reply! I should describe it more directly. :)
The floating buffers are requested based on the sender's backlog, and the receiver
always tries to announce {{backlog+initial_credit}} to the senders in order to keep
the transport smooth. The {{initial_credit}} currently comes from the parameter
{{taskmanager.network.memory.buffers-per-channel}}.
I think we should define a separate parameter for this extra credit, because if
we set the {{per-channel}} parameter to 1, then an overhead of only 1 extra credit
beyond the backlog might not be enough to keep the transport smooth. In other
words, the sender may need to wait for credits before it can register the
subpartition as available for transfer.
In this case, the output queue usage is 100%, but the input queue usage may not
reach 100%. A separate parameter for the extra credits would give us more
control. For example, if {{per-channel}} is 1, we might still try to announce
{{backlog+3}} credits each time.
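To make the proposal concrete, here is a minimal sketch of the calculation I have in mind. It is not the actual Flink code: the class {{CreditSketch}}, the method {{targetCredits}} and the idea of passing the overhead as a plain int are assumptions made only for illustration, and no name for the new parameter has been decided.

```java
/**
 * Hypothetical sketch of the receiver-side credit target discussed above.
 * This is not Flink's implementation; all names here are made up for illustration.
 */
final class CreditSketch {

    private CreditSketch() {
    }

    /**
     * Number of credits the receiver tries to announce for one channel:
     * the sender's backlog plus a fixed overhead of extra credits.
     *
     * @param senderBacklog buffers queued on the sender side for this channel
     * @param extraCredits  overhead on top of the backlog; today this is the value of
     *                      taskmanager.network.memory.buffers-per-channel, the proposal
     *                      is to read it from a separate (not yet named) parameter
     */
    static int targetCredits(int senderBacklog, int extraCredits) {
        if (senderBacklog < 0 || extraCredits < 0) {
            throw new IllegalArgumentException("backlog and extra credits must be non-negative");
        }
        return senderBacklog + extraCredits;
    }

    public static void main(String[] args) {
        int backlog = 5;

        // Current behaviour as described above, with buffers-per-channel tuned to 1.
        System.out.println("overhead tied to per-channel=1: " + targetCredits(backlog, 1));

        // Proposed behaviour: per-channel stays 1, the overhead is tuned separately to 3.
        System.out.println("separate overhead of 3:         " + targetCredits(backlog, 3));
    }
}
```

With a backlog of 5, the first call targets 6 credits and the second targets 8, so the sender gets more headroom without raising the number of exclusive buffers per channel.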
> Initial credit should be configured in a separate parameter
> -----------------------------------------------------------
>
> Key: FLINK-10661
> URL: https://issues.apache.org/jira/browse/FLINK-10661
> Project: Flink
> Issue Type: Sub-task
> Components: Network
> Affects Versions: 1.5.4, 1.6.1
> Reporter: zhijiang
> Assignee: zhijiang
> Priority: Minor
>
> In credit-based network flow control, the required credits on the receiver side
> are calculated as the backlog plus the initial credit, which is equal to the
> value of the parameter {{taskmanager.network.memory.buffers-per-channel}}. We add
> the initial credit as overhead on top of the backlog in order to reduce the
> chance of the sender waiting for credits. Ideally, sender and receiver work
> concurrently and do not block each other.
>
> We found a bad case in some rebalance or rescale scenarios: the outqueue
> usage reaches 100% on the sender side, but the inqueue usage is about 50% or
> less. That means not enough credits are announced to the sender side
> although there are still many free credit resources on the receiver side, so
> resources are wasted.
>
> It would be better if we could adjust the credit overhead to debug
> performance online. That needs a separate parameter to define the initial
> credit, not mixed with {{taskmanager.network.memory.buffers-per-channel}}.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)