[https://issues.apache.org/jira/browse/FLINK-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309158#comment-17309158]
jinghaihang edited comment on FLINK-16404 at 3/29/21, 10:05 AM:
----------------------------------------------------------------
Hi Zhijiang [~zjwang], I encountered a problem:
With the same checkpoint interval (3 min), the Flink 1.12 version of the job shows
increased checkpoint durations and fails in production, while the Flink 1.9 job runs
normally.
[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/With-the-checkpoint-interval-of-the-same-size-the-Flink-1-12-version-of-the-job-checkpoint-time-consy-td42471.html]
As you said: "But considering that the checkpoint interval is generally not too short,
this side effect can be ignored in practice. We can further verify it via the existing
micro-benchmark."
Could my problem be related to this?
> Avoid caching buffers for blocked input channels before barrier alignment
> -------------------------------------------------------------------------
>
> Key: FLINK-16404
> URL: https://issues.apache.org/jira/browse/FLINK-16404
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Network
> Reporter: Zhijiang
> Assignee: Yingjie Cao
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.11.0
>
> Attachments: image-2021-02-22-15-27-57-983.png,
> image-2021-02-22-15-29-55-096.png, image-2021-02-22-15-30-03-318.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> One motivation of this issue is to reduce the in-flight data under back pressure in
> order to speed up checkpointing. The current default number of exclusive buffers per
> channel is 2. If we reduce it to 0 and somewhat increase the number of floating
> buffers as compensation, it might cause a deadlock, because all the floating buffers
> could be requested by blocked input channels and never recycled until barrier
> alignment.
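> Below is a back-of-the-envelope sketch (hypothetical numbers, not a real Flink
> configuration) of why this can deadlock: once the blocked channels have cached all
> the floating buffers, the unblocked channels have nothing left to receive data with,
> so their barriers never arrive and the cached buffers are never recycled.
> {code:java}
> // Hypothetical buffer accounting during barrier alignment (illustrative numbers only).
> public class FloatingBufferDeadlockSketch {
>     public static void main(String[] args) {
>         int exclusivePerChannel = 0;     // proposed reduction from the default of 2
>         int floatingBuffers = 8;         // shared pool compensating for the removed exclusive buffers
>         int blockedChannels = 4;         // channels that already delivered their barrier
>         int cachedPerBlockedChannel = 2; // buffers each blocked channel keeps caching
>
>         int heldByBlocked = blockedChannels * cachedPerBlockedChannel; // 8
>         int floatingLeft = floatingBuffers - heldByBlocked;            // 0
>         // Upper bound on what any single unblocked channel could still obtain.
>         int perUnblockedChannel = exclusivePerChannel + floatingLeft;  // 0
>
>         // With nothing available, the unblocked channels cannot receive any data, so
>         // their barriers never arrive and the buffers held above are never recycled.
>         System.out.println("Buffers available to an unblocked channel: " + perUnblockedChannel);
>     }
> }
> {code}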
> In order to solve the above deadlock concern, we can make some logic changes on both
> the sender and receiver sides (a receiver-side sketch follows the list):
> * Sender side: it should revoke the previously received credit after sending the
> checkpoint barrier, which means it will not send any following buffers until it
> receives new credits.
> * Receiver side: the respective channel releases its requested floating buffers once
> a barrier is received from the network. After barrier alignment, it requests floating
> buffers for the channels with a positive backlog and notifies the sender side of the
> available credits, so the sender can continue transporting buffers.
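> Below is a minimal sketch of the receiver-side behavior described above, using
> hypothetical class and method names rather than Flink's actual internal API: floating
> buffers are released as soon as a barrier arrives on a channel, and are re-requested
> for channels with a positive backlog once alignment finishes, after which the
> regained credit is announced to the sender.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> // Hypothetical abstractions standing in for Flink's network buffer machinery.
> interface Buffer {}
>
> interface FloatingBufferPool {
>     void recycle(Buffer buffer);
>     /** Hands out up to {@code count} buffers into {@code into}; returns the number granted. */
>     int request(int count, List<Buffer> into);
> }
>
> class BlockableInputChannel {
>     private final FloatingBufferPool floatingPool;
>     private final List<Buffer> floatingBuffers = new ArrayList<>();
>     private int senderBacklog; // backlog last announced by the sender
>     private boolean blocked;   // true between barrier receipt and alignment
>
>     BlockableInputChannel(FloatingBufferPool floatingPool) {
>         this.floatingPool = floatingPool;
>     }
>
>     /** Called when the sender announces how many buffers it still has queued. */
>     void onBacklogAnnouncement(int backlog) {
>         this.senderBacklog = backlog;
>     }
>
>     /** Called when a checkpoint barrier is read from the network on this channel. */
>     void onBarrierReceived() {
>         blocked = true;
>         // Release all floating buffers so that unblocked channels can use them,
>         // avoiding the deadlock described above.
>         for (Buffer buffer : floatingBuffers) {
>             floatingPool.recycle(buffer);
>         }
>         floatingBuffers.clear();
>     }
>
>     /** Called once every channel has delivered its barrier (alignment finished). */
>     void onAlignmentFinished() {
>         blocked = false;
>         // Re-request floating buffers only if the sender still has queued data, then
>         // announce the regained credit so the sender can resume transmission.
>         if (senderBacklog > 0) {
>             int granted = floatingPool.request(senderBacklog, floatingBuffers);
>             notifySenderOfCredit(granted);
>         }
>     }
>
>     boolean isBlocked() {
>         return blocked;
>     }
>
>     private void notifySenderOfCredit(int credit) {
>         // In a real implementation this would be a credit announcement message sent
>         // through the network stack back to the sender; omitted in this sketch.
>     }
> }
> {code}
> The sender-side half of the change is symmetric: after emitting the barrier it drops
> its current credit and queues outgoing buffers until a new credit announcement
> arrives for that channel.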
> Based on the above changes, we can also remove the `BufferStorage` component
> completely, because the receiver would never read buffers for blocked channels.
> Another possible benefit is that the floating buffers might be put to better use
> before barrier alignment.
> The only side effect is a somewhat cold start after barrier alignment: the sender
> side has to wait for credit feedback before it can transport data again, which
> affects latency and network throughput. But considering that the checkpoint interval
> is generally not too short, this side effect can be ignored in practice. We can
> further verify it via the existing micro-benchmark.
> Even after this ticket is done, we still cannot set the number of exclusive buffers
> to zero at the moment; there is another deadlock issue which will be solved
> separately in another ticket.