[
https://issues.apache.org/jira/browse/FLINK-14472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958617#comment-16958617
]
Piotr Nowojski commented on FLINK-14472:
----------------------------------------
This issue should actually already be fixed, [~kkrugler].
As [~zjwang] mentioned, the current way of checking back-pressure is very
fragile and has been broken at least a couple of times in the past (usually fixed
before the release). For the same reasons, it does not cover the case where a
Task is back-pressured in some non-Task thread.
The correctness issue was mostly solved (accidentally) by introducing the
single-threaded (mailbox) model of task execution, FLINK-12477 (we moved
execution of the processing-time timers, FLINK-12481, and emitting results from
{{AsyncWaitOperator}}, FLINK-12958, to the task thread), but it still persists
for the sources.
Re-implementing the back-pressure monitor as proposed in this ticket will make
it more stable and more efficient, and should fix this issue once and for all
in the remaining cases.
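To illustrate why stack sampling is fragile, here is a minimal sketch of the
idea, not Flink's actual monitor code. The class {{StackSamplingSketch}} and its
helpers are hypothetical; only the method name `requestBufferBuilderBlocking`
comes from the ticket. The check sees only the thread it samples and depends on
a hard-coded method name, so a request blocking on another thread goes
unnoticed if only the task thread is sampled.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical illustration of stack-sampling detection; not Flink's code. */
final class StackSamplingSketch {

    // Method name the sampler searches for (taken from the ticket text).
    static final String BLOCKING_METHOD = "requestBufferBuilderBlocking";

    private StackSamplingSketch() {}

    /** Returns true if any frame of the sampled thread is the blocking request. */
    static boolean looksBackPressured(Thread sampled) {
        for (StackTraceElement frame : sampled.getStackTrace()) {
            if (BLOCKING_METHOD.equals(frame.getMethodName())) {
                return true;
            }
        }
        return false;
    }

    /** Stand-in for a blocking buffer request: parks until a buffer is "recycled". */
    static void requestBufferBuilderBlocking(CountDownLatch recycled)
            throws InterruptedException {
        recycled.await();
    }
}
```

Sampling a thread parked in the stand-in method reports back-pressure, while
sampling any other thread does not, even though the task as a whole is blocked.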
> Implement back-pressure monitor with non-blocking outputs
> ---------------------------------------------------------
>
> Key: FLINK-14472
> URL: https://issues.apache.org/jira/browse/FLINK-14472
> Project: Flink
> Issue Type: Task
> Components: Runtime / Network
> Reporter: zhijiang
> Assignee: Yingjie Cao
> Priority: Minor
> Fix For: 1.10.0
>
>
> Currently the back-pressure monitor relies on detecting task threads that are
> stuck in `requestBufferBuilderBlocking`. There are actually two cases that
> cause back-pressure at the moment:
> * There are no available buffers in `LocalBufferPool` and the quota granted
> by the global pool is also exhausted. Then we need to wait for buffers to be
> recycled to `LocalBufferPool`.
> * There are no available buffers in `LocalBufferPool`, but the quota has not
> been used up. The request to the global pool then blocks because the global
> pool has no available buffers, and we need to wait for buffers to be
> recycled to the global pool.
> We are trying to implement non-blocking network output in FLINK-14396, so the
> back-pressure monitor should be adjusted accordingly once the non-blocking
> output is used in practice.
> In detail, we want to move away from the current approach of analyzing the
> task thread stack, which has some drawbacks discussed before:
> * If `requestBuffer` is not triggered by the task thread, the current
> monitor does not work in practice.
> * The current monitor is heavy-weight and fragile because it needs to
> understand details of the `LocalBufferPool` implementation.
> We could provide a transparent method for the monitor caller to get the
> backpressure result directly, and hide the implementation details in the
> LocalBufferPool.
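A rough sketch of the direction proposed in the description, under simplifying
assumptions: `BufferPoolSketch`, `pollBuffer`, `recycle` and `isBackPressured`
are hypothetical names, and the global pool is reduced to a counter. This is
not the actual `LocalBufferPool` API. The point is that the pool answers the
back-pressure question itself, covering both cases listed above, without
anyone inspecting thread stacks.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified stand-in for a local buffer pool. */
final class BufferPoolSketch {
    private final Deque<byte[]> available = new ArrayDeque<>();
    private final int quota;          // max buffers this pool may take from the global pool
    private int requestedFromGlobal;  // buffers taken from the global pool so far
    private int globalFree;           // free buffers left in the simulated global pool

    BufferPoolSketch(int quota, int globalFree) {
        this.quota = quota;
        this.globalFree = globalFree;
    }

    /** Non-blocking request: returns null instead of parking the caller. */
    synchronized byte[] pollBuffer() {
        byte[] recycled = available.poll();
        if (recycled != null) {
            return recycled;
        }
        if (requestedFromGlobal < quota && globalFree > 0) {
            requestedFromGlobal++;
            globalFree--;
            return new byte[32];
        }
        return null;
    }

    /** Returns a buffer to the local pool, making it available again. */
    synchronized void recycle(byte[] buffer) {
        available.add(buffer);
    }

    /** True when no local buffer is free and none can be taken from the global pool. */
    synchronized boolean isBackPressured() {
        if (!available.isEmpty()) {
            return false;
        }
        boolean quotaExhausted = requestedFromGlobal >= quota; // first case
        boolean globalExhausted = globalFree == 0;             // second case
        return quotaExhausted || globalExhausted;
    }
}
```

A monitor would then only call `isBackPressured()`, regardless of which thread
issues the buffer requests.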