Hi,

Yes, you can use `isBackPressured` to monitor a task's back-pressure.
However, keep in mind:
a) You will miss the back-pressure visualization that is available in
1.13's WebUI.
b) `isBackPressured` is a sampling-based metric. If your job has varying
load, for example all windows firing at the same processing time every
couple of seconds and causing intermittent back-pressure, this metric will
flip between `true` and `false` depending on when the sample is taken.
c) `isBackPressured` is slightly less accurate than
`backPressuredTimeMsPerSecond`. In some corner cases it can briefly return
`true` while the task is still running, whereas the time-based metrics
work differently and are much more accurate.
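
To illustrate point b), here is a small, self-contained toy simulation.
It is only a sketch and does not use any Flink API: the load pattern (a
task blocked for the first 400 ms of every 2 s), the 700 ms sampling
interval and all function names are assumptions made up for the example.
It compares a point-in-time boolean, similar in spirit to
`isBackPressured`, with a time-based value similar in spirit to
`backPressuredTimeMsPerSecond`.

```python
def is_back_pressured_at(t_ms, period_ms=2000, burst_ms=400):
    """Hypothetical load: blocked for the first burst_ms of every period_ms."""
    return t_ms % period_ms < burst_ms


def sampled_flags(duration_ms, sample_every_ms):
    """Point-in-time samples of the boolean state, one per sampling interval."""
    return [is_back_pressured_at(t) for t in range(0, duration_ms, sample_every_ms)]


def back_pressured_ms_per_second(duration_ms):
    """Average number of blocked milliseconds per second over the whole run."""
    blocked_ms = sum(1 for t in range(duration_ms) if is_back_pressured_at(t))
    return blocked_ms / (duration_ms / 1000)


if __name__ == "__main__":
    # The sampled flag depends on where each sample happens to land.
    print("sampled flag:", sampled_flags(10_000, 700))
    # The time-based value summarizes the same load as one stable number.
    print("time-based  :", back_pressured_ms_per_second(10_000), "ms/s")
```

With this pattern the sampled flag comes out as a mix of `True` and
`False`, while the time-based value stays the same for the same load,
which is why the latter is easier to alert on and to plot.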

As for backporting the patches: if you want to create a custom Flink build,
it should be doable. There will certainly be some conflicts, so you will
need to understand Flink's code.

Best,
Piotrek

Wed, 7 Apr 2021 at 02:32 Lu Niu <qqib...@gmail.com> wrote:

> Hi, Piotr
>
> Thanks for replying!
>
> We don't have a plan to upgrade to 1.13 in the short term. We are using
> Flink 1.11 and I notice there is a metric called isBackPressured. Is that
> enough to solve 1? If not, would backporting patches regarding
> backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> work? And do you have an estimate of how difficult it is?
>
>
> Best
> Lu
>
>
>
> On Tue, Apr 6, 2021 at 12:18 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
> > Hi,
> >
> > Lately we overhauled the backpressure detection [1] and a screenshot
> > preview of those efforts is attached here [2]. I encourage you to check
> > the 1.13 RC0 build and how the current mechanism works for you [3]. To
> > support those WebUI changes we have added a couple of new metrics:
> > backPressuredTimeMsPerSecond, busyTimeMsPerSecond and
> > idleTimeMsPerSecond.
> >
> > 1. I believe that solves 1.
> > 2. This still requires a bit of manual investigation. Once you locate
> > the backpressuring task, you can check the detailed subtask stats to see
> > whether all parallel instances are uniformly backpressured/busy or not.
> > If you would like to add a hint such as "it looks like you have a data
> > skew in Task XYZ", I believe that could be added to the WebUI.
> > 3. The tricky part is how to display this kind of information. Currently
> > I would recommend just exporting/reporting the
> > backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> > metrics for every task to an external system and displaying them, for
> > example, in Grafana.
> >
> > The blog post you are referencing is quite outdated, especially with
> > those new changes from 1.13. I'm hoping to write a new one pretty soon.
> >
> > Piotrek
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14712
> > [2]
> > https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
> > [3]
> > http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%3c1d2412ce-d4d0-ed50-6181-1b610e16d...@apache.org%3E
> >
> > Mon, 5 Apr 2021 at 23:20 Lu Niu <qqib...@gmail.com> wrote:
> >
> > > Hi, Flink dev
> > >
> > > Lately, we want to develop some tools to:
> > > 1. show backpressure operator without manual operation
> > > 2. Provide suggestions to mitigate back pressure after checking data
> > > skew, external service RPC etc.
> > > 3. Show back pressure history
> > >
> > > Could anyone share their experience with such tooling?
> > > Also, I notice backpressure monitoring and detection is mentioned
> > > across multiple places. Could someone help to explain how these connect
> > > to each other? Maybe some of them are outdated? Thanks!
> > >
> > > 1. The official doc introduces monitoring back pressure through web UI.
> > > https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
> > > 2. In https://flink.apache.org/2019/07/23/flink-network-stack-2.html,
> > > it says outPoolUsage, inPoolUsage metrics can be used to determine back
> > > pressure.
> > > 3. Latest flink version introduces metrics called "isBackPressured",
> > > but I didn't find related documentation on usage.
> > >
> > > Best
> > > Lu
> > >
> >
>
