Hi,

Yes, you can use `isBackPressured` to monitor a task's back-pressure. However, keep in mind:

a) You will miss the nicer visualization of this information that is available in 1.13's WebUI.
b) `isBackPressured` is a sampling-based metric. If your job has a varying load, for example all windows firing at the same processing time every couple of seconds and causing intermittent back-pressure, this metric will randomly show `true` or `false`.
c) `isBackPressured` is slightly less accurate than `backPressuredTimeMsPerSecond`. In some corner cases it can briefly return `true` while the task is still running, whereas the time-based metrics are computed in a different, much more accurate way.
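To illustrate point b), here is a minimal toy model of intermittent back-pressure. It is only a sketch and not how Flink computes either metric internally: the one-second cycle, the 200 ms blocked interval, the 700 ms sampling interval and both function names are assumptions made up for this example.

```python
# Toy model only: this is NOT how Flink computes its metrics internally.
PERIOD_MS = 1000   # hypothetical: windows fire once per second
BLOCKED_MS = 200   # hypothetical: task is back-pressured for 200 ms after each firing


def is_back_pressured_at(t_ms: int) -> bool:
    """Point-in-time view, analogous to one sample of a boolean metric."""
    return (t_ms % PERIOD_MS) < BLOCKED_MS


def back_pressured_ms_in_second(start_ms: int) -> int:
    """Time-based view: blocked milliseconds within [start_ms, start_ms + 1000)."""
    return sum(is_back_pressured_at(t) for t in range(start_ms, start_ms + 1000))


# Sampling every 700 ms lands on different phases of the one-second cycle.
samples = [is_back_pressured_at(t) for t in range(0, 7000, 700)]
print(samples)
print(back_pressured_ms_in_second(0), back_pressured_ms_in_second(350))
```

In this model the individual samples flip between `True` and `False` depending on where they land in the cycle, while the time-based value stays at 200 ms per second no matter where the one-second window starts.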
About backporting the patches: if you want to create a custom Flink build, it should be doable. There will certainly be some conflicts, so you will need to understand Flink's code.

Best,
Piotrek

On Wed, 7 Apr 2021 at 02:32, Lu Niu <qqib...@gmail.com> wrote:

> Hi, Piotr
>
> Thanks for replying!
>
> We don't have a plan to upgrade to 1.13 in the short term. We are using Flink
> 1.11 and I noticed there is a metric called isBackPressured. Is that enough
> to solve 1? If not, would backporting the patches for
> backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> work? And do you have an estimate of how difficult it would be?
>
> Best
> Lu
>
> On Tue, Apr 6, 2021 at 12:18 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> > Hi,
> >
> > Lately we overhauled the backpressure detection [1], and a screenshot
> > preview of those efforts is attached here [2]. I encourage you to check the
> > 1.13 RC0 build and see how the current mechanism works for you [3]. To support
> > those WebUI changes we have added a couple of new metrics:
> > backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond.
> >
> > 1. I believe that solves 1.
> > 2. This still requires a bit of manual investigation. Once you locate the
> > backpressuring task, you can check the detailed subtask stats to see whether
> > all parallel instances are uniformly backpressured/busy or not. If you would
> > like to add a hint such as "it looks like you have a data skew in Task XYZ",
> > that I believe could be added to the WebUI.
> > 3. The tricky part is how to display this kind of information. Currently I
> > would recommend just exporting/reporting the
> > backPressuredTimeMsPerSecond, busyTimeMsPerSecond and idleTimeMsPerSecond
> > metrics for every task to an external system and displaying them, for
> > example, in Grafana.
> >
> > The blog post you are referencing is quite outdated, especially with those
> > new changes from 1.13. I'm hoping to write a new one pretty soon.
> >
> > Piotrek
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14712
> > [2] https://issues.apache.org/jira/browse/FLINK-14814?focusedCommentId=17256926&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17256926
> > [3] http://mail-archives.apache.org/mod_mbox/flink-user/202104.mbox/%3c1d2412ce-d4d0-ed50-6181-1b610e16d...@apache.org%3E
> >
> > On Mon, 5 Apr 2021 at 23:20, Lu Niu <qqib...@gmail.com> wrote:
> >
> > > Hi, Flink dev
> > >
> > > Lately, we want to develop some tools to:
> > > 1. Show the back-pressured operator without manual operation
> > > 2. Provide suggestions to mitigate back pressure after checking data skew,
> > > external service RPC etc.
> > > 3. Show back pressure history
> > >
> > > Could anyone share their experience with such tooling?
> > > Also, I notice backpressure monitoring and detection is mentioned across
> > > multiple places. Could someone help explain how these connect to each
> > > other? Maybe some of them are outdated? Thanks!
> > >
> > > 1. The official doc introduces monitoring back pressure through the web UI:
> > > https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/monitoring/back_pressure.html
> > > 2. In https://flink.apache.org/2019/07/23/flink-network-stack-2.html, it
> > > says the outPoolUsage and inPoolUsage metrics can be used to determine back
> > > pressure.
> > > 3. The latest Flink version introduces a metric called "isBackPressured", but
> > > I didn't find related documentation on its usage.
> > >
> > > Best
> > > Lu
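
The manual check described in point 2 of the quoted reply (whether all parallel instances are uniformly back-pressured or busy) could be sketched roughly as follows. This is only an illustration: it assumes you have already collected one `busyTimeMsPerSecond` value per subtask from your metrics system, and the function name and the "twice the mean" threshold are arbitrary choices for the example, not anything Flink defines.

```python
from statistics import mean
from typing import List


def find_skewed_subtasks(busy_ms_per_subtask: List[float],
                         ratio_threshold: float = 2.0) -> List[int]:
    """Return indices of subtasks whose value is at least `ratio_threshold`
    times the mean across all subtasks (hypothetical skew heuristic)."""
    avg = mean(busy_ms_per_subtask)
    if avg == 0:
        return []  # nothing is busy, so there is nothing to flag
    return [i for i, value in enumerate(busy_ms_per_subtask)
            if value >= ratio_threshold * avg]


# One subtask far busier than the rest suggests data skew.
print(find_skewed_subtasks([100, 120, 900, 110]))
# Uniformly busy subtasks suggest the task as a whole is the bottleneck.
print(find_skewed_subtasks([800, 790, 810, 805]))
```

A non-empty result hints at skew in the flagged subtasks, while an empty result with high values everywhere points to the whole task being overloaded.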