[GitHub] [flink] alpinegizmo commented on a change in pull request #15811: [FLINK-22253][docs] Update back pressure monitoring docs with new WebUI changes

GitBox Mon, 03 May 2021 05:37:10 -0700


alpinegizmo commented on a change in pull request #15811:
URL: https://github.com/apache/flink/pull/15811#discussion_r625045499




##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.

Review comment:
       ```suggestion
   Keep this in mind if your job has a varying load. For example, a subtask with a constant load of 50% and another subtask that is alternating every second between fully loaded and idling will both have the same value
   of `busyTimeMsPerSecond`: around `500ms`.
   ```
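
To make the averaging caveat concrete, here is a minimal Python sketch. It is not Flink code: the function name `busy_time_ms_per_second` and the per-second samples are made up for illustration, and it only assumes that the reported value is an average over the last few seconds, as the paragraph describes.

```python
def busy_time_ms_per_second(busy_ms_per_interval):
    """Average busy time (ms per second) over a window of one-second intervals."""
    return sum(busy_ms_per_interval) / len(busy_ms_per_interval)


# Hypothetical 4-second reporting window, one entry per second.
constant_load = [500, 500, 500, 500]   # steadily 50% busy
alternating_load = [1000, 0, 1000, 0]  # fully loaded, then idle, repeating

# Both load patterns produce the same averaged metric value.
print(busy_time_ms_per_second(constant_load))
print(busy_time_ms_per_second(alternating_load))
```

The two subtasks behave very differently, yet the averaged metric cannot tell them apart, which is the point the sentence is making.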

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.

Review comment:
       ```suggestion
   These metrics are updated every couple of seconds, and the reported value represents the average time
   that the subtask was back pressured (or idle or busy) during the last couple of seconds.
   ```

##########
File path: docs/content/docs/ops/state/checkpoints.md
##########
@@ -194,13 +190,10 @@ independent of the end-to-end latency. Be aware unaligned 
checkpointing
 adds to I/O to the state backends, so you shouldn't use it when the I/O to
 the state backend is actually the bottleneck during checkpointing.
 
-Note that unaligned checkpoints is a brand-new feature that currently has the
+Note that unaligned checkpoints is a new feature that currently has the

Review comment:
       ```suggestion
   Note that unaligned checkpointing is a new feature that currently has the
   ```

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all 
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the 
raw values, tasks are
+also color codded to make the investigation easier.
 
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
 
-This means that the JobManager triggered a back pressure sample of the running 
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy 
tasks are colored red.
+All values in between are represented as shades between those three colors.
 
-Note that clicking the row, you trigger the sample for all subtasks of this 
operator.
+### Back Pressure Status
 
-{{< img src="/fig/back_pressure_sampling_in_progress.png" 
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed 
metrics.
 
-### Back Pressure Status
+{{< img src="/fig/back_pressure_subtasks.png" class="img-responsive" >}}
 
-If you see status **OK** for the tasks, there is no indication of back 
pressure. **HIGH** on the other hand means that the tasks are back pressured.
+If you see status **OK** for the subtasks, there is no indication of back 
pressure. **HIGH** on the
+other hand means that the subtasks are back pressured. Where status is defined 
in the following way:
 
-{{< img src="/fig/back_pressure_sampling_ok.png" class="img-responsive" >}}
+- **OK**: 0% <= back pressured <= 10%
+- **LOW**: 10% < back pressured <= 50%
+- **HIGH**: 50% < back pressured <= 100%
 
-{{< img src="/fig/back_pressure_sampling_high.png" class="img-responsive" >}}
+Additionally you can find the percentage each subtask is back pressured, idle 
or busy.

Review comment:
       ```suggestion
   Additionally, you can find the percentage of time each subtask is back pressured, idle, or busy.
   ```
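
For readers of this thread, the status thresholds listed in the hunk above can be sketched as a small Python function. This is only an illustration of the documented ranges, not the WebUI's actual code. The helper names are hypothetical, and the conversion from `backPressureTimeMsPerSecond` to a percentage (dividing by 10, since 1000 ms per second is 100%) is an assumption about how the percentage is derived.

```python
def back_pressured_percent(back_pressure_time_ms_per_second):
    """Convert a ms-per-second value to a percentage (assumes 1000 ms == 100%)."""
    return back_pressure_time_ms_per_second / 10.0


def back_pressure_status(percent):
    """Map a back pressured percentage to the documented OK / LOW / HIGH status."""
    if not 0 <= percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent <= 10:
        return "OK"    # 0% <= back pressured <= 10%
    if percent <= 50:
        return "LOW"   # 10% < back pressured <= 50%
    return "HIGH"      # 50% < back pressured <= 100%


print(back_pressure_status(back_pressured_percent(80)))
print(back_pressure_status(back_pressured_percent(300)))
print(back_pressure_status(back_pressured_percent(750)))
```

Note that the boundaries are inclusive on the upper end, so exactly 10% is still **OK** and exactly 50% is still **LOW**.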

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all 
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the 
raw values, tasks are
+also color codded to make the investigation easier.
 
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
 
-This means that the JobManager triggered a back pressure sample of the running 
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy 
tasks are colored red.
+All values in between are represented as shades between those three colors.
 
-Note that clicking the row, you trigger the sample for all subtasks of this 
operator.
+### Back Pressure Status
 
-{{< img src="/fig/back_pressure_sampling_in_progress.png" 
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed 
metrics.
 
-### Back Pressure Status
+{{< img src="/fig/back_pressure_subtasks.png" class="img-responsive" >}}
 
-If you see status **OK** for the tasks, there is no indication of back 
pressure. **HIGH** on the other hand means that the tasks are back pressured.
+If you see status **OK** for the subtasks, there is no indication of back 
pressure. **HIGH** on the
+other hand means that the subtasks are back pressured. Where status is defined 
in the following way:

Review comment:
       ```suggestion
   For subtasks whose status is **OK**, there is no indication of back pressure. **HIGH**, on the
   other hand, means that a subtask is back pressured. Status is defined in the following way:
   ```

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.

Review comment:
       ```suggestion
   Idleness, on the other hand, is determined by whether or not there is input available.
   ```

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.

Review comment:
       ```suggestion
   If a task has no available output buffers, then that task is considered back pressured.
   ```

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all 
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the 
raw values, tasks are
+also color codded to make the investigation easier.
 
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
 
-This means that the JobManager triggered a back pressure sample of the running 
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy 
tasks are colored red.
+All values in between are represented as shades between those three colors.
 
-Note that clicking the row, you trigger the sample for all subtasks of this 
operator.
+### Back Pressure Status
 
-{{< img src="/fig/back_pressure_sampling_in_progress.png" 
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed 
metrics.

Review comment:
       ```suggestion
   In the *Back Pressure* tab next to the job overview, you can find more detailed metrics.
   ```

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all 
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the 
raw values, tasks are
+also color codded to make the investigation easier.
 
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
 
-This means that the JobManager triggered a back pressure sample of the running 
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy 
tasks are colored red.

Review comment:
       ```suggestion
   Idling tasks are blue, fully back pressured tasks are black, and fully busy tasks are colored red.
   ```
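
The hunk above also says that the WebUI shows, per task, the maximum back pressure and busy values across all subtasks. A rough Python sketch of that aggregation follows. The dictionary layout and the function name `aggregate_task_metrics` are hypothetical and are not taken from the WebUI code.

```python
def aggregate_task_metrics(subtasks):
    """Take the maximum back pressure and busy value over all subtasks of a task."""
    return {
        "backPressureTimeMsPerSecond": max(
            s["backPressureTimeMsPerSecond"] for s in subtasks
        ),
        "busyTimeMsPerSecond": max(s["busyTimeMsPerSecond"] for s in subtasks),
    }


# Hypothetical metric values for a task with three parallel subtasks.
subtasks = [
    {"backPressureTimeMsPerSecond": 0, "busyTimeMsPerSecond": 950},
    {"backPressureTimeMsPerSecond": 120, "busyTimeMsPerSecond": 400},
    {"backPressureTimeMsPerSecond": 30, "busyTimeMsPerSecond": 610},
]

print(aggregate_task_metrics(subtasks))
```

Because each maximum is taken independently, the two values shown for a task may come from different subtasks, so a single hot subtask is enough to color the whole task.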

##########
File path: docs/content/docs/ops/state/checkpoints.md
##########
@@ -219,10 +212,9 @@ state. To support rescaling, watermarks should be stored 
per key-group in a
 union-state. We most likely will implement this approach as a general solution 
 (didn't make it into Flink 1.11.0).
 
-In the upcoming release(s), Flink will address these limitations and will
-provide a fine-grained way to trigger unaligned checkpoints only for the 
-in-flight data that moves slowly with timeout mechanism. These options will
-decrease the pressure on I/O in the state backends and eventually allow
-unaligned checkpoints to become the default checkpointing. 
+After enabling unaligned checkpoints, you can also specify the alignment 
timeout via
+`CheckpointConfig.setAlignmentTimeout(Duration)` or 
`execution.checkpointing.alignment-timeout` via
+the configuration file. When activate, checkpoints will start as aligned, but 
if the alignment for
+some task exceeds this timeout, this checkpoint will time out to unaligned 
checkpoint.

Review comment:
       ```suggestion
   `CheckpointConfig.setAlignmentTimeout(Duration)` or 
`execution.checkpointing.alignment-timeout` in
   the configuration file. When activated, each checkpoint will still begin as 
an aligned checkpoint, but if the alignment for some task exceeds this timeout, 
then the checkpoint will proceed as an unaligned checkpoint.
   ```
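
To illustrate the behavior the suggested wording describes, here is a minimal sketch of the decision for a single task. It is not Flink's implementation. `checkpoint_mode` is a hypothetical function, and the 30-second timeout is an arbitrary example value, not a Flink default.

```python
from datetime import timedelta


def checkpoint_mode(alignment_duration, alignment_timeout):
    """Return how a checkpoint proceeds for one task.

    Every checkpoint starts aligned; it proceeds as unaligned only if
    the alignment takes longer than the configured timeout.
    """
    if alignment_duration > alignment_timeout:
        return "unaligned"
    return "aligned"


timeout = timedelta(seconds=30)  # example value only
fast = checkpoint_mode(timedelta(seconds=5), timeout)
slow = checkpoint_mode(timedelta(seconds=45), timeout)
print(fast, slow)
```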

##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a 
task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for 
`Source`, this means that `Sink` is consuming data slower than `Source` is 
producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of 
your running tasks. The JobManager triggers repeated calls to 
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three 
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to 
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- 
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
 -->
-
-Internally, back pressure is judged based on the availability of output 
buffers. If there is no available buffer (at least one) for output, then it 
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in 
order to determine back pressure. The ratio you see in the web interface tells 
you how many of these samples were indicating back pressure, e.g. `0.01` 
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web 
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following 
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are 
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back 
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine 
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value 
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of 
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a 
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will 
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
 
+Internally, back pressure is judged based on the availability of output 
buffers.
+If there is no available buffer (at least one) for output, then it indicates 
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all 
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the 
raw values, tasks are
+also color codded to make the investigation easier.

Review comment:
       ```suggestion
   The WebUI aggregates the maximum value of the back pressure and busy metrics 
from all of the subtasks and presents those aggregated values inside the 
JobGraph. Besides displaying the raw values, tasks are
   also color-coded to make the investigation easier.
   ```
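
For readers of this hunk, the aggregation could be pictured roughly as below. This is a sketch, not the WebUI code: `aggregate_for_job_graph` and the sample values are hypothetical, and it assumes the maximum is taken independently for each metric.

```python
def aggregate_for_job_graph(subtask_metrics):
    """Take the per-metric maximum across all subtasks of one task."""
    names = ("backPressureTimeMsPerSecond", "busyTimeMsPerSecond")
    return {name: max(m[name] for m in subtask_metrics) for name in names}


# Hypothetical metric values for three subtasks of the same task.
subtasks = [
    {"backPressureTimeMsPerSecond": 0, "busyTimeMsPerSecond": 950},
    {"backPressureTimeMsPerSecond": 120, "busyTimeMsPerSecond": 400},
    {"backPressureTimeMsPerSecond": 30, "busyTimeMsPerSecond": 610},
]

task_view = aggregate_for_job_graph(subtasks)
print(task_view)
```

Taking the maximum means a single overloaded subtask is enough to mark the whole task as busy or back pressured in the JobGraph.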




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
