alpinegizmo commented on a change in pull request #15811:
URL: https://github.com/apache/flink/pull/15811#discussion_r625045499
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
Review comment:
```suggestion
Keep this in mind if your job has a varying load. For example, a subtask
with a constant load of 50% and another subtask that is alternating every
second between fully loaded and idling will both have the same value
of `busyTimeMsPerSecond`: around `500ms`.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
Review comment:
```suggestion
These metrics are being updated every couple of seconds, and the reported
value represents the average time
that subtask was back pressured (or idle or busy) during the last couple of
seconds.
```
##########
File path: docs/content/docs/ops/state/checkpoints.md
##########
@@ -194,13 +190,10 @@ independent of the end-to-end latency. Be aware unaligned
checkpointing
adds to I/O to the state backends, so you shouldn't use it when the I/O to
the state backend is actually the bottleneck during checkpointing.
-Note that unaligned checkpoints is a brand-new feature that currently has the
+Note that unaligned checkpoints is a new feature that currently has the
Review comment:
```suggestion
Note that unaligned checkpointing is a new feature that currently has the
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
## Example
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the
raw values, tasks are
+also color codded to make the investigation easier.
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
-This means that the JobManager triggered a back pressure sample of the running
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy
tasks are colored red.
+All values in between are represented as shades between those three colors.
-Note that clicking the row, you trigger the sample for all subtasks of this
operator.
+### Back Pressure Status
-{{< img src="/fig/back_pressure_sampling_in_progress.png"
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed
metrics.
-### Back Pressure Status
+{{< img src="/fig/back_pressure_subtasks.png" class="img-responsive" >}}
-If you see status **OK** for the tasks, there is no indication of back
pressure. **HIGH** on the other hand means that the tasks are back pressured.
+If you see status **OK** for the subtasks, there is no indication of back
pressure. **HIGH** on the
+other hand means that the subtasks are back pressured. Where status is defined
in the following way:
-{{< img src="/fig/back_pressure_sampling_ok.png" class="img-responsive" >}}
+- **OK**: 0% <= back pressured <= 10%
+- **LOW**: 10% < back pressured <= 50%
+- **HIGH**: 50% < back pressured <= 100%
-{{< img src="/fig/back_pressure_sampling_high.png" class="img-responsive" >}}
+Additionally you can find the percentage each subtask is back pressured, idle
or busy.
Review comment:
```suggestion
Additionally, you can find the percentage of time each subtask is back
pressured, idle, or busy.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
## Example
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the
raw values, tasks are
+also color codded to make the investigation easier.
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
-This means that the JobManager triggered a back pressure sample of the running
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy
tasks are colored red.
+All values in between are represented as shades between those three colors.
-Note that clicking the row, you trigger the sample for all subtasks of this
operator.
+### Back Pressure Status
-{{< img src="/fig/back_pressure_sampling_in_progress.png"
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed
metrics.
-### Back Pressure Status
+{{< img src="/fig/back_pressure_subtasks.png" class="img-responsive" >}}
-If you see status **OK** for the tasks, there is no indication of back
pressure. **HIGH** on the other hand means that the tasks are back pressured.
+If you see status **OK** for the subtasks, there is no indication of back
pressure. **HIGH** on the
+other hand means that the subtasks are back pressured. Where status is defined
in the following way:
Review comment:
```suggestion
For subtasks whose status is **OK**, there is no indication of back
pressure. **HIGH**, on the
other hand, means that a subtask is back pressured. Status is defined in the
following way:
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
Review comment:
```suggestion
Idleness, on the other hand, is determined by whether or not there is input
available.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
Review comment:
```suggestion
If a task has no available output buffers, then that task is considered back
pressured.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
## Example
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the
raw values, tasks are
+also color codded to make the investigation easier.
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
-This means that the JobManager triggered a back pressure sample of the running
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy
tasks are colored red.
+All values in between are represented as shades between those three colors.
-Note that clicking the row, you trigger the sample for all subtasks of this
operator.
+### Back Pressure Status
-{{< img src="/fig/back_pressure_sampling_in_progress.png"
class="img-responsive" >}}
+In *Back Pressure* tab next to the job overview you can find more detailed
metrics.
Review comment:
```suggestion
In the *Back Pressure* tab next to the job overview you can find more
detailed metrics.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
## Example
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the
raw values, tasks are
+also color codded to make the investigation easier.
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
-This means that the JobManager triggered a back pressure sample of the running
tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue. Fully back pressured tasks are black, while 100% busy
tasks are colored red.
Review comment:
```suggestion
Idling tasks are blue, fully back pressured tasks are black, and fully busy
tasks are colored red.
```
##########
File path: docs/content/docs/ops/state/checkpoints.md
##########
@@ -219,10 +212,9 @@ state. To support rescaling, watermarks should be stored
per key-group in a
union-state. We most likely will implement this approach as a general solution
(didn't make it into Flink 1.11.0).
-In the upcoming release(s), Flink will address these limitations and will
-provide a fine-grained way to trigger unaligned checkpoints only for the
-in-flight data that moves slowly with timeout mechanism. These options will
-decrease the pressure on I/O in the state backends and eventually allow
-unaligned checkpoints to become the default checkpointing.
+After enabling unaligned checkpoints, you can also specify the alignment
timeout via
+`CheckpointConfig.setAlignmentTimeout(Duration)` or
`execution.checkpointing.alignment-timeout` via
+the configuration file. When activate, checkpoints will start as aligned, but
if the alignment for
+some task exceeds this timeout, this checkpoint will time out to unaligned
checkpoint.
Review comment:
```suggestion
`CheckpointConfig.setAlignmentTimeout(Duration)` or
`execution.checkpointing.alignment-timeout` in
the configuration file. When activated, each checkpoint will still begin as
an aligned checkpoint, but if the alignment for some task exceeds this timeout,
then the checkpoint will proceed as an unaligned checkpoint.
```
##########
File path: docs/content/docs/ops/monitoring/back_pressure.md
##########
@@ -37,50 +37,47 @@ If you see a **back pressure warning** (e.g. `High`) for a
task, this means that
Take a simple `Source -> Sink` job as an example. If you see a warning for
`Source`, this means that `Sink` is consuming data slower than `Source` is
producing. `Sink` is back pressuring the upstream operator `Source`.
-## Sampling Back Pressure
+## Task performance metrics
-Back pressure monitoring works by repeatedly taking back pressure samples of
your running tasks. The JobManager triggers repeated calls to
`Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three
metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to
process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!--
https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing
-->
-
-Internally, back pressure is judged based on the availability of output
buffers. If there is no available buffer (at least one) for output, then it
indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in
order to determine back pressure. The ratio you see in the web interface tells
you how many of these samples were indicating back pressure, e.g. `0.01`
indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web
interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following
configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are
deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back
pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine
back pressure (DEFAULT: 50, 50 ms).
+Those metrics are being updated every couple of seconds and the reported value
presents an average time
+that subtask has been back pressured (or idle or busy) in that last couple of
seconds.
+Keep this in mind if your job has a varying load. Both, a subtask that has a
constant load of 50% and a
+subtask that is alternating every second between fully loaded and idling, will
have the same value
+of `busyTimeMsPerSecond` around `500ms`.
+Internally, back pressure is judged based on the availability of output
buffers.
+If there is no available buffer (at least one) for output, then it indicates
that there is back pressure for the task.
+Idleness is judged on the other hand on the availability of the task's input.
## Example
-You can find the *Back Pressure* tab next to the job overview.
+WebUI is aggregating maximal value of back pressure and busy metrics from all
of the subtasks and is
+presenting those aggregated values inside the JobGraph. Besides displaying the
raw values, tasks are
+also color codded to make the investigation easier.
Review comment:
```suggestion
The WebUI aggregates the maximum value of the back pressure and busy metrics
from all of the subtasks and presents those aggregated values inside the
JobGraph. Besides displaying the raw values, tasks are
also color-coded to make the investigation easier.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]