GitHub user uce opened a pull request:
https://github.com/apache/flink/pull/1453
[FLINK-3131] Expose checkpoint metrics
- Adds `long getStateSize()` to `StateHandle` and `KvStateSnapshot`.
Everything except test classes and `LazyDbKvState` implements this.
`LazyDbKvState` could implement it correctly, but its state is currently
serialized lazily, so the state size is not known when the state handle is
created (it is currently reported as 0).
- Adds simple statistics tracking to the checkpoint coordinator. This does
not use the accumulators, because I wanted more fine-grained control. I think
we can expand the system-internal accumulators to accommodate these use cases
better. It is also possible to retrofit this onto the accumulators, if you want
to.
- Adds the following web runtime monitor handlers:
* `/jobs/:jobid/checkpoints` for the job's completed checkpoint statistics,
including the history
* `/jobs/:jobid/vertices/:vertexid/checkpoints` for per-operator
statistics, including subtasks
- Adds the web frontend HTML/JavaScript (screenshots below)
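To illustrate the `getStateSize()` addition from the first bullet, here is a minimal sketch. The interface and classes below (`SizedStateHandle`, `ByteArrayHandle`, `LazyHandle`, `StateSizeDemo`) are hypothetical, simplified stand-ins and not the actual Flink types; they only show how an eagerly serialized handle can report its size while a lazily serialized one has to report 0 for now.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a state handle; not Flink's actual interface.
interface SizedStateHandle<T> {
    T getState();

    // Size of the serialized state in bytes.
    long getStateSize();
}

// Eagerly serialized state: the size is known when the handle is created.
final class ByteArrayHandle implements SizedStateHandle<byte[]> {
    private final byte[] data;

    ByteArrayHandle(byte[] data) {
        this.data = data;
    }

    @Override
    public byte[] getState() {
        return data;
    }

    @Override
    public long getStateSize() {
        return data.length;
    }
}

// Lazily serialized state: nothing has been serialized yet, so the handle reports 0.
final class LazyHandle implements SizedStateHandle<String> {
    private final String value;

    LazyHandle(String value) {
        this.value = value;
    }

    @Override
    public String getState() {
        return value;
    }

    @Override
    public long getStateSize() {
        return 0L;
    }
}

final class StateSizeDemo {
    // Sums the sizes reported by all handles of one checkpoint.
    static long totalSize(List<SizedStateHandle<?>> handles) {
        long total = 0L;
        for (SizedStateHandle<?> handle : handles) {
            total += handle.getStateSize();
        }
        return total;
    }

    public static void main(String[] args) {
        List<SizedStateHandle<?>> handles = new ArrayList<>();
        handles.add(new ByteArrayHandle(new byte[5]));
        handles.add(new LazyHandle("not serialized yet"));
        System.out.println("Total state size in bytes: " + totalSize(handles));
    }
}
```

In this sketch the lazy handle contributes nothing to the total, which is why the reported size under-counts whenever lazily serialized state is involved.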
This feature can be disabled via `jobmanager.web.checkpoints.disable`. I
think this is good practice, because it is attached to one of the most critical
parts of the system.
The maximum history size (see screenshot) for job level statistics can be
configured via `jobmanager.web.checkpoints.history`. Current default is 10.
Maybe a little too high?
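As a rough sketch of what a bounded history means for memory and for the reported statistics, the following shows one way to keep only the most recent completed checkpoints. The class `CheckpointHistory` and its fields are hypothetical and are not the tracker added in this pull request; `maxSize` plays the role of the value configured via `jobmanager.web.checkpoints.history`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a bounded checkpoint history; not the PR's tracker class.
final class CheckpointHistory {

    // One completed checkpoint: ID, duration in milliseconds, state size in bytes.
    static final class Entry {
        final long checkpointId;
        final long durationMillis;
        final long stateSize;

        Entry(long checkpointId, long durationMillis, long stateSize) {
            this.checkpointId = checkpointId;
            this.durationMillis = durationMillis;
            this.stateSize = stateSize;
        }
    }

    private final int maxSize;
    private final Deque<Entry> entries = new ArrayDeque<>();

    CheckpointHistory(int maxSize) {
        if (maxSize < 0) {
            throw new IllegalArgumentException("maxSize must not be negative");
        }
        this.maxSize = maxSize;
    }

    // Adds a completed checkpoint and evicts the oldest entry once the history is full.
    synchronized void add(Entry entry) {
        if (maxSize == 0) {
            return; // a history size of 0 keeps nothing
        }
        if (entries.size() == maxSize) {
            entries.removeFirst();
        }
        entries.addLast(entry);
    }

    // Returns a copy, ordered oldest to newest, that is safe to hand to a web handler.
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(entries);
    }

    // Average state size over the retained entries, or 0.0 if the history is empty.
    synchronized double averageStateSize() {
        if (entries.isEmpty()) {
            return 0.0;
        }
        long sum = 0L;
        for (Entry entry : entries) {
            sum += entry.stateSize;
        }
        return sum / (double) entries.size();
    }

    public static void main(String[] args) {
        CheckpointHistory history = new CheckpointHistory(2);
        history.add(new Entry(1, 120, 10));
        history.add(new Entry(2, 110, 20));
        history.add(new Entry(3, 130, 30));
        System.out.println("Retained entries: " + history.snapshot().size());
        System.out.println("Average state size: " + history.averageStateSize());
    }
}
```

With a structure like this, the cost of a larger history is one small entry per retained checkpoint, so the choice of default is mostly a question of how much the web frontend should display.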
---
- **Checkpoints Tab** (Overview and Operators):

- **History** (configurable):

- **Subtasks**:

- **Terminated job**:

Jobs without checkpoints just show `No checkpoints` currently.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/uce/flink 3131-checkpoint_metrics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1453.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1453
----
commit aa12f3c7bb6ac43b91d5926087d7c181958c95cb
Author: Ufuk Celebi <[email protected]>
Date: 2015-12-14T18:40:10Z
[FLINK-3131] [contrib, runtime, streaming-java] Add long getStateSize() to
StateHandle and KvStateSnapshot
In order to report the state sizes, we need to expose them. All currently
available state backends know the state size. Only the LazyDbKvState does
not expose it at the moment, because it serializes the data lazily. This
can be changed in a follow-up fix.
commit 2dae2a8ee98ca08cba4925f15110f1d9de2c1831
Author: Ufuk Celebi <[email protected]>
Date: 2015-12-14T19:12:59Z
[FLINK-3131] [core, runtime] Add checkpoint statistics tracker
Adds a simple tracker of checkpoint statistics.
commit 53feb2a1a008f08218d05b91af4853ad18574fa2
Author: Ufuk Celebi <[email protected]>
Date: 2015-12-14T19:13:59Z
[FLINK-3131] [runtime-web] Add checkpoint statistics handlers
commit 47f89d5d24ae2fb6c314205531d696b985acb508
Author: Ufuk Celebi <[email protected]>
Date: 2015-12-14T19:48:03Z
[FLINK-3131] [runtime-web] Add checkpoint statistics to web frontend
----