GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1453

    [FLINK-3131] Expose checkpoint metrics

    - Adds `long getStateSize()` to `StateHandle` and `KvStateSnapshot`. 
Everything except test classes and `LazyDbKvState` implements this. 
`LazyDbKvState` could implement it correctly, but its state is currently 
serialized lazily, so the state size is not known when the state handle is 
created (it is currently reported as 0).
    
    - Adds simple statistics tracking to the checkpoint coordinator. This 
does not use the accumulators, because I wanted more fine-grained control. I 
think we can expand the system-internal accumulators to accommodate these use 
cases better. It is also possible to retrofit this onto the accumulators, if 
you want.
    - Adds the following web runtime monitor handlers:
      * `/jobs/:jobid/checkpoints` for job-level statistics on completed 
checkpoints, including the history
      * `/jobs/:jobid/vertices/:vertexid/checkpoints` for per-operator 
statistics, including subtasks
    
    - Adds the web frontend HTML/JavaScript (screenshots below)
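
    As a rough illustration of the `getStateSize()` change (not the code in 
this PR), here is a minimal Java sketch of handles that report their size. 
Only the method name `getStateSize()` comes from the description above; 
`SizedStateHandle`, `ByteArrayHandle`, `LazyHandle` and `totalStateSize` are 
hypothetical names, and the lazy handle returns 0 only to mirror the current 
`LazyDbKvState` behavior.

```java
import java.util.Arrays;
import java.util.List;

public class StateSizeSketch {

    /** Hypothetical, simplified stand-in for a state handle that reports its size. */
    public interface SizedStateHandle {
        /** Size of the state in bytes, or 0 if the size is not known. */
        long getStateSize();
    }

    /** Handle over already-serialized bytes: the size is known at creation time. */
    public static final class ByteArrayHandle implements SizedStateHandle {
        private final byte[] serialized;

        public ByteArrayHandle(byte[] serialized) {
            this.serialized = serialized;
        }

        @Override
        public long getStateSize() {
            return serialized.length;
        }
    }

    /** Handle whose state is serialized lazily: the size is unknown, so it reports 0. */
    public static final class LazyHandle implements SizedStateHandle {
        @Override
        public long getStateSize() {
            return 0L;
        }
    }

    /** Sums the reported sizes, e.g. to obtain the state size of one checkpoint. */
    public static long totalStateSize(List<? extends SizedStateHandle> handles) {
        long total = 0L;
        for (SizedStateHandle handle : handles) {
            total += handle.getStateSize();
        }
        return total;
    }

    public static void main(String[] args) {
        List<SizedStateHandle> handles = Arrays.asList(
                new ByteArrayHandle(new byte[32]),
                new ByteArrayHandle(new byte[16]),
                new LazyHandle());
        // prints "total state size: 48 bytes"
        System.out.println("total state size: " + totalStateSize(handles) + " bytes");
    }
}
```

    Note that a handle reporting 0 makes any aggregated size a lower bound 
rather than an exact value.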
    
    This feature can be disabled via `jobmanager.web.checkpoints.disable`. I 
think an off switch is good practice here, because the tracking is attached to 
one of the most critical parts of the system.
    
    The maximum history size (see screenshot) for job-level statistics can be 
configured via `jobmanager.web.checkpoints.history`. The current default is 10. 
Maybe a little too high?
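
    To make the bounded history concrete, here is a minimal Java sketch of a 
tracker that keeps running aggregates plus the last N completed checkpoints. 
It is an illustration under assumptions, not the tracker added in this PR: the 
class `CheckpointStatsSketch`, the record `CompletedCheckpoint`, and the choice 
of aggregates (count, average duration, maximum state size) are all 
hypothetical. The constructor argument plays the role of the value configured 
via `jobmanager.web.checkpoints.history`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CheckpointStatsSketch {

    /** Hypothetical record of one completed checkpoint. */
    public static final class CompletedCheckpoint {
        public final long id;
        public final long durationMillis;
        public final long stateSizeBytes;

        public CompletedCheckpoint(long id, long durationMillis, long stateSizeBytes) {
            this.id = id;
            this.durationMillis = durationMillis;
            this.stateSizeBytes = stateSizeBytes;
        }
    }

    private final int maxHistorySize;
    private final Deque<CompletedCheckpoint> history = new ArrayDeque<>();
    private long count;
    private long sumDurationMillis;
    private long maxStateSizeBytes;

    public CheckpointStatsSketch(int maxHistorySize) {
        if (maxHistorySize < 0) {
            throw new IllegalArgumentException("history size must be >= 0");
        }
        this.maxHistorySize = maxHistorySize;
    }

    /** Updates the aggregates and appends to the history, evicting the oldest entry if full. */
    public synchronized void onCompletedCheckpoint(CompletedCheckpoint checkpoint) {
        count++;
        sumDurationMillis += checkpoint.durationMillis;
        maxStateSizeBytes = Math.max(maxStateSizeBytes, checkpoint.stateSizeBytes);
        if (maxHistorySize > 0) {
            if (history.size() == maxHistorySize) {
                history.removeFirst();
            }
            history.addLast(checkpoint);
        }
    }

    public synchronized long getCount() {
        return count;
    }

    public synchronized long getAverageDurationMillis() {
        return count == 0 ? 0L : sumDurationMillis / count;
    }

    public synchronized long getMaxStateSizeBytes() {
        return maxStateSizeBytes;
    }

    /** Returns a copy of the history, oldest entry first. */
    public synchronized List<CompletedCheckpoint> getHistory() {
        return new ArrayList<>(history);
    }

    public static void main(String[] args) {
        CheckpointStatsSketch tracker = new CheckpointStatsSketch(2);
        tracker.onCompletedCheckpoint(new CompletedCheckpoint(1L, 10L, 100L));
        tracker.onCompletedCheckpoint(new CompletedCheckpoint(2L, 20L, 200L));
        tracker.onCompletedCheckpoint(new CompletedCheckpoint(3L, 30L, 300L));
        // The aggregates cover all three checkpoints; the history keeps only the last two.
        System.out.println(tracker.getCount() + " checkpoints, history size "
                + tracker.getHistory().size());
    }
}
```

    In this sketch the aggregates are unaffected by the history size, so a 
smaller history only limits how many individual checkpoints can be listed.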
    
    ---
    
    - **Checkpoints Tab** (Overview and Operators): 
    ![screen shot 2015-12-15 at 00 45 41](https://cloud.githubusercontent.com/assets/1756620/11797953/e4f17c84-a2c6-11e5-86b1-040a4e1bff12.png)
    - **History** (configurable):
    ![screen shot 2015-12-15 at 00 45 51](https://cloud.githubusercontent.com/assets/1756620/11797957/f2f87940-a2c6-11e5-82ce-5c5fcf8b1ca1.png)
    - **Subtasks**: 
    ![screen shot 2015-12-15 at 00 46 08](https://cloud.githubusercontent.com/assets/1756620/11797963/0d105fd2-a2c7-11e5-9a90-458bd0b7fdc4.png)
    - **Terminated job**: 
    ![screen shot 2015-12-15 at 00 46 44](https://cloud.githubusercontent.com/assets/1756620/11797969/1b0999a0-a2c7-11e5-9826-723f12e997d9.png)
    
    Jobs without checkpoints currently just show `No checkpoints`.
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3131-checkpoint_metrics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1453
    
----
commit aa12f3c7bb6ac43b91d5926087d7c181958c95cb
Author: Ufuk Celebi <[email protected]>
Date:   2015-12-14T18:40:10Z

    [FLINK-3131] [contrib, runtime, streaming-java] Add long getStateSize() to 
StateHandle and KvStateSnapshot
    
    In order to report the state sizes, we need to expose them. All currently
    available state backends know the state size. Only the LazyDbKvState does
    not expose it at the moment, because it serializes the data lazily. This can
    be changed in a follow-up fix.

commit 2dae2a8ee98ca08cba4925f15110f1d9de2c1831
Author: Ufuk Celebi <[email protected]>
Date:   2015-12-14T19:12:59Z

    [FLINK-3131] [core, runtime] Add checkpoint statistics tracker
    
    Adds a simple tracker of checkpoint statistics.

commit 53feb2a1a008f08218d05b91af4853ad18574fa2
Author: Ufuk Celebi <[email protected]>
Date:   2015-12-14T19:13:59Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics handlers

commit 47f89d5d24ae2fb6c314205531d696b985acb508
Author: Ufuk Celebi <[email protected]>
Date:   2015-12-14T19:48:03Z

    [FLINK-3131] [runtime-web] Add checkpoint statistics to web frontend

----


