GitHub user zentol opened a pull request:
https://github.com/apache/flink/pull/5161
7608
This PR is an extension of #4826.
The additions address issues that I found while trying it out. Some of
these unfortunately may be rather controversial, hence why I'm curious what
others think.
The first addition is to register the metric on the job level with
user-defined variables for proper integration with tag-based reporters. As
described in #4826 the latency metric doesn't fit as a operator/task metric
since it describes a relation between 2 operators.
The second problem i found was that the latency marker uses
`StreamConfig#getVertexID` to retrieve the source/operator vertexID. This is
however not the actual runtime ID, but just a counter that is incremented when
the job is generated, and thus utterly pointless and impossible to correlate
with a job's operators. This requires serializing 2 longs instead of an int
along with creating an `OperatorID` during deserialization, making the latency
measuring more costly.
The third and last change that I made is removing the case distinction
between operators and sinks. Previously, operators would ignore the source
subtask index and tread all subtasks equally. Only sinks measured latency
separately for the source subtask.
Despite the performance benefits I find this design rather questionable:
* latency metric names for operators&sinks are inconsistent, and there is
in fact no way to tell whether something is an operator or sink without access
to the job source code which means to query the metric you have to resort to
trial-and-error
* there are operators with sink semantics (like the write-ahead sink), so
we aren't consistent in terms of "all sinks have detailed latency metrics" and
are leaking implementation details to the outside
* considering sink latency metrics as more important is rather arbitrary
and may very well not fit a user's requirements
Instead, we may want to think about is adding a separate switch for details
latency metrics.
/cc @aljoscha @rmetzger @yew1eb
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zentol/flink 7608
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/5161.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5161
----
commit b39227d270dc1bbfb7b420796428e67fd421b969
Author: yew1eb <[email protected]>
Date: 2017-10-14T09:19:49Z
[FLINK-7608][metrics] Rework latency metric
commit 7fc82e809b1c68784aa80c29e50e36c5b13917ff
Author: zentol <[email protected]>
Date: 2017-12-12T14:44:39Z
Additions
----
---