GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/5161

    7608

    This PR is an extension of #4826.
    
    The additions address issues that I found while trying it out. Some of 
these unfortunately may be rather controversial, hence why I'm curious what 
others think.
    
    The first addition is to register the metric on the job level with 
user-defined variables for proper integration with tag-based reporters. As 
described in #4826 the latency metric doesn't fit as a operator/task metric 
since it describes a relation between 2 operators.
    
    The second problem i found was that the latency marker uses 
`StreamConfig#getVertexID` to retrieve the source/operator vertexID. This is 
however not the actual runtime ID, but just a counter that is incremented when 
the job is generated, and thus utterly pointless and impossible to correlate 
with a job's operators. This requires serializing 2 longs instead of an int 
along with creating an `OperatorID` during deserialization, making the latency 
measuring more costly.
    
    The third and last change that I made is removing the case distinction 
between operators and sinks. Previously, operators would ignore the source 
subtask index and tread all subtasks equally. Only sinks measured latency 
separately for the source subtask.
    Despite the performance benefits I find this design rather questionable:
    * latency metric names for operators&sinks are inconsistent, and there is 
in fact no way to tell whether something is an operator or sink without access 
to the job source code which means to query the metric you have to resort to 
trial-and-error
    * there are operators with sink semantics (like the write-ahead sink), so 
we aren't consistent in terms of "all sinks have detailed latency metrics" and 
are leaking implementation details to the outside
    * considering sink latency metrics as more important is rather arbitrary 
and may very well not fit a user's requirements
    Instead, we may want to think about is adding a separate switch for details 
latency metrics.
    
    /cc @aljoscha @rmetzger @yew1eb 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 7608

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5161
    
----
commit b39227d270dc1bbfb7b420796428e67fd421b969
Author: yew1eb <[email protected]>
Date:   2017-10-14T09:19:49Z

    [FLINK-7608][metrics] Rework latency metric

commit 7fc82e809b1c68784aa80c29e50e36c5b13917ff
Author: zentol <[email protected]>
Date:   2017-12-12T14:44:39Z

    Additions

----


---

Reply via email to