[ 
https://issues.apache.org/jira/browse/FLINK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593595#comment-16593595
 ] 

Chesnay Schepler commented on FLINK-10226:
------------------------------------------

echoing my post from the ML:

There are a few separate issues in here that we should tackle/investigate in 
parallel.


       Improve storage of latency metrics

Given how absurdly the number of latency metrics scales with the number of 
operators and the parallelism, it makes sense to introduce a special case here. 
I'm not quite sure yet how easily this can be done though, as this isn't just 
about storage but also about transmission from the TM to the JM, which is just 
as inefficient as the storage.
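
To make the special case a bit more concrete, here is a rough sketch of what a 
compact, latency-specific store could look like: keyed by (source ID, operator 
ID, subtask index) with one fixed-size array of statistics per histogram, 
instead of one string-named entry per statistic. LatencyMetricStore and its 
methods are hypothetical names, not existing Flink classes.

{code:java}
import java.io.Serializable;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical compact store for latency metrics, keyed by (sourceId, operatorId, subtask). */
public class LatencyMetricStore implements Serializable {

    /** Small, reusable key; avoids one long metric-name string per statistic. */
    static final class Key implements Serializable {
        final String sourceId;
        final String operatorId;
        final int subtaskIndex;

        Key(String sourceId, String operatorId, int subtaskIndex) {
            this.sourceId = sourceId;
            this.operatorId = operatorId;
            this.subtaskIndex = subtaskIndex;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return subtaskIndex == k.subtaskIndex
                    && sourceId.equals(k.sourceId)
                    && operatorId.equals(k.operatorId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(sourceId, operatorId, subtaskIndex);
        }
    }

    /** One fixed-size double[] per histogram (e.g. min, max, mean, percentiles). */
    private final Map<Key, double[]> latencies = new ConcurrentHashMap<>();

    public void update(String sourceId, String operatorId, int subtaskIndex, double[] stats) {
        latencies.put(new Key(sourceId, operatorId, subtaskIndex), stats);
    }

    public double[] get(String sourceId, String operatorId, int subtaskIndex) {
        return latencies.get(new Key(sourceId, operatorId, subtaskIndex));
    }
}
{code}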


       Configurable granularity for latency metrics

The main reason for the scaling issue above is that we track latency from each 
operator subtask to each source subtask. If we only accounted for the source ID 
instead, we would significantly reduce the number of metrics while still 
providing some insight into latency.
Here's a comparison of the number of individual data points in the MetricStore, 
for 2 sources, 6 subsequent operators, parallelism=100:

Current: 1,320,000
SourceID-only: 13,200
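
The arithmetic behind these numbers, assuming 11 data points per latency 
histogram (min, max, mean, median, stddev and a handful of percentiles), works 
out roughly like this:

{code:java}
// Back-of-the-envelope check of the numbers above (assumes 11 data points per
// latency histogram; adjust statsPerHistogram if the dump carries more/fewer).
public class LatencyMetricCount {
    public static void main(String[] args) {
        int sources = 2;
        int operators = 6;          // subsequent operators
        int parallelism = 100;
        int statsPerHistogram = 11; // assumption, see comment above

        // Current: one histogram per (source subtask, operator subtask) pair.
        long current = (long) sources * parallelism * operators * parallelism * statsPerHistogram;

        // SourceID-only: one histogram per (source, operator subtask) pair.
        long sourceIdOnly = (long) sources * operators * parallelism * statsPerHistogram;

        System.out.println("current:       " + current);       // 1,320,000
        System.out.println("sourceId-only: " + sourceIdOnly);  // 13,200
    }
}
{code}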


       Separate dispatcher thread-pool from REST API / metrics

We currently use the same thread-pool for inserting metrics and processing REST 
requests that is also used for the Dispatcher RPC, i.e., intra-cluster 
communication. To better isolate the Dispatcher we should give both components 
their own thread-pools to prevent such worst-case scenarios in the future.
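
A minimal sketch of that isolation, with hypothetical pool names and sizes (the 
actual wiring in the Dispatcher / WebMonitor would of course look different):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: give REST/metric handling its own executor instead of sharing the RPC pool. */
public class ThreadPoolIsolationSketch {
    public static void main(String[] args) {
        // Pool used for Dispatcher RPC / intra-cluster communication (hypothetical sizing).
        ExecutorService dispatcherRpcExecutor = Executors.newFixedThreadPool(4);

        // Dedicated pool for REST requests and metric-dump processing, so a flood of
        // latency metrics cannot starve intra-cluster communication.
        ExecutorService restAndMetricsExecutor = Executors.newFixedThreadPool(4);

        restAndMetricsExecutor.execute(() -> {
            // e.g. deserialize a metric dump and insert it into the MetricStore
        });

        dispatcherRpcExecutor.execute(() -> {
            // e.g. handle a job submission or a TaskManager heartbeat
        });

        restAndMetricsExecutor.shutdown();
        dispatcherRpcExecutor.shutdown();
    }
}
{code}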


       Find the bottleneck

I've run some preliminary benchmarks and the MetricStore itself appears to be 
fast enough
to handle these loads, so the search continues... 
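
A crude stand-in for the insert path (a string-keyed put loop into a concurrent 
map, not the real MetricStore API) that anyone can use to sanity-check the 
storage side looks roughly like this:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Crude approximation of the MetricStore insert path: string-keyed puts into a concurrent map. */
public class MetricInsertBenchmark {
    public static void main(String[] args) {
        final int dataPoints = 1_320_000; // same order of magnitude as the example above
        ConcurrentMap<String, String> store = new ConcurrentHashMap<>(dataPoints * 2);

        long start = System.nanoTime();
        for (int i = 0; i < dataPoints; i++) {
            store.put("taskmanager.job.operator." + i + ".latency.p99", Integer.toString(i));
        }
        long millis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("inserted " + store.size() + " entries in " + millis + " ms");
    }
}
{code}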

> Latency metrics can choke job-manager
> -------------------------------------
>
>                 Key: FLINK-10226
>                 URL: https://issues.apache.org/jira/browse/FLINK-10226
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: Jozef Vilcek
>            Priority: Major
>
> With Flink 1.5.0 my Apache Beam job was not runnable unless I turned off the 
> latencyTracking feature. The job generated a huge amount of latency metrics and 
> histogram aggregates, and updating them occupied the job-manager so much that 
> the cluster fell apart.
> This was discussed on the mailing list:
> [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-cluster-crashing-going-from-1-4-0-gt-1-5-3-td23941.html]
> The purpose of this ticket is to reason about how to improve this and on which 
> end. I am currently not sure what the root cause is:
> a) the Beam-to-Flink translation generates too many "noise operators"
> b) Flink does not handle latencyTracking well for large jobs 
>  


