[ 
https://issues.apache.org/jira/browse/FLINK-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553247#comment-16553247
 ] 

Gary Yao commented on FLINK-9159:
---------------------------------

[~till.rohrmann]

Below are the config keys that I looked at, together with their default values.

||Config Key||Default Value||
|slotmanager.request-timeout|10 m|
|slotmanager.taskmanager-timeout|30 s|
|slot.request.timeout|5 m|
|slot.idle.timeout|50 s|
|taskmanager.registration.timeout|5 m|
|mesos.failover-timeout|10 m|
|resourcemanager.job.timeout|5 m|
|heartbeat.timeout|50 s|
|heartbeat.interval|10 s|

Recommendations: 

The default value for mesos.failover-timeout (10 m) is too low. The value 
specifies the _"amount of time (in seconds) that the master will wait for the 
scheduler to failover before it tears down the framework by killing all its 
tasks/executors."_ For production systems, the recommended value is 1 week.
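
To make the gap concrete, here is a small sketch that converts both durations 
into seconds. It assumes the key takes a plain number of seconds, as the quoted 
description suggests; the class and method names are made up for illustration.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink or Mesos.
class MesosFailoverTimeout {

    // Converts a duration into the whole-second value that a
    // seconds-based config key would expect (assumption: unit is seconds).
    static long toConfigSeconds(Duration d) {
        return d.getSeconds();
    }

    public static void main(String[] args) {
        long current = toConfigSeconds(Duration.ofMinutes(10)); // current default: 10 m
        long recommended = toConfigSeconds(Duration.ofDays(7)); // recommended: 1 week
        System.out.println("current default: " + current + " s");
        System.out.println("recommended:     " + recommended + " s");
    }
}
```

With these inputs, 10 minutes is 600 s and 1 week is 604800 s.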

slotmanager.request-timeout and slot.request.timeout apply to the same slot 
request, so effectively the smaller of the two values is used. One of them 
should be removed, or at least both should be set to the same value.
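
The following sketch only models the observation above, namely that two 
independent timeouts on the same request behave like a single timeout equal to 
the smaller one. It is not the Flink implementation; the class and method names 
are hypothetical.

```java
import java.time.Duration;

// Hypothetical model of two timeouts guarding the same request.
class EffectiveSlotRequestTimeout {

    // Whichever timeout fires first fails the request, so the effective
    // timeout is the minimum of the two configured values.
    static Duration effective(Duration slotManagerRequestTimeout,
                              Duration slotRequestTimeout) {
        return slotManagerRequestTimeout.compareTo(slotRequestTimeout) <= 0
                ? slotManagerRequestTimeout
                : slotRequestTimeout;
    }

    public static void main(String[] args) {
        Duration slotManager = Duration.ofMinutes(10); // slotmanager.request-timeout default
        Duration slotPool = Duration.ofMinutes(5);     // slot.request.timeout default
        System.out.println("effective timeout: "
                + effective(slotManager, slotPool).toMinutes() + " m");
    }
}
```

With the defaults from the table, the 10 m value never takes effect because the 
5 m timeout always fires first.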

Some of the timeouts, e.g., slotmanager.taskmanager-timeout, are measured using 
{{System.currentTimeMillis()}}, which follows the system clock and is not 
monotonic. If the stars align and the clock is adjusted at the wrong moment, 
e.g., during DST clock changes, this can lead to resources not being freed.
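
One possible way to avoid this is to measure elapsed time with a monotonic 
source such as {{System.nanoTime()}}. The sketch below is an illustration under 
that assumption, not the code in Flink; the class {{IdleTimeoutCheck}} and its 
methods are hypothetical, and the clock is injected so that it can be tested.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical idle-timeout tracker based on a monotonic nanosecond clock.
class IdleTimeoutCheck {

    private final LongSupplier nanoClock; // e.g. System::nanoTime
    private final long timeoutNanos;
    private long lastActivityNanos;

    IdleTimeoutCheck(LongSupplier nanoClock, long timeoutMillis) {
        this.nanoClock = nanoClock;
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        this.lastActivityNanos = nanoClock.getAsLong();
    }

    // Records activity, restarting the idle period.
    void markActive() {
        lastActivityNanos = nanoClock.getAsLong();
    }

    // Compares a difference of two readings, so wall-clock adjustments
    // (and numeric overflow of nanoTime) do not affect the result.
    boolean isTimedOut() {
        return nanoClock.getAsLong() - lastActivityNanos >= timeoutNanos;
    }

    public static void main(String[] args) {
        // 30 s mirrors the slotmanager.taskmanager-timeout default in the table.
        IdleTimeoutCheck check = new IdleTimeoutCheck(System::nanoTime, 30_000L);
        System.out.println("timed out right after creation: " + check.isTimedOut());
    }
}
```

Only the difference between two readings is used, so it does not matter what 
absolute value the clock reports or whether the system time is changed in 
between.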

> Sanity check default timeout values
> -----------------------------------
>
>                 Key: FLINK-9159
>                 URL: https://issues.apache.org/jira/browse/FLINK-9159
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.2, 1.6.0
>
>
> Check that the default timeout values for resource release are sanely chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
