[ 
https://issues.apache.org/jira/browse/FLINK-9159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553247#comment-16553247
 ] 

Gary Yao edited comment on FLINK-9159 at 7/23/18 6:41 PM:
----------------------------------------------------------

[~till.rohrmann]

Below are the config keys that I looked at, together with their default values.

||Config Key||Default Value||
|slotmanager.request-timeout|10 m|
|slotmanager.taskmanager-timeout|30 s|
|slot.request.timeout|5 m|
|slot.idle.timeout|50 s|
|taskmanager.registration.timeout|5 m|
|mesos.failover-timeout|10 m|
|resourcemanager.job.timeout|5 m|
|heartbeat.timeout|50 s|
|heartbeat.interval|10 s|

Recommendations: 

The value for {{mesos.failover-timeout}} is too low. The value specifies the 
_"amount of time (in seconds) that the master will wait for the scheduler to 
failover before it tears down the framework by killing all its 
tasks/executors."_ For production systems, the recommended value is 1 week.
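
For reference, here is the arithmetic behind that recommendation as a small Java sketch. It assumes the key is given in whole seconds, as the quoted description suggests. The class and method names are made up for illustration and are not part of Flink or Mesos.

```java
import java.time.Duration;

/** Illustrative helper only; not part of Flink or Mesos. */
final class MesosFailoverTimeout {

    /** Converts a duration to whole seconds, the unit named in the quoted description. */
    static long toConfigSeconds(Duration timeout) {
        return timeout.getSeconds();
    }

    public static void main(String[] args) {
        // Current default from the table above: 10 minutes.
        System.out.println(toConfigSeconds(Duration.ofMinutes(10))); // 600
        // Recommended production value: 1 week.
        System.out.println(toConfigSeconds(Duration.ofDays(7)));     // 604800
    }
}
```

Under that assumption, moving from the default to the recommended value means going from 600 to 604800.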

Of {{slotmanager.request-timeout}} and {{slot.request.timeout}}, effectively 
only the smaller value takes effect, because whichever timeout fires first 
fails the request. One of them should be removed, or at least both should be 
set to the same value.
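
To make the interaction concrete, here is a minimal Java sketch of the effective timeout, assuming the two timers run independently and the first one to fire wins. It does not use any Flink classes, and the names are hypothetical.

```java
import java.time.Duration;

/** Illustrative only; this is not Flink code. */
final class EffectiveSlotRequestTimeout {

    /** Returns the shorter of the two timeouts, i.e. the one that fires first. */
    static Duration effective(Duration slotManagerRequestTimeout, Duration slotRequestTimeout) {
        return slotManagerRequestTimeout.compareTo(slotRequestTimeout) <= 0
                ? slotManagerRequestTimeout
                : slotRequestTimeout;
    }

    public static void main(String[] args) {
        Duration slotManagerRequestTimeout = Duration.ofMinutes(10); // slotmanager.request-timeout default
        Duration slotRequestTimeout = Duration.ofMinutes(5);         // slot.request.timeout default
        // With the defaults from the table, the 10 m value never takes effect.
        System.out.println(effective(slotManagerRequestTimeout, slotRequestTimeout)); // PT5M
    }
}
```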

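Some of the timeouts, e.g., {{slotmanager.taskmanager-timeout}}, are measured 
using {{System.currentTimeMillis()}}, which is wall-clock time and not 
monotonic. If the stars align, e.g., the system clock is adjusted at the wrong 
moment (such as around DST clock changes), this can lead to resources not being 
freed. 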
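
A rough Java sketch of the failure mode and one possible remedy, a monotonic clock based on {{System.nanoTime()}}, follows. The {{IdleTimeoutCheck}} class is hypothetical and only mimics an idle-timeout check; it is not how the SlotManager is actually implemented.

```java
import java.util.function.LongSupplier;

/** Hypothetical idle tracker for illustration; not Flink's SlotManager. */
final class IdleTimeoutCheck {
    private final LongSupplier clockMillis;
    private final long timeoutMillis;
    private final long idleSinceMillis;

    IdleTimeoutCheck(LongSupplier clockMillis, long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.idleSinceMillis = clockMillis.getAsLong();
    }

    /** True once the clock has advanced by at least the timeout since creation. */
    boolean isTimedOut() {
        return clockMillis.getAsLong() - idleSinceMillis >= timeoutMillis;
    }

    /** Millisecond clock derived from System.nanoTime(), unaffected by wall-clock changes. */
    static LongSupplier monotonicClock() {
        return () -> System.nanoTime() / 1_000_000L;
    }

    public static void main(String[] args) {
        // Simulated wall clock: 40 s elapse, but the clock is also set back by one hour.
        long[] wall = {1_000_000L};
        IdleTimeoutCheck wallCheck = new IdleTimeoutCheck(() -> wall[0], 30_000L);
        wall[0] += 40_000L - 3_600_000L;
        System.out.println(wallCheck.isTimedOut()); // false: the 30 s timeout is missed

        // Simulated monotonic clock: the same 40 s elapse without any jump.
        long[] mono = {1_000_000L};
        IdleTimeoutCheck monoCheck = new IdleTimeoutCheck(() -> mono[0], 30_000L);
        mono[0] += 40_000L;
        System.out.println(monoCheck.isTimedOut()); // true
    }
}
```

In real code, {{IdleTimeoutCheck.monotonicClock()}} would take the place of the simulated clocks.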


> Sanity check default timeout values
> -----------------------------------
>
>                 Key: FLINK-9159
>                 URL: https://issues.apache.org/jira/browse/FLINK-9159
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.2, 1.6.0
>
>
> Check that the default timeout values for resource release are sanely chosen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
