[jira] [Commented] (FLINK-15924) Detect and log blocking main thread operations

Yang Wang (Jira) Wed, 05 Feb 2020 19:00:55 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031219#comment-17031219
 ]


Yang Wang commented on FLINK-15924:
-----------------------------------

This changes is really on point. As you say, it is not easy to guarantee that 
all the I/O blocking operations are not executed in the main thread. So it will 
be good if we could add some {{WARN}} log for the long rpc call. Also we could 
add some RPC processing metrics. For example, in Hadoop Yarn Resourcemanager, 
it have the following logs and metrics.

 
{code:java}
2020-02-06 10:22:34,469 WARN [IPC Server handler 22 on 8030] 
org.apache.hadoop.ipc.Server: Slow RPC : allocate took 1 milliseconds to 
process from client Call#468249 Retry#0 
org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
xx.xx.xx.xx:27064
{code}
 

{code:java}
name: "Hadoop:service=ResourceManager,name=RpcActivityForPort8025",
modelerType: "RpcActivityForPort8025",
tag.port: "8025",
tag.Context: "rpc",
tag.NumOpenConnectionsPerUser: "{"hadoop":17}",
tag.Hostname: "z05c19394.sqa.zth.tbsite.net",
RpcQueueTime10sNumOps: 103,
RpcQueueTime10s50thPercentileLatency: 1,
RpcQueueTime10s75thPercentileLatency: 3,
RpcQueueTime10s90thPercentileLatency: 4,
RpcQueueTime10s95thPercentileLatency: 4,
RpcQueueTime10s99thPercentileLatency: 5,
RpcProcessingTime10sNumOps: 103,
RpcProcessingTime10s50thPercentileLatency: 3,
RpcProcessingTime10s75thPercentileLatency: 6,
RpcProcessingTime10s90thPercentileLatency: 8,
RpcProcessingTime10s95thPercentileLatency: 9,
RpcProcessingTime10s99thPercentileLatency: 10,
DeferredRpcProcessingTime10sNumOps: 0,
DeferredRpcProcessingTime10s50thPercentileLatency: 0,
DeferredRpcProcessingTime10s75thPercentileLatency: 0,
DeferredRpcProcessingTime10s90thPercentileLatency: 0,
DeferredRpcProcessingTime10s95thPercentileLatency: 0,
DeferredRpcProcessingTime10s99thPercentileLatency: 0,
ReceivedBytes: 86356893642,
SentBytes: 1808072466,
RpcQueueTimeNumOps: 41511724,
RpcQueueTimeAvgTime: 1.4071428571428573,
RpcProcessingTimeNumOps: 41511724,
RpcProcessingTimeAvgTime: 3.414285714285714,
... ...
{code}



> Detect and log blocking main thread operations
> ----------------------------------------------
>
>                 Key: FLINK-15924
>                 URL: https://issues.apache.org/jira/browse/FLINK-15924
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Till Rohrmann
>            Priority: Major
>             Fix For: 1.11.0
>
>
> When using the {{RpcEndpoint}} it is important that all operations which run 
> on the main thread are never blocking. We have seen in the past that it is 
> quite hard to always catch blocking operations in reviews and sometimes these 
> changes caused instabilities in Flink. Once this happens it is not trivial to 
> find the culprit which is responsible for the blocking operation.
> One way to make debugging easier is to add a monitor which detects and logs 
> if a {{RpcEndpoint}} operation takes longer than {{n}} seconds for example. 
> Depending on the overhead of this monitor one could even think about enabling 
> it only via a special configuration (e.g. debug mode).
> A proper class to introduce this monitor could be the {{AkkaRpcActor}} which 
> is responsible for executing main thread operations. Whenever we schedule an 
> operation, we could start a timeout which if triggered and the operation has 
> not been completed will log a warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-15924) Detect and log blocking main thread operations

Reply via email to