[
https://issues.apache.org/jira/browse/FLINK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031219#comment-17031219
]
Yang Wang commented on FLINK-15924:
-----------------------------------
This changes is really on point. As you say, it is not easy to guarantee that
all the I/O blocking operations are not executed in the main thread. So it will
be good if we could add some {{WARN}} log for the long rpc call. Also we could
add some RPC processing metrics. For example, in Hadoop Yarn Resourcemanager,
it have the following logs and metrics.
{code:java}
2020-02-06 10:22:34,469 WARN [IPC Server handler 22 on 8030]
org.apache.hadoop.ipc.Server: Slow RPC : allocate took 1 milliseconds to
process from client Call#468249 Retry#0
org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from
xx.xx.xx.xx:27064
{code}
{code:java}
name: "Hadoop:service=ResourceManager,name=RpcActivityForPort8025",
modelerType: "RpcActivityForPort8025",
tag.port: "8025",
tag.Context: "rpc",
tag.NumOpenConnectionsPerUser: "{"hadoop":17}",
tag.Hostname: "z05c19394.sqa.zth.tbsite.net",
RpcQueueTime10sNumOps: 103,
RpcQueueTime10s50thPercentileLatency: 1,
RpcQueueTime10s75thPercentileLatency: 3,
RpcQueueTime10s90thPercentileLatency: 4,
RpcQueueTime10s95thPercentileLatency: 4,
RpcQueueTime10s99thPercentileLatency: 5,
RpcProcessingTime10sNumOps: 103,
RpcProcessingTime10s50thPercentileLatency: 3,
RpcProcessingTime10s75thPercentileLatency: 6,
RpcProcessingTime10s90thPercentileLatency: 8,
RpcProcessingTime10s95thPercentileLatency: 9,
RpcProcessingTime10s99thPercentileLatency: 10,
DeferredRpcProcessingTime10sNumOps: 0,
DeferredRpcProcessingTime10s50thPercentileLatency: 0,
DeferredRpcProcessingTime10s75thPercentileLatency: 0,
DeferredRpcProcessingTime10s90thPercentileLatency: 0,
DeferredRpcProcessingTime10s95thPercentileLatency: 0,
DeferredRpcProcessingTime10s99thPercentileLatency: 0,
ReceivedBytes: 86356893642,
SentBytes: 1808072466,
RpcQueueTimeNumOps: 41511724,
RpcQueueTimeAvgTime: 1.4071428571428573,
RpcProcessingTimeNumOps: 41511724,
RpcProcessingTimeAvgTime: 3.414285714285714,
... ...
{code}
> Detect and log blocking main thread operations
> ----------------------------------------------
>
> Key: FLINK-15924
> URL: https://issues.apache.org/jira/browse/FLINK-15924
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.11.0
>
>
> When using the {{RpcEndpoint}} it is important that all operations which run
> on the main thread are never blocking. We have seen in the past that it is
> quite hard to always catch blocking operations in reviews and sometimes these
> changes caused instabilities in Flink. Once this happens it is not trivial to
> find the culprit which is responsible for the blocking operation.
> One way to make debugging easier is to add a monitor which detects and logs
> if a {{RpcEndpoint}} operation takes longer than {{n}} seconds for example.
> Depending on the overhead of this monitor one could even think about enabling
> it only via a special configuration (e.g. debug mode).
> A proper class to introduce this monitor could be the {{AkkaRpcActor}} which
> is responsible for executing main thread operations. Whenever we schedule an
> operation, we could start a timeout which if triggered and the operation has
> not been completed will log a warning.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)