[
https://issues.apache.org/jira/browse/FLINK-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
vinoyang updated FLINK-15924:
-----------------------------
Description:
When using the {{RpcEndpoint}}, it is important that operations which run on
the main thread never block. We have seen in the past that it is quite hard to
catch every blocking operation in reviews, and sometimes these changes caused
instabilities in Flink. Once this happens, it is not trivial to find the
operation that is responsible for the blocking.
One way to make debugging easier is to add a monitor which detects and logs
any {{RpcEndpoint}} operation that takes longer than, for example, {{n}}
seconds. Depending on the overhead of this monitor, it could even be enabled
only via a special configuration (e.g. a debug mode).
A suitable class for this monitor could be the {{AkkaRpcActor}}, which is
responsible for executing main thread operations. Whenever we schedule an
operation, we could start a timeout which logs a warning if it triggers before
the operation has completed.
was:
When using the {{RpcEndpoint}} it is important that all operations which run on
the main thread are never blocking. We have seen in the past that it is quite
hard to always catch blocking operations in reviews and sometimes these changes
caused instabilities in Flink. Once this happens it is not trivial to find the
culprit which is responsible for the blocking operation.
One way to make debugging easier is to add a monitor which detects and logs if
a {{RpcEndpoint}} operation takes longer than {{n}} seconds for example.
Depending on the overhead of this monitor one could even think about enabling
it only via a special configuration (e.g. debug mode).
A proper class to introduce this monitor could be the {{AkkRpcActor}} which is
responsible for executing main thread operations. Whenever we schedule an
operation, we could start a timeout which if triggered and the operation has
not been completed will log a warning.
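The timeout idea from the description could be sketched roughly as follows.
This is only an illustration under stated assumptions, not Flink code: the
class {{BlockingOperationMonitor}}, its method {{runMonitored}}, and the
warning sink are hypothetical names, and nothing here is wired into the
{{AkkaRpcActor}}. The sketch uses a plain JDK {{ScheduledExecutorService}} as
the timer.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Flink): warns when an operation overruns a threshold. */
final class BlockingOperationMonitor implements AutoCloseable {

    private final ScheduledExecutorService timer;
    private final long thresholdMillis;
    private final Consumer<String> warningSink;

    BlockingOperationMonitor(long thresholdMillis, Consumer<String> warningSink) {
        this.thresholdMillis = thresholdMillis;
        this.warningSink = warningSink;
        // A single daemon thread acts as the timer, so the monitor never keeps the JVM alive.
        this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "blocking-operation-monitor");
            t.setDaemon(true);
            return t;
        });
    }

    /** Runs the operation on the calling thread; warns if it is still running after the threshold. */
    void runMonitored(String description, Runnable operation) {
        final Thread callingThread = Thread.currentThread();
        // Start the timeout before the operation begins.
        ScheduledFuture<?> timeout = timer.schedule(
            () -> warningSink.accept(String.format(
                "Operation '%s' on thread '%s' has been running for more than %d ms.",
                description, callingThread.getName(), thresholdMillis)),
            thresholdMillis,
            TimeUnit.MILLISECONDS);
        try {
            operation.run();
        } finally {
            // The operation finished (or failed): cancel the warning if it has not fired yet.
            timeout.cancel(false);
        }
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

In such a design, the main thread would wrap each operation in
{{runMonitored(...)}}, and the warning sink would forward to the logger. The
per-operation cost is one scheduled task plus one cancellation, which is the
overhead that would decide whether the monitor should sit behind a debug
configuration.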
> Detect and log blocking main thread operations
> ----------------------------------------------
>
> Key: FLINK-15924
> URL: https://issues.apache.org/jira/browse/FLINK-15924
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Till Rohrmann
> Priority: Major
> Fix For: 1.11.0