[
https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010312#comment-17010312
]
Zhu Zhu commented on FLINK-15448:
---------------------------------
I just ran into another case where one cannot even find the host of a
pending/failed TM in the logs (FLINK-15499).
So I think it would be helpful to print the host of a TM in more places than
just the task deployment stage.
Composing the host info into the ResourceID looks to me like a better design
than passing the host around alongside the ResourceID.
There are 2 possible drawbacks though:
1. redundant information in the logs
2. the ResourceID size would double, and the size of certain RPCs (like
heartbeats) may increase. This is a common issue whenever we attach meaningful
information to other IDs, like ExecutionAttemptID and
IntermediateResultPartitionID
For ResourceID, these 2 drawbacks should not be critical.
[~trohrmann] shall we replace ResourceID with an extended class like
TaskManagerID? I think using a general ResourceID for TM/RM/JM alike makes
development less intuitive. And with it we can also limit the change to the
extended class for now.
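
To make the extended-class idea concrete, here is a rough sketch and not a
proposal for the final API. The ResourceID below is a simplified stand-in for
the real Flink class, and the TaskManagerID constructor, its host/port fields
and the sample values are all hypothetical. Identity stays based on the
resource id string only; the host is carried along purely for logging.

```java
import java.util.Objects;

/** Simplified stand-in for Flink's ResourceID (assumption, not the real class). */
class ResourceID {
    private final String resourceId;

    ResourceID(String resourceId) {
        this.resourceId = Objects.requireNonNull(resourceId);
    }

    String getResourceIdString() {
        return resourceId;
    }

    // Equality depends on the id string only, so subclasses that add
    // descriptive fields still compare equal to a plain ResourceID.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof ResourceID)) {
            return false;
        }
        return resourceId.equals(((ResourceID) o).resourceId);
    }

    @Override
    public int hashCode() {
        return resourceId.hashCode();
    }

    @Override
    public String toString() {
        return resourceId;
    }
}

/** Hypothetical extended ID that also knows where the TaskManager runs. */
final class TaskManagerID extends ResourceID {
    private final String host;
    private final int port;

    TaskManagerID(String resourceId, String host, int port) {
        super(resourceId);
        this.host = Objects.requireNonNull(host);
        this.port = port;
    }

    // Only the string form changes: every log line or exception message
    // that prints the ID now includes the host and port.
    @Override
    public String toString() {
        return getResourceIdString() + "@" + host + ":" + port;
    }
}
```

With this, new TaskManagerID("container_xxxx", "host-a", 6122) would print as
"container_xxxx@host-a:6122", e.g. in the heartbeat timeout message, while
lookups keyed by ResourceID keep working unchanged.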
> Log host informations for TaskManager failures.
> -----------------------------------------------
>
> Key: FLINK-15448
> URL: https://issues.apache.org/jira/browse/FLINK-15448
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.9.1
> Reporter: Victor Wong
> Assignee: Victor Wong
> Priority: Minor
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id
> container_xxxx timed out.
> {code}
> We'd like to find out the host of the lost TaskManager so we can log into it
> for more details, but we have to check the previous logs for the host
> information, which is a little time-consuming.
> Maybe we can add more descriptive information to ResourceID of Yarn
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>     final String resourceId;
>     final String details;
>
>     public ResourceID(String resourceId) {
>         this.resourceId = resourceId;
>         this.details = resourceId;
>     }
>
>     public ResourceID(String resourceId, String details) {
>         this.resourceId = resourceId;
>         this.details = details;
>     }
>
>     public String toString() {
>         return details;
>     }
> }
>
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>     final String containerIdStr = container.getId().toString();
>     final String containerDetail = container.getId() + "@" +
>         container.getNodeId();
>     final ResourceID resourceId = new ResourceID(containerIdStr,
>         containerDetail);
>     ...
> }
> {code}