[ 
https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009825#comment-17009825
 ] 

Till Rohrmann commented on FLINK-15448:
---------------------------------------

Thanks for starting this discussion [~victor-wong]. I think you brought up an 
important point: Flink uses a lot of ids in its logs which are hard for users 
to decipher. One example is the {{ResourceIDs}} and {{InstanceIDs}} of the 
{{TaskExecutors}}. Another example is the usage of {{SlotRequestId}} for 
various logical slots and {{AllocationID}}. I believe Flink's logs can be 
cleaned up considerably by associating more meaning with ids.

Often it is a good idea to compose ids from higher-level ids in order to attach 
meaning to them. For example, the {{InstanceID}} could be composed of the 
{{ResourceID}} plus a monotonically increasing value. Of course, the initial 
part of the id should be descriptive as well. Here I agree with [~victor-wong] 
that it would be very helpful if the {{ResourceID}} contained something that 
identifies the machine on which it is running. Of course, this might mean that 
we log some information redundantly.
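
To make the composition idea concrete, here is a rough sketch. All class names 
in it ({{ComposedIdSketch}}, {{SketchResourceId}}, {{SketchInstanceId}}, 
{{InstanceIdFactory}}) and the {{@}} and {{#}} separators are made up for 
illustration. They are not Flink's actual {{ResourceID}} or {{InstanceID}} 
classes, and the real change might look quite different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only: none of these types exist in Flink.
final class ComposedIdSketch {

    // A descriptive resource id: container id plus the host it runs on.
    static final class SketchResourceId {
        private final String containerId;
        private final String host;

        SketchResourceId(String containerId, String host) {
            this.containerId = containerId;
            this.host = host;
        }

        @Override
        public String toString() {
            return containerId + "@" + host;
        }
    }

    // An instance id composed of the resource id plus a sequence number.
    static final class SketchInstanceId {
        private final SketchResourceId resourceId;
        private final long sequence;

        SketchInstanceId(SketchResourceId resourceId, long sequence) {
            this.resourceId = resourceId;
            this.sequence = sequence;
        }

        @Override
        public String toString() {
            return resourceId + "#" + sequence;
        }
    }

    // Hands out monotonically increasing sequence numbers per resource id.
    static final class InstanceIdFactory {
        private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

        SketchInstanceId next(SketchResourceId resourceId) {
            long sequence = counters
                    .computeIfAbsent(resourceId.toString(), key -> new AtomicLong())
                    .incrementAndGet();
            return new SketchInstanceId(resourceId, sequence);
        }
    }

    public static void main(String[] args) {
        SketchResourceId resource = new SketchResourceId("container_01", "worker-3:6122");
        InstanceIdFactory factory = new InstanceIdFactory();
        System.out.println(factory.next(resource)); // container_01@worker-3:6122#1
        System.out.println(factory.next(resource)); // container_01@worker-3:6122#2
    }
}
```

With ids built like this, a log line mentioning the instance id already tells 
the reader which container and host it belongs to, at the cost of repeating the 
host in every such line.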

I see the point that [~xintongsong] raised about breaking existing solutions if 
we change the ids. This is indeed not optimal, but I would be willing to do it 
in order to improve the readability of the logs for our users. Technically, we 
never gave guarantees for the logs.

If we touch the {{ResourceID}} now, then we should also have a plan for the 
other distributed components which use the {{ResourceID}} to identify 
themselves.

> Log host informations for TaskManager failures.
> -----------------------------------------------
>
>                 Key: FLINK-15448
>                 URL: https://issues.apache.org/jira/browse/FLINK-15448
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.1
>            Reporter: Victor Wong
>            Assignee: Victor Wong
>            Priority: Minor
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id 
> container_xxxx  timed out.
> {code}
> To find out the host of the lost TaskManager and log into it for more 
> details, we have to check the previous logs for the host information, 
> which is a little time-consuming.
> Maybe we can add more descriptive information to ResourceID of Yarn 
> containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>   final String resourceId;
>   final String details;
>   public ResourceID(String resourceId) {
>     this.resourceId = resourceId;
>     this.details = resourceId;
>   }
>   public ResourceID(String resourceId, String details) {
>     this.resourceId = resourceId;
>     this.details = details;
>   }
>   @Override
>   public String toString() {
>     return details;
>   }
> }
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   final String containerDetail = container.getId() + "@" + container.getNodeId();
>   final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
>   ...
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
