[ https://issues.apache.org/jira/browse/HADOOP-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peeyush Bishnoi updated HADOOP-4937:
------------------------------------
Attachment: hadoop-4937-1.txt
Thanks, Hemanth, for the necessary information. Attaching a new patch as per
your comments.
--
> [HOD] Include ringmaster RPC port information in the notes attribute
> --------------------------------------------------------------------
>
> Key: HADOOP-4937
> URL: https://issues.apache.org/jira/browse/HADOOP-4937
> Project: Hadoop Core
> Issue Type: New Feature
> Components: contrib/hod
> Reporter: Hemanth Yamijala
> Assignee: Peeyush Bishnoi
> Attachments: hadoop-4937-1.txt, hadoop-4937.txt
>
>
> In large cluster deployments, node failures sometimes cause HOD clusters to be
> allocated but never deallocated, even after the cluster's idleness limit (the
> time for which no jobs are run) has been exceeded. One of the main reasons is
> that the ringmaster process, which is responsible for tracking and cleaning up
> an idle cluster (of which it is a part), itself goes down. To handle such
> scenarios, it makes sense to centrally track the ringmaster nodes of
> suspicious clusters. But since the port the ringmaster is bound to is not
> centrally available, such monitoring is currently impossible.
> This issue is an enhancement request to include ringmaster RPC port
> information along with the JT and NN info as part of the resource manager's
> notes attribute so that it can be used by any monitoring processes built
> around it.
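
As an illustration of the kind of monitoring the description has in mind, here
is a minimal sketch of a check that reads the ringmaster address out of a notes
string and tests whether the port still accepts connections. The notes layout
used here ("key=host:port" fields separated by ";", with the keys "jt", "nn"
and "rm") is purely an assumption for the example; it is not the format used by
HOD or by the attached patch, and the function names are hypothetical.

```python
import socket


def parse_notes(notes):
    """Parse a HYPOTHETICAL 'key=host:port;key=host:port' notes string.

    This layout is an assumption made for illustration only; the real
    notes attribute written by HOD may look different.
    """
    endpoints = {}
    for field in notes.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, addr = field.partition("=")
        host, _, port = addr.rpartition(":")
        endpoints[key.strip()] = (host, int(port))
    return endpoints


def ringmaster_alive(notes, timeout=2.0):
    """Return True if the ringmaster port listed in the notes accepts a TCP connection."""
    endpoints = parse_notes(notes)
    if "rm" not in endpoints:
        # Without the port in the notes there is nothing to probe.
        return False
    host, port = endpoints["rm"]
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful TCP connect only shows that something is listening on that port. A
real monitor would more likely issue an actual RPC call to the ringmaster and
combine the result with the cluster's idle time before flagging it.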