[ 
https://issues.apache.org/jira/browse/HADOOP-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601595#action_12601595
 ] 

Hemanth Yamijala commented on HADOOP-3464:
------------------------------------------

A few comments:

- When ringmaster fails, we print the errors as an array of strings on a 
single line. For better readability, they should be printed one per line.
- When ringmaster fails due to problems with hadoop pkgs, the error message is 
not helpful. It says something like "int cannot be NoneType" or some such. 
This should be improved.
- We use ringmaster.addMasterParams to report errors from the hodrings. This is 
confusing. We should define a new API, something like setHodRingError and 
report errors back using that RPC.
- The PID of the hodring process is part of the 'host' reporting the error. It 
appears this is important, as removing the PID caused the functionality to 
break. However, when we print these messages to the client, the name is printed 
as hostname_pid, which does not make much sense. So we can try and see if the 
PID part can be left out of what the client sees.
- In a few places we construct an XML-RPC client object. If one has already 
been constructed, can we reuse it?
- When hodrings fail due to a config error, we don't report this back. This is 
because error reporting happens only if the getCommand method has been called 
by a hodring. In case of config errors, getCommand is not called and so these 
errors are not caught. The requirement is that we should be able to report 
Master command failures, that is, cases where an internal HDFS daemon or a 
MapRed daemon fails. If there are n nodes in the ring, at least 2 (in case of 
internal) or 1 hodring should come up successfully for the masters. If the 
number of reported failures exceeds what this allows, we can report a failure 
to the service registry client.
- When a hadoop daemon fails, the message simply says failed to launch hadoop 
command. Typically the daemon.err file has more useful information. If 
possible, this should be fetched and displayed to the client.
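
To make a few of the points above concrete, here is a rough Python sketch. It 
is not HOD code: the function names (split_host_pid, format_hodring_errors, 
masters_failed) and the (reporter, message) input shape are hypothetical, and 
the threshold check is only one reading of the "at least 2 or 1 hodring" rule, 
namely that the masters have failed once fewer than the required number of 
hodrings can still come up.

```python
# Hypothetical helpers for illustration only; none of these names exist in HOD.

def split_host_pid(reporter):
    """Split a 'hostname_pid' reporter name into (hostname, pid).

    If the name has no numeric suffix, it is returned unchanged with pid None.
    """
    host, sep, pid = reporter.rpartition("_")
    if sep and pid.isdigit():
        return host, int(pid)
    return reporter, None


def format_hodring_errors(errors):
    """Return one line per error, showing only the hostname to the client.

    'errors' is assumed to be a list of (reporter, message) pairs, where
    reporter is the 'hostname_pid' key used internally.
    """
    lines = []
    for reporter, message in errors:
        host, _pid = split_host_pid(reporter)
        lines.append("%s: %s" % (host, message))
    return "\n".join(lines)


def masters_failed(num_nodes, num_failures, internal_hdfs):
    """Assumed rule: 2 hodrings must succeed with internal HDFS, else 1."""
    required = 2 if internal_hdfs else 1
    return num_nodes - num_failures < required


if __name__ == "__main__":
    sample = [("node1_1234", "failed to launch hadoop command"),
              ("node2_99", "bad config")]
    print(format_hodring_errors(sample))
    print(masters_failed(num_nodes=4, num_failures=3, internal_hdfs=True))
```

The PID stays in the internal key, so whatever depends on it keeps working, and 
it is dropped only when the message is formatted for the client.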

Will try and submit a patch addressing these points.

> [HOD] HOD can improve error messages by reporting failures on compute nodes 
> back to hod client
> ----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3464
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3464
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3464, HADOOP-3464.1
>
>
> This issue addresses error messages w.r.t failures on compute nodes, while 
> HADOOP-3151 addresses error messages in hod client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
