The JIRA log shows that the committed patch lets each BSP peer report its status directly back to the master. One issue we may need to consider now is: how can we determine whether a groom server has failed? With the original mechanism, the groom server manages its tasks (BSP peers) and the master takes care of the groom servers. For instance, if a groom server fails, the master can reschedule all tasks assigned to that groom server onto other working ones. With the current mechanism, the master, in addition to monitoring the activity of the groom servers, also has to deal with the BSP peers. Do we already have a plan for this?
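To make the question concrete, here is a rough sketch of the kind of groom-level failure detection I have in mind. It is only an illustration under my own assumptions: the class name GroomMonitor, its methods, and the timeout value are all hypothetical and are not taken from the Hama code base or from the HAMA-413 patch. The master records the last heartbeat per groom server and, when a groom has been silent for longer than the timeout, collects that groom's tasks so they can be rescheduled elsewhere.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of master-side groom failure detection.
 * None of these names come from the Hama code base.
 */
class GroomMonitor {
    private final long timeoutMillis;
    // Last heartbeat time seen from each groom server.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    // Tasks currently assigned to each groom server.
    private final Map<String, List<String>> tasksByGroom = new HashMap<>();

    GroomMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record that a task was scheduled on the given groom server. */
    void assignTask(String groom, String taskId) {
        tasksByGroom.computeIfAbsent(groom, g -> new ArrayList<>()).add(taskId);
    }

    /** Record a heartbeat; a groom is only tracked after its first heartbeat. */
    void heartbeat(String groom, long nowMillis) {
        lastHeartbeat.put(groom, nowMillis);
    }

    /**
     * Drop every groom whose last heartbeat is older than the timeout and
     * return the tasks that were assigned to those grooms, so the caller
     * can reschedule them on working grooms.
     */
    List<String> expireGrooms(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        List<String> orphaned = new ArrayList<>();
        for (String groom : dead) {
            lastHeartbeat.remove(groom);
            List<String> tasks = tasksByGroom.remove(groom);
            if (tasks != null) {
                orphaned.addAll(tasks);
            }
        }
        return orphaned;
    }
}
```

With this shape the master only needs to track groom servers in order to find orphaned tasks. If BSP peers report to the master directly, something equivalent would have to exist per peer as well, which is the extra burden I am asking about.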

-----Original message-----
From:Edward J. Yoon <[email protected]>
To:[email protected] <[email protected]>
Date:Fri, 26 Aug 2011 15:11:56 +0900
Subject:Re: Summary of problems with HAMA-413 and Discussion

Okay.

Sent from my iPhone

On 2011. 8. 26., at 2:49 PM, "ChiaHung Lin" <[email protected]> wrote:

> The latest patch (HAMA_NEW.patch) for HAMA-413 seems still using bsp peer to
> report its status back to master.
>
> + umbilical.updateTaskStatusAndReport(taskid);
>
> + public void updateTaskStatusAndReport(TaskAttemptID taskid) {
> ...
> + doReport(taskStatus);
> + }
>
> Is there any chance to revert back using a version that reports task status
> by GroomServer, so we can discuss based on that version? Just to ensure that
> the following issues are not the result derived from the code changed above.
>
> -----Original message-----
> From:Edward J. Yoon <[email protected]>
> To:[email protected]
> Date:Thu, 25 Aug 2011 19:43:48 +0900
> Subject:Summary of problems with HAMA-413 and Discussion
>
> Today, I tested all Hama examples on my cluster of 32 nodes, with 96
> tasks. Pi and Serialized Printing examples were working fine but
>
> 1. Barrier Synchronizations are not working well (with a 'bench' example).
> 2. When an unexpected shutdown occurs, ZK nodes (which created by each
> BSPPeer) will not be deleted. There's no way to clean them up before
> reboot the server.
> 3. Graph examples are not working.
> 4. Too many reporting times between Groom and Master.
> 5. And, there are many code issues that can be improved.
>
> 1, and 2 issues are already reported (See HAMA-387, HAMA-407). Some of
> 3, 4, and 5 issues are already started by ChiaHung Lin.
>
> All issues around this should be fixed in HAMA-413? or, Should we just
> commit HAMA-413?
>
> Thanks.
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan

--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan
