As you said, the statuses of the groom (and its tasks) must be reported periodically for many reasons, e.g., fault management, job progress reports, etc.
I've opened a Jira ticket, HAMA-429, today; let's discuss these issues there.
Please feel free to assign it to yourself if you are willing to design and fix
them.

On Tue, Aug 30, 2011 at 12:35 PM, ChiaHung Lin <[email protected]> wrote:
> From the jira log it shows that the committed patch lets the bsp peer directly
> report status back to the master. An issue we may need to consider right now
> is `how can we determine if a groom server fails?' With the original mechanism
> we can allow the groom server to manage tasks (bsp peers) while the master
> takes care of groom servers. For instance, if a groom server fails, the master
> can reschedule all tasks assigned to that groom server to another working one.
> With the current mechanism, the master, in addition to monitoring the activity
> of groom servers, also needs to deal with bsp peers. Do we have some plans on
> this already?
>
> -----Original message-----
> From:Edward J. Yoon <[email protected]>
> To:[email protected] <[email protected]>
> Date:Fri, 26 Aug 2011 15:11:56 +0900
> Subject:Re: Summary of problems with HAMA-413 and Discussion
>
> Okay.
>
> Sent from my iPhone
>
> On 2011. 8. 26., at 2:49 PM, "ChiaHung Lin" <[email protected]> wrote:
>
>> The latest patch (HAMA_NEW.patch) for HAMA-413 still seems to use the bsp
>> peer to report its status back to the master.
>>
>> + umbilical.updateTaskStatusAndReport(taskid);
>>
>> + public void updateTaskStatusAndReport(TaskAttemptID taskid) {
>> ...
>> + doReport(taskStatus);
>> + }
>>
>> Is there any chance to revert to a version that reports task status
>> through the GroomServer, so we can discuss based on that version? Just to
>> ensure that the following issues are not caused by the code change above.
>>
>> -----Original message-----
>> From:Edward J. Yoon <[email protected]>
>> To:[email protected]
>> Date:Thu, 25 Aug 2011 19:43:48 +0900
>> Subject:Summary of problems with HAMA-413 and Discussion
>>
>> Today, I tested all Hama examples on my cluster of 32 nodes, with 96
>> tasks. Pi and Serialized Printing examples were working fine but
>>
>> 1. Barrier Synchronizations are not working well (with a 'bench' example).
>> 2. When an unexpected shutdown occurs, ZK nodes (which are created by each
>> BSPPeer) will not be deleted. There's no way to clean them up before
>> rebooting the server.
>> 3. Graph examples are not working.
>> 4. Status is reported too many times between Groom and Master.
>> 5. And, there are many code issues that can be improved.
>>
>> Issues 1 and 2 are already reported (see HAMA-387, HAMA-407). Work on some
>> of issues 3, 4, and 5 has already been started by ChiaHung Lin.
>>
>> Should all issues around this be fixed in HAMA-413? Or should we just
>> commit HAMA-413?
>>
>> Thanks.
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>
>>
>> --
>> ChiaHung Lin
>> Department of Information Management
>> National University of Kaohsiung
>> Taiwan
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>

--
Best Regards, Edward J. Yoon
@eddieyoon
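
To make the question quoted above ("how can we determine if a groom server fails?") more concrete, here is a rough sketch of the groom-level scheme it describes: the master remembers when each groom last reported, treats a groom as failed after a timeout, and collects that groom's tasks for rescheduling. This is only an illustration under assumptions. The class `GroomLivenessTracker`, its methods, and the timeout value are hypothetical and are not part of Hama or of any HAMA-413 patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical master-side bookkeeping; not Hama's actual API. */
class GroomLivenessTracker {
  private final long timeoutMillis;
  // Groom name -> time of its most recent report.
  private final Map<String, Long> lastSeen = new HashMap<>();
  // Groom name -> IDs of the tasks it said it was running.
  private final Map<String, List<String>> tasksByGroom = new HashMap<>();

  GroomLivenessTracker(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Called when a groom reports in with the IDs of the tasks it runs. */
  synchronized void heartbeat(String groom, List<String> taskIds, long nowMillis) {
    lastSeen.put(groom, nowMillis);
    tasksByGroom.put(groom, new ArrayList<>(taskIds));
  }

  /**
   * Drops every groom that has been silent for longer than the timeout and
   * returns the tasks that were running on those grooms, so the caller can
   * reschedule them on a working groom.
   */
  synchronized List<String> expire(long nowMillis) {
    List<String> dead = new ArrayList<>();
    for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
      if (nowMillis - e.getValue() > timeoutMillis) {
        dead.add(e.getKey());
      }
    }
    List<String> orphaned = new ArrayList<>();
    for (String groom : dead) {
      lastSeen.remove(groom);
      orphaned.addAll(tasksByGroom.remove(groom));
    }
    return orphaned;
  }

  synchronized boolean isAlive(String groom) {
    return lastSeen.containsKey(groom);
  }
}
```

In this sketch the master only tracks grooms, so the number of reports grows with the number of grooms rather than with the number of bsp peers. Timestamps are passed in explicitly to keep the example easy to test; a real master would presumably read a clock.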
