As you said, the status of each groom (and its tasks) must be reported
periodically for many reasons, e.g., fault management and job progress
reporting.

I opened a Jira ticket, HAMA-429, today. Let's discuss these issues there.

Please feel free to assign it to yourself if you are willing to design and implement the fix.
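
To make the discussion concrete, here is a rough sketch of how a master
could notice a failed groom from periodic status reports. All names in it
(GroomMonitor, report, expireGrooms, the timeout value) are hypothetical
and not Hama's actual API; it only illustrates the timeout idea, with
timestamps passed in explicitly so it is easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of timeout-based groom failure detection (not Hama code). */
class GroomMonitor {
    private final long timeoutMillis;
    // groom name -> time of its last status report, in milliseconds
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    GroomMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a groom reports its own (and its tasks') status. */
    void report(String groomName, long nowMillis) {
        lastSeen.put(groomName, nowMillis);
    }

    /** Removes and returns the grooms whose last report is older than the timeout. */
    List<String> expireGrooms(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        // Forget expired grooms so each failure is returned only once.
        for (String groom : failed) {
            lastSeen.remove(groom);
        }
        return failed;
    }
}
```

The master would call expireGrooms periodically and reschedule the tasks
of every groom it returns onto a working one.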

On Tue, Aug 30, 2011 at 12:35 PM, ChiaHung Lin <[email protected]> wrote:
> The jira log shows that the committed patch lets the bsp peer report its 
> status directly back to the master. An issue we may need to consider now is 
> `how can we determine if a groom server fails?' With the original mechanism, 
> the groom server manages tasks (bsp peers) and the master takes care of the 
> groom servers. For instance, if a groom server fails, the master can 
> reschedule all tasks assigned to that groom server to another working one. 
> With the current mechanism, the master, in addition to monitoring the 
> activity of groom servers, also needs to deal with bsp peers. Do we have any 
> plans for this already?
>
> -----Original message-----
> From:Edward J. Yoon <[email protected]>
> To:[email protected] <[email protected]>
> Date:Fri, 26 Aug 2011 15:11:56 +0900
> Subject:Re: Summary of problems with HAMA-413 and Discussion
>
> Okay.
>
> Sent from my iPhone
>
> On 2011. 8. 26., at 2:49 PM, "ChiaHung Lin" <[email protected]> wrote:
>
>> The latest patch (HAMA_NEW.patch) for HAMA-413 still seems to use the bsp 
>> peer to report its status back to the master.
>>
>> +        umbilical.updateTaskStatusAndReport(taskid);
>>
>> +  public void updateTaskStatusAndReport(TaskAttemptID taskid) {
>> ...
>> +    doReport(taskStatus);
>> +  }
>>
>> Is there any chance of reverting to a version in which the GroomServer 
>> reports task status, so we can base the discussion on that version? Just to 
>> make sure that the following issues are not caused by the code change above.
>>
>> -----Original message-----
>> From:Edward J. Yoon <[email protected]>
>> To:[email protected]
>> Date:Thu, 25 Aug 2011 19:43:48 +0900
>> Subject:Summary of problems with HAMA-413 and Discussion
>>
>> Today, I tested all Hama examples on my cluster of 32 nodes, with 96
>> tasks. The Pi and Serialized Printing examples worked fine, but:
>>
>> 1. Barrier Synchronizations are not working well (with a 'bench' example).
>> 2. When an unexpected shutdown occurs, the ZK nodes (which are created by
>> each BSPPeer) are not deleted. There's no way to clean them up before
>> rebooting the server.
>> 3. Graph examples are not working.
>> 4. There are too many reports between Groom and Master.
>> 5. There are also many code issues that could be improved.
>>
>> Issues 1 and 2 are already reported (see HAMA-387, HAMA-407). Work on some
>> of issues 3, 4, and 5 has already been started by ChiaHung Lin.
>>
>> Should all of these issues be fixed in HAMA-413, or should we just
>> commit HAMA-413?
>>
>> Thanks.
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>
>>
>> --
>> ChiaHung Lin
>> Department of Information Management
>> National University of Kaohsiung
>> Taiwan
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
