P.S. I know this task is not easy. Should we reschedule this for the 0.5 release?

On Tue, Oct 25, 2011 at 5:22 PM, Edward J. Yoon <[email protected]> wrote:
>> 1.) Checkpointer runs in the same process as the BSP task.
>> 2.) A separate checkpointing process per BSP task on each machine.
>> 3.) A separate checkpointing process per machine.
>> 4.) Checkpointing processes in the form of a server farm.
>
> When a task fails, all tasks will be restarted from the previous
> checkpoint data, right?
>
> I'm +1 for the first idea. I believe this way is simple and reliable.
>
> 2011/10/25 ChiaHung Lin <[email protected]>:
>> Just some thoughts on why the checkpointer should be programmed as a
>> separate process. The idea is centered on isolation. Because faults will
>> occur, ensuring that failures/errors do not adversely affect other parts
>> of the system becomes critical. Also, performing user tasks and saving data
>> to HDFS are two different concerns, so our goal is to ensure user tasks
>> continue to work even if the checkpointing process fails. As long as user
>> tasks keep doing their job smoothly, a failed checkpointing process can be
>> ignored.
>>
>> There were 4 options considered previously:
>>
>> 1.) Checkpointer runs in the same process as the BSP task.
>> 2.) A separate checkpointing process per BSP task on each machine.
>> 3.) A separate checkpointing process per machine.
>> 4.) Checkpointing processes in the form of a server farm.
>>
>> The problem with the first one is that if the checkpointing process fails,
>> user tasks may fail as well, which is unwanted behaviour for users. The
>> fourth has the problem that, if both processes fail, recovery affects
>> arbitrary user tasks. The second and third are similar, except that the
>> second option minimizes the number of user tasks affected if both processes
>> fail. Running the checkpointer as a separate process has the advantage that
>> if only the checkpointing process fails, no recovery is necessary. For
>> example, suppose a BSP job performs its tasks from supersteps 1 to 10. At
>> the same time a separate checkpointing process stands by. In the first 3
>> supersteps, both processes work well. After superstep 4, the checkpointing
>> process fails, but the user task continues doing its work. At superstep 7,
>> the checkpointer is back (e.g. restarted). If the user task keeps working
>> until it finishes, there is no need to perform recovery in this case. If
>> the BSP task fails after the checkpointing process is back, the system has
>> a chance to recover from the latest snapshot.
>>
>> I understand the current implementation is not perfect. But it would be
>> good if we could work in this direction because, to the best of my
>> knowledge, this is the recommended approach.
>>
>> -----Original message-----
>> From:Thomas Jungblut <[email protected]>
>> To:[email protected]
>> Date:Fri, 14 Oct 2011 15:54:10 +0200
>> Subject:Checkpointer Process
>>
>> Hi all.
>> My idea:
>> With YARN and multitasking, we should consider moving the Checkpointer
>> process into the BSPPeer itself instead of running it as its own process.
>>
>> It would be great if we could discuss the real advantages and
>> disadvantages of integrating it into the same process versus running it
>> as a daemon process.
>>
>> --
>> Thomas Jungblut
>> Berlin <[email protected]>
>>
>>
>> --
>> ChiaHung Lin
>> Department of Information Management
>> National University of Kaohsiung
>> Taiwan
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
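
For reference, the superstep scenario quoted above (checkpointer fails partway through, comes back at superstep 7) can be sketched as a small simulation. This is only an illustration, not Hama code: `run_job` and its parameters are hypothetical names, and it assumes the separate checkpointer misses supersteps 4 to 6 and that a snapshot is taken at the end of every superstep during which the checkpointer is alive.

```python
def run_job(total_supersteps, checkpointer_down, task_fails_at=None):
    """Simulate a BSP task with a separate checkpointer process.

    total_supersteps  -- number of supersteps in the job (1-based)
    checkpointer_down -- set of supersteps during which the checkpointer
                         is not running (assumption for illustration)
    task_fails_at     -- superstep in which the user task crashes, or None
    """
    latest_snapshot = None  # last superstep that was successfully saved
    for step in range(1, total_supersteps + 1):
        if step == task_fails_at:
            # The user task crashed before finishing this superstep:
            # recovery restarts from the latest snapshot, if there is one.
            return {"finished": False, "recover_from": latest_snapshot}
        if step not in checkpointer_down:
            # Checkpointer is alive, so this superstep is snapshotted.
            latest_snapshot = step
        # If the checkpointer is down, the user task simply keeps going.
    return {"finished": True, "recover_from": None}


# Checkpointer outage only: the job finishes and no recovery is needed.
print(run_job(10, {4, 5, 6}))
# Task fails after the checkpointer is back: recover from a recent snapshot.
print(run_job(10, {4, 5, 6}, task_fails_at=9))
# Task fails during the outage: recovery falls back to an older snapshot.
print(run_job(10, {4, 5, 6}, task_fails_at=6))
```

The third call shows the cost of the outage: work done while the checkpointer was down has to be repeated if the task fails before a new snapshot is taken.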
