> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.

When a task fails, all tasks will be restarted from the previous
checkpoint data, right?

I'm +1 for the first option. I believe it is simple and reliable.
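
To make the recovery rule concrete, here is a rough sketch of the superstep example from the quoted message below. It is only an illustration, not Hama code: `run_job`, `checkpointer_down` and `task_fails_at` are hypothetical names, and I am assuming the checkpointer is down for supersteps 4 through 6 and that a snapshot is written at the end of every superstep in which it is up.

```python
from typing import Optional, Set


def run_job(total_supersteps: int,
            checkpointer_down: Set[int],
            task_fails_at: Optional[int] = None) -> dict:
    """Simulate one BSP job with a checkpointer that may be down.

    Hypothetical model, not the Hama API: a snapshot is recorded at the
    end of each superstep during which the checkpointer is up.
    """
    snapshots = []  # supersteps for which a snapshot exists
    for step in range(1, total_supersteps + 1):
        if task_fails_at == step:
            # The task failed: roll back to the latest snapshot, if any.
            restart_from = snapshots[-1] if snapshots else None
            return {"finished": False, "restart_from": restart_from}
        if step not in checkpointer_down:
            snapshots.append(step)
    # The task never failed, so no recovery is needed, even though
    # some supersteps were never checkpointed.
    return {"finished": True, "restart_from": None}


# Checkpointer fails around superstep 4 and is back at superstep 7.
down = {4, 5, 6}
no_failure = run_job(10, down)
late_failure = run_job(10, down, task_fails_at=9)
gap_failure = run_job(10, down, task_fails_at=6)
```

Under these assumptions, a job whose task never fails finishes without any recovery, a failure after the checkpointer is back rolls back to a recent snapshot, and a failure during the gap falls back to the last snapshot taken before the checkpointer went down.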

2011/10/25 ChiaHung Lin <[email protected]>:
> Just some thoughts on why to implement the checkpointer as a separate process. 
> The idea is centered on isolation. Because faults will occur, ensuring that 
> failures/errors do not adversely affect other parts of the system becomes 
> critical. Also, performing user tasks and saving data to HDFS are two 
> different concerns, so our goal is to ensure user tasks keep working 
> even if the checkpointing process fails. As long as user tasks keep 
> doing their job smoothly, a failed checkpointing process can be ignored.
>
> There were 4 options considered previously:
>
> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.
>
> The problem with the first option is that if the checkpointing process fails, 
> user tasks may fail as well, which is unwanted behaviour for users. The fourth 
> has the problem that recovery affects arbitrary user tasks if both 
> processes fail. The second and third are similar, except that the second 
> option minimizes the number of user tasks affected if both processes fail. 
> Running the checkpointer as a separate process has the advantage that if only 
> the checkpointing process fails, no recovery is necessary. For example, 
> suppose a BSP job runs supersteps 1 to 10 while a separate checkpointing 
> process stands by. In the first 3 supersteps, both processes work well. After 
> superstep 4, the checkpointing process fails, but the user task continues 
> doing its work. At superstep 7, the checkpointer is back (e.g. restarted). 
> If the user task keeps working until it finishes, there is no need to perform 
> recovery in this case. If the bsp task fails after the checkpointing process 
> is back, the system has a chance to recover from the latest snapshot.
>
> I understand the current implementation is not perfect. But it would be 
> good if we could work in this direction because, to the best of my 
> knowledge, this is the recommended approach.
>
> -----Original message-----
> From:Thomas Jungblut <[email protected]>
> To:[email protected]
> Date:Fri, 14 Oct 2011 15:54:10 +0200
> Subject:Checkpointer Process
>
> Hi all.
> My idea:
> Now that we have YARN and multitasking, we should consider moving the
> Checkpointer process into the BSPPeer itself instead of a single process.
>
> It would be great if we could discuss the real advantages and disadvantages
> of integrating it into the same process versus running it as a daemon process.
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
