> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.

When some task fails, all tasks will be restarted from the previous
checkpoint data. Right?

I'm +1 for the first idea. I believe this way is simple and reliable.

2011/10/25 ChiaHung Lin <[email protected]>:
> Just some thoughts on why to program the checkpointer as a separate
> process. The idea is centered on isolation. Because faults will occur,
> ensuring that failures/errors do not adversely affect other parts of the
> system becomes critical. Also, performing user tasks and saving data to
> hdfs are two different concerns, so our goal is to ensure user tasks keep
> working even if the checkpointing process fails. As long as user tasks keep
> performing their job smoothly, a failed checkpointing process can be
> ignored.
>
> There were 4 options considered previously:
>
> 1.) Checkpointer runs in the same process as the bsp task.
> 2.) A separate checkpointing process per bsp task on each machine.
> 3.) A separate checkpointing process per machine.
> 4.) Checkpointing processes in the form of a server farm.
>
> The problem with the first one is that if the checkpointing process fails,
> user tasks may fail as well, which is unwanted behaviour for users. The
> problem with the fourth is that, if both processes fail, recovery affects
> arbitrary user tasks. The second and third are similar, except that the
> second option would minimize the number of user tasks affected if both
> processes fail. Running the checkpointer as a separate process has the
> advantage that if only the checkpointing process fails, it is not necessary
> to recover. For example, suppose a BSP job performs its tasks from
> supersteps 1 to 10. At the same time a separate checkpointing process
> stands by. In the first 3 supersteps, both processes work well. After
> superstep 4, the checkpointing process fails, but the user task continues
> doing its work. At superstep 7, the checkpointer is back (e.g. restarted).
> If the user task keeps working until it finishes, there is no need to
> perform recovery in this case. If the bsp task fails after the
> checkpointing process is back, the system has a chance to recover from the
> latest snapshot.
>
> I understand the current implementation is not perfect. But it would be
> good if we could work in this direction because, to the best of my
> knowledge, this is the recommended approach.
>
> -----Original message-----
> From:Thomas Jungblut <[email protected]>
> To:[email protected]
> Date:Fri, 14 Oct 2011 15:54:10 +0200
> Subject:Checkpointer Process
>
> Hi all.
> My idea:
> With YARN and multitasking, we should consider moving the Checkpointer
> process into the BSPPeer itself instead of a single process.
>
> It would be great if we could discuss what the real advantages and
> disadvantages would be of integrating it in the same process versus a
> daemon process.
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>

--
Best Regards, Edward J. Yoon
@eddieyoon
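
The superstep example quoted above (a job running supersteps 1 to 10 while a separate checkpointer is down for a while) can be sketched as a small simulation. This is only an illustrative toy model, not Hama code: the function name `run_job`, its parameters, and the choice to treat supersteps 5 and 6 as the checkpointer's outage window are all assumptions made for the example.

```python
def run_job(total_supersteps, checkpointer_down, task_fails_at=None):
    """Toy model of a BSP task with a separate checkpointer (hypothetical).

    checkpointer_down: set of supersteps during which the checkpointer
    process is unavailable, so no snapshot is written for them.
    task_fails_at: superstep at which the user task crashes, or None.
    """
    snapshots = []  # supersteps for which a snapshot exists
    for step in range(1, total_supersteps + 1):
        if step == task_fails_at:
            # The task crashed: restart from the latest snapshot, if any.
            restart_from = snapshots[-1] if snapshots else 0
            return {"finished": False,
                    "snapshots": snapshots,
                    "restart_from": restart_from}
        # The user task completes the superstep regardless of whether
        # the checkpointer is alive; only the snapshot is skipped.
        if step not in checkpointer_down:
            snapshots.append(step)
    return {"finished": True, "snapshots": snapshots, "restart_from": None}


# Checkpointer fails after superstep 4 and is back at superstep 7.
outage = {5, 6}

# Case 1: the task never fails, so no recovery is needed.
print(run_job(10, outage))

# Case 2: the task fails at superstep 9, after the checkpointer is back.
print(run_job(10, outage, task_fails_at=9))
```

In this model the first call finishes with a gap in the snapshots and never needs recovery, while the second call restarts from superstep 8, the latest snapshot taken after the checkpointer came back. A task failure inside the outage window (for instance `task_fails_at=6`) would roll back further, to superstep 4.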
