Just some thoughts on why the checkpointer should run as a separate process. The idea centers on isolation. Because faults will occur, it is critical that a failure in one part of the system does not adversely affect the others. Also, performing user tasks and saving data to HDFS are two different concerns, so our goal is to ensure that user tasks keep working even if the checkpointing process fails. As long as the user tasks keep doing their job smoothly, a failure of the checkpointing process can be ignored.
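A minimal sketch of that isolation, in case it helps the discussion. All names here (CheckpointIsolationSketch, Checkpointer, trySnapshot, run) are hypothetical and are not Hama's API. A real checkpointer would be a separate OS process writing to HDFS; here it is only modelled as an object whose outage is reported through a return value, so that the task loop can ignore it and carry on.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointIsolationSketch {

    /** Hypothetical stand-in for a separate checkpointing process. */
    static class Checkpointer {
        private final int downFrom;      // first superstep during which it is down
        private final int downUntil;     // last superstep during which it is down
        private int latestSnapshot = 0;  // 0 means no snapshot has been saved yet

        Checkpointer(int downFrom, int downUntil) {
            this.downFrom = downFrom;
            this.downUntil = downUntil;
        }

        /** Returns true if a snapshot was saved, false if the checkpointer is down. */
        boolean trySnapshot(int superstep) {
            if (superstep >= downFrom && superstep <= downUntil) {
                return false;
            }
            latestSnapshot = superstep;
            return true;
        }

        int latestSnapshot() {
            return latestSnapshot;
        }
    }

    /** Runs supersteps 1..total and returns the supersteps that were snapshotted. */
    static List<Integer> run(int total, Checkpointer checkpointer) {
        List<Integer> saved = new ArrayList<>();
        for (int superstep = 1; superstep <= total; superstep++) {
            // The user computation for this superstep would happen here.
            if (checkpointer.trySnapshot(superstep)) {
                saved.add(superstep);
            }
            // A failed snapshot is simply skipped; the task never stops for it.
        }
        return saved;
    }

    public static void main(String[] args) {
        // Assumed outage: the checkpointer is down during supersteps 5 and 6.
        Checkpointer checkpointer = new Checkpointer(5, 6);
        System.out.println("snapshots taken at: " + run(10, checkpointer));
        System.out.println("latest snapshot: " + checkpointer.latestSnapshot());
    }
}
```

The task loop finishes all ten supersteps regardless of the outage. The only cost is that no snapshot exists for the supersteps during which the checkpointer was down, which matters only if the task itself fails in that window.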
Four options were considered previously:

1.) The checkpointer runs in the same process as the bsp task.
2.) A separate checkpointing process per bsp task on each machine.
3.) A separate checkpointing process per machine.
4.) Checkpointing processes in the form of a server farm.

The problem with the first is that if the checkpointing code fails, the user task may fail as well, which is unwanted behaviour for users. The problem with the fourth is that if both a user task and a checkpointing process fail, recovery may affect arbitrary user tasks. The second and third are similar, except that the second minimizes the number of user tasks affected if both processes fail.

Running the checkpointer as a separate process has the advantage that if only the checkpointing process fails, no recovery is necessary. For example, suppose a BSP job runs its tasks from superstep 1 to superstep 10, with a separate checkpointing process standing by. In the first 3 supersteps, both processes work well. After superstep 4, the checkpointing process fails, but the user task continues doing its work. At superstep 7, the checkpointer is back (e.g. after a restart). If the user task keeps working until it finishes, there is no need to perform recovery in this case. If the bsp task fails after the checkpointing process is back, the system has a chance to recover from the latest snapshot.

I understand the current implementation is not perfect. But it would be good if we could work in this direction because, to the best of my knowledge, this is the recommended approach.

-----Original message-----
From: Thomas Jungblut <[email protected]>
To: [email protected]
Date: Fri, 14 Oct 2011 15:54:10 +0200
Subject: Checkpointer Process

Hi all.

My idea: Since YARN and multitasking we should consider moving the Checkpointer process into the BSPPeer itself instead of a single process. It would be great if we could discuss what would be the real advantage and disadvantage of integrating it in the same process / a daemon process.
--
Thomas Jungblut
Berlin <[email protected]>

--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan
