P.S., I know this task is not easy. Should we reschedule this for the 0.5 release?
On Tue, Oct 25, 2011 at 5:22 PM, Edward J. Yoon <[email protected]> wrote:
>> 1.) Checkpointer runs in the same process as the bsp task.
>> 2.) A separate checkpointing process per bsp task on each machine.
>> 3.) A separate checkpointing process per machine.
>> 4.) Checkpointing processes in the form of a server farm.
>
> When some task fails, all tasks will be restarted from the previous
> checkpoint data. Right?
>
> I'm +1 for the first idea. I believe this way is simple and reliable.
>
> 2011/10/25 ChiaHung Lin <[email protected]>:
>> Just some thoughts on why to implement the checkpointer as a separate
>> process. The idea is centered on isolation. Because faults will occur,
>> ensuring that failures/errors do not adversely affect other parts of the
>> system becomes critical. Also, performing user tasks and saving data to
>> hdfs are two different concerns, so our goal is to ensure user tasks keep
>> working even if the checkpointing process fails. As long as user tasks
>> keep performing their job smoothly, a checkpointing process failure can
>> be ignored.
>>
>> There were 4 options considered previously:
>>
>> 1.) Checkpointer runs in the same process as the bsp task.
>> 2.) A separate checkpointing process per bsp task on each machine.
>> 3.) A separate checkpointing process per machine.
>> 4.) Checkpointing processes in the form of a server farm.
>>
>> The problem with the first one is that if the checkpointing process
>> fails, user tasks may fail as well, which is unwanted behaviour for
>> users. The problem with the fourth is that, if both processes fail,
>> recovery affects arbitrary user tasks. The second and third are similar,
>> except that the second option would minimize the user tasks affected if
>> both processes fail. Running the checkpointer as a separate process has
>> the advantage that if only the checkpointing process fails, it is not
>> necessary to recover. For example, suppose a BSP job performs its tasks
>> from supersteps 1 to 10. At the same time a separate checkpointing
>> process stands by. In the first 3 supersteps, both processes work well.
>> After superstep 4, the checkpointing process fails, but the user task
>> keeps doing its task. At superstep 7, the checkpointer is back (e.g.
>> restarted). If the user task keeps working until it finishes, there is
>> no need to perform recovery in this case. If the bsp task fails after
>> the checkpointing process is back, the system has a chance to recover
>> from the latest snapshot.
>>
>> I understand the current implementation is not perfect. But it would be
>> good if we could work in this direction because, to the best of my
>> knowledge, this is the recommended approach.
>>
>> -----Original message-----
>> From: Thomas Jungblut <[email protected]>
>> To: [email protected]
>> Date: Fri, 14 Oct 2011 15:54:10 +0200
>> Subject: Checkpointer Process
>>
>> Hi all.
>> My idea:
>> Given YARN and multitasking, we should consider moving the Checkpointer
>> process into the BSPPeer itself instead of a single process.
>>
>> It would be great if we could discuss what the real advantages and
>> disadvantages would be of integrating it in the same process versus a
>> daemon process.
>>
>> --
>> Thomas Jungblut
>> Berlin <[email protected]>
>>
>>
>> --
>> ChiaHung Lin
>> Department of Information Management
>> National University of Kaohsiung
>> Taiwan
>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

--
Best Regards, Edward J. Yoon
@eddieyoon
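
For reference, the superstep example in the quoted message can be sketched as a tiny simulation. This is only an illustration, not Hama code: the class `CheckpointTimeline` and its methods are hypothetical names, and it assumes a snapshot is written at the end of every superstep while the checkpointer is up, and that the checkpointer is down for supersteps 4 through 6 (the message does not say whether the snapshot for superstep 4 itself succeeded).

```java
import java.util.OptionalInt;

// Hypothetical sketch of a separate checkpointer that is down for a while.
// Assumption: one snapshot per completed superstep while the checkpointer is up.
final class CheckpointTimeline {

    // True if the (separate) checkpointer process is alive during this superstep.
    static boolean checkpointerUp(int superstep, int downFrom, int downTo) {
        return superstep < downFrom || superstep > downTo;
    }

    // Latest superstep with a snapshot taken before the bsp task failed,
    // or empty if there is none and the job has to start from the beginning.
    static OptionalInt recoveryPoint(int failedSuperstep, int downFrom, int downTo) {
        for (int s = failedSuperstep - 1; s >= 1; s--) {
            if (checkpointerUp(s, downFrom, downTo)) {
                return OptionalInt.of(s);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        int downFrom = 4; // assumed: checkpointer unavailable from superstep 4
        int downTo = 6;   // assumed: checkpointer back at superstep 7

        // Task fails in superstep 9, after the checkpointer came back.
        System.out.println(recoveryPoint(9, downFrom, downTo)); // OptionalInt[8]

        // Task fails in superstep 6, while the checkpointer is still down.
        System.out.println(recoveryPoint(6, downFrom, downTo)); // OptionalInt[3]
    }
}
```

If the bsp task never fails, `recoveryPoint` is never consulted, which is the "no need to perform recovery" case. If it fails while the checkpointer is down, the job falls back to the last snapshot taken before the outage, so the cost of isolation is some lost supersteps rather than a failed user task.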
