Just some thoughts on why the checkpointer should be implemented as a separate
process. The idea centers on isolation. Because faults will occur, it is
critical that a failure in one part of the system does not adversely affect
the other parts. Also, running user tasks and saving data to HDFS are two
different concerns, so our goal is to ensure that user tasks keep working even
if the checkpointing process fails. As long as the user tasks keep doing their
job, a failed checkpointing process can be ignored.

Four options were considered previously:

1.) The checkpointer runs in the same process as the BSP task.
2.) A separate checkpointing process per BSP task on each machine.
3.) A separate checkpointing process per machine.
4.) Checkpointing processes in the form of a server farm.

The problem with the first option is that if the checkpointer fails, the user
task may fail as well, which is unwanted behaviour for users. The problem with
the fourth is that recovery may affect arbitrary user tasks if both processes
fail. The second and third are similar, except that the second minimizes the
number of user tasks affected if both processes fail.

Running the checkpointer as a separate process has the advantage that if only
the checkpointing process fails, no recovery is necessary. For example, suppose
a BSP job runs from superstep 1 to 10 while a separate checkpointing process
stands by. In the first 3 supersteps, both processes work well. After superstep
4, the checkpointing process fails, but the user task continues doing its work.
At superstep 7, the checkpointer is back (e.g. restarted). If the user task
keeps working until it finishes, there is no need to perform recovery in this
case. If the BSP task fails after the checkpointing process is back, the system
has a chance to recover from the latest snapshot.
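
To make the scenario above concrete, here is a rough sketch in Python. It is
only a toy model, not Hama code: the simulate() function, its parameters and
the choice that the checkpointer is down during supersteps 5 and 6 are my own
assumptions for illustration.

```python
def simulate(supersteps, checkpointer_down, task_fails_at=None):
    """Toy model of a BSP task with a separate checkpointer process.

    supersteps        -- number of supersteps the job runs (1..supersteps)
    checkpointer_down -- set of supersteps during which the checkpointer
                         process is not running (hypothetical input)
    task_fails_at     -- superstep at which the BSP task itself fails, or None

    Returns (outcome, latest_snapshot), where outcome is "finished" or
    "recover" and latest_snapshot is the last superstep that was saved.
    """
    latest_snapshot = None
    for step in range(1, supersteps + 1):
        if task_fails_at == step:
            # Only a failure of the task itself forces a recovery.
            return ("recover", latest_snapshot)
        # The task completed this superstep. A snapshot is taken only if the
        # checkpointer is alive; if it is down, the task simply carries on.
        if step not in checkpointer_down:
            latest_snapshot = step
    return ("finished", latest_snapshot)


# Assumption: "fails after superstep 4, back at superstep 7" means the
# checkpointer misses supersteps 5 and 6.
down = {5, 6}

no_task_failure = simulate(10, down)
late_task_failure = simulate(10, down, task_fails_at=9)
early_task_failure = simulate(10, down, task_fails_at=6)
```

In this model the first call finishes without any recovery even though the
checkpointer was down for a while. The second call recovers from a snapshot
taken after the checkpointer came back, and the third has to fall back to the
snapshot from superstep 4.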

I understand the current implementation is not perfect. But it would be good
if we could work in this direction because, to the best of my knowledge, this
is the recommended approach.

-----Original message-----
From:Thomas Jungblut <[email protected]>
To:[email protected]
Date:Fri, 14 Oct 2011 15:54:10 +0200
Subject:Checkpointer Process

Hi all.
My idea:
Since YARN and multitasking we should consider moving the Checkpointer
process into the BSPPeer itself instead of a single process.

It would be great if we could discuss what would be the real advantage and
disadvantage of integrating it in the same process / a daemon process.

-- 
Thomas Jungblut
Berlin <[email protected]>


--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan
