Nice plan.

On Thu, Mar 15, 2012 at 3:58 AM, Thomas Jungblut
<thomas.jungb...@googlemail.com> wrote:
> Great, we started a nice discussion about it.
>
>> If users do something with communication APIs in a while read.next()
>> loop or in memory, recovery is not simple. So in my opinion, first of
>> all, we should separate (or hide) the data handlers away from the
>> bsp() function, like M/R or Pregel. For example,
>
> Yes, but you know what? We can simply add another layer (a retry layer)
> onto the MessageService instead of inventing fancy new methods and
> classes. Hiding is the key.
>
> Otherwise I totally agree, Chia-Hung. I will take a deeper look into
> incremental checkpointing over the next few days.
>
>> a) If a user's bsp job fails or throws a runtime exception while
>> running, the odds are that if we recover his task state again, the
>> task would fail again, because we would be recovering his error state.
>> Hadoop has a feature where we can tolerate failure of a certain
>> configurable percentage of failed tasks during computation. I think we
>> should be open to the idea of users wanting to be oblivious to certain
>> data.
>
> Not necessarily. There are cases where user code may fail due to
> unlucky timing, e.g. a downed service behind an API. That is not a
> runtime exception like ClassNotFound or the like, which is very
> unlikely to be fixed over several attempts.
> I'm +1 for making it smarter than Hadoop, deciding on the exception
> type whether something can be recovered or not. Apart from that, Hadoop
> is doing it right.
>
>> 2. BSPPeerImpl should have a new flavor of initialization for a
>> recovery task, where:
>> - it has a non-negative superstep to start with
>
> Yes, this must trigger a read of the last checkpointed messages.
>
>> We should provide a Map to the users to store their global state that
>> should be recovered on failure.
>
> This is really, really tricky, and I would put my effort into doing the
> simplest thing first (sorry for the users).
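The retry layer Thomas proposes on top of the MessageService could look roughly like the decorator below. This is a minimal sketch under assumptions: `MessageTransfer` and `RetryLayer` are illustrative names, not Hama's actual message-service API, and a real implementation would distinguish transient from permanent exception types as discussed above.

```java
// Illustrative sketch of a "retry layer" wrapping message transfer.
// MessageTransfer and RetryLayer are hypothetical names, not Hama API.
public class RetryLayer {
    interface MessageTransfer {
        void transfer(String peer, String msg) throws Exception;
    }

    private final MessageTransfer inner;
    private final int maxAttempts;

    RetryLayer(MessageTransfer inner, int maxAttempts) {
        this.inner = inner;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the number of attempts used; rethrows if all attempts fail. */
    int transferWithRetry(String peer, String msg) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                inner.transfer(peer, msg);
                return attempt;   // success on this attempt
            } catch (Exception e) {
                last = e;         // assume transient (e.g. downed service): retry
            }
        }
        throw last;               // attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        // A transfer that fails twice, then succeeds on the third attempt.
        final int[] calls = {0};
        RetryLayer r = new RetryLayer((peer, msg) -> {
            if (++calls[0] < 3) throw new RuntimeException("downed service");
        }, 5);
        System.out.println(r.transferWithRetry("peer1", "hello")); // prints 3
    }
}
```

The point of the decorator shape is exactly the "hiding" Thomas mentions: user code never sees the retries, so no new methods or classes leak into the BSP API.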
> We can add recovery for user internals later on.
> In my opinion, internal state is not how BSP should be coded:
> everything can be stateless, really!
>
> I know it is difficult the first time, but I have observed that the
> less state you store, the simpler the code gets, and that's what
> frameworks are for.
> So let's add this bad-style edge case in one of the upcoming releases
> if there is really demand for it.
>
> I'll take a bit of time over the weekend to write the JIRA issues for
> the things we discussed here.
>
> I think we can really start implementing it right away; the simplest
> case is very straightforward and we can improve later on.
>
> On 14 March 2012 at 19:20, Suraj Menon <menonsur...@gmail.com> wrote:
>
>> Hello,
>>
>> +1 on Chia-Hung's comment on getting things right up to the beginning
>> of a superstep on failure.
>> I think we are all over the place (or at least I am) in our thinking
>> on fault tolerance. I want to take a step backward.
>> First we should decide the nature of the faults we have to recover
>> from. I am putting these in points below:
>>
>> 1. Hardware failure. - In my opinion, this is the most important
>> failure cause we should be working on.
>> Why? - Most other kinds of faults would happen because of errors in
>> user program logic or Hama framework bugs, if any. I will get to these
>> scenarios in the next point.
>> Picking up from Chia-Hung's mail, say we have 15 bsp tasks running in
>> parallel in the 12th superstep. Of these, the nodes running tasks 5,
>> 6, and 7 failed during execution. We would have to roll back those bsp
>> tasks from the checkpointed data of superstep 11, based on the task
>> id, attempt id, and job number. This would mean that the BSPChild
>> executor in GroomServer has to be told that the BSP peer should be
>> started in recovery mode and not assume superstep -1. It should then
>> load the input message queue with the checkpointed messages.
>> It would be provided with the partition that holds the checkpointed
>> data for the task id. All other tasks would be waiting at the sync
>> barrier until the recovered tasks 5, 6, and 7 enter the sync of the
>> 12th superstep to continue.
>>
>> 2. Software failure.
>>
>> a) If a user's bsp job fails or throws a runtime exception while
>> running, the odds are that if we recover his task state again, the
>> task would fail again, because we would be recovering his error state.
>> Hadoop has a feature where we can tolerate failure of a certain
>> configurable percentage of failed tasks during computation. I think we
>> should be open to the idea of users wanting to be oblivious to certain
>> data.
>>
>> b) If we have a JVM error on the GroomServer or in the BSPPeer
>> process, we can retry the execution of tasks as in the case of
>> hardware failure.
>>
>> I agree with Chia-Hung on focusing on the recovery logic and how to
>> start it. With the current design, it could be handled with the
>> following changes:
>>
>> 1. GroomServer should get a new directive for a recovery task
>> (different from a new task).
>> 2. BSPPeerImpl should have a new flavor of initialization for a
>> recovery task, where:
>> - it has a non-negative superstep to start with
>> - the partition and input file come from the checkpointed data
>> - the messages are read from the partition and the input queue is
>> filled with the messages if superstep > 0
>>
>> We should provide a Map to the users to store their global state that
>> should be recovered on failure.
>> I will get some time from tomorrow evening. I shall get things more
>> organized.
>>
>> Thanks,
>> Suraj
>>
>> On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <cli...@googlemail.com>
>> wrote:
>>>
>>> It would be simpler for the first version of fault tolerance to focus
>>> on rolling back to the state at the beginning of a superstep. For
>>> example, suppose a job executes to the middle of the 12th superstep
>>> and then a task fails.
>>> The system should restart that task (assuming a checkpoint on every
>>> superstep, and that fortunately the system has the latest snapshot
>>> for that task, e.g. of the 11th superstep) and put the necessary
>>> messages (checkpointed at the 11th superstep and to be transferred to
>>> the 12th superstep) back to the task, so that the failed task can
>>> restart from the 12th superstep.
>>>
>>> If later on the community wants checkpointing at a finer level of
>>> detail within a task, we can probably exploit something like
>>> incremental checkpointing [1] or the Memento pattern, which requires
>>> users' assistance, to restart a task from a local checkpoint.
>>>
>>> [1] Efficient Incremental Checkpointing of Java Programs.
>>> hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
>>>
>>> On 14 March 2012 15:29, Edward J. Yoon <edwardy...@apache.org> wrote:
>>>> If users do something with communication APIs in a while read.next()
>>>> loop or in memory, recovery is not simple. So in my opinion, first
>>>> of all, we should separate (or hide) the data handlers away from the
>>>> bsp() function, like M/R or Pregel. For example,
>>>>
>>>> ftbsp(Communicator comm);
>>>> setup(DataInput in);
>>>> close(DataOutput out);
>>>>
>>>> And then, maybe we can design the flow of checkpoint-based (task
>>>> failure) recovery like this:
>>>>
>>>> 1. If some task fails to execute the setup() or close() functions,
>>>> just re-attempt; finally return a "job failed" message.
>>>>
>>>> 2. If some task fails in the middle of processing and has to be
>>>> re-launched, the statuses of JobInProgress and TaskInProgress should
>>>> be changed.
>>>>
>>>> 3. In every step, all tasks should check whether to roll back to an
>>>> earlier checkpoint or keep running (or keep waiting to leave the
>>>> barrier).
>>>>
>>>> 4. Re-launch a failed task.
>>>>
>>>> 5. Change the status from ROLLBACK or RECOVERY to RUNNING.
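Edward's five-step flow can be read as a small task state machine. The sketch below is one illustrative reading of it: the state names mirror the ones he mentions (RUNNING, ROLLBACK, RECOVERY), but the enum and the transition methods are hypothetical, not Hama's actual TaskInProgress code.

```java
// Illustrative state machine for the recovery flow sketched above.
// State and method names are assumptions, not Hama's real classes.
public class TaskStates {
    enum State { SETUP, RUNNING, ROLLBACK, RECOVERY, FAILED, FINISHED }

    /** Next state after a failure is observed in the given state. */
    static State onFailure(State s, int attempt, int maxAttempts) {
        if (s == State.SETUP) {
            // Step 1: setup()/close() failures are simply re-attempted;
            // when attempts run out, the job is declared failed.
            return attempt < maxAttempts ? State.SETUP : State.FAILED;
        }
        // Step 2: a task failing mid-processing is re-launched in
        // recovery mode (JobInProgress/TaskInProgress statuses change).
        return State.RECOVERY;
    }

    /** Step 5: after a successful re-launch, go back to RUNNING. */
    static State onRecovered(State s) {
        return (s == State.RECOVERY || s == State.ROLLBACK)
                ? State.RUNNING : s;
    }

    public static void main(String[] args) {
        System.out.println(onFailure(State.RUNNING, 1, 3)); // prints RECOVERY
        System.out.println(onRecovered(State.RECOVERY));    // prints RUNNING
    }
}
```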
>>>>
>>>> On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
>>>> <thomas.jungb...@googlemail.com> wrote:
>>>>> Ah yes, good points.
>>>>>
>>>>> If we don't have a checkpoint from the current superstep, we have
>>>>> to do a global rollback to the last known messages.
>>>>> So we shouldn't offer this configurability through the BSPJob API;
>>>>> this is for specialized users only.
>>>>>
>>>>>> One more issue that I have in mind is how we would be able to
>>>>>> recover the values of static variables that someone would be
>>>>>> holding in each bsp job. This scenario is a problem if a user is
>>>>>> maintaining some static variable state whose lifecycle spans
>>>>>> multiple supersteps.
>>>>>
>>>>> Ideally you would transfer your shared state through the messages.
>>>>> I thought of making a backup function available in the BSP class
>>>>> where someone can back up their internal state, but I guess this is
>>>>> not how BSP should be written.
>>>>>
>>>>> Which does not mean that we don't want to provide this in future
>>>>> releases.
>>>>>
>>>>> On 12 March 2012 at 09:01, Suraj Menon <menonsur...@gmail.com>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I want to understand single-task rollback. So consider a scenario
>>>>>> where all tasks checkpoint every 5 supersteps. Now when one of the
>>>>>> tasks fails at superstep 7, it would have to recover from the
>>>>>> checkpointed data at superstep 5. How would it get the messages
>>>>>> from the peer BSPs at supersteps 6 and 7?
>>>>>>
>>>>>> One more issue that I have in mind is how we would be able to
>>>>>> recover the values of static variables that someone would be
>>>>>> holding in each bsp job. This scenario is a problem if a user is
>>>>>> maintaining some static variable state whose lifecycle spans
>>>>>> multiple supersteps.
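Suraj's scenario can be made concrete with a little arithmetic: if tasks checkpoint every `interval` supersteps and one fails at superstep `failed`, it must roll back to the last completed checkpoint, and the supersteps in between are exactly the ones whose peer messages are missing. The helper below is purely illustrative (not Hama code) and assumes Suraj's model, where checkpoints are taken after supersteps `interval`, `2*interval`, and so on.

```java
// Illustrative rollback arithmetic for interval-based checkpointing.
public class RollbackMath {
    /** Last superstep at which a completed checkpoint exists. */
    static int lastCheckpoint(int failedSuperstep, int interval) {
        // Checkpoints happen after supersteps interval, 2*interval, ...
        return ((failedSuperstep - 1) / interval) * interval;
    }

    /** Supersteps whose peer messages were never checkpointed. */
    static int lostSupersteps(int failedSuperstep, int interval) {
        return failedSuperstep - lastCheckpoint(failedSuperstep, interval);
    }

    public static void main(String[] args) {
        // Suraj's example: checkpoint every 5 supersteps, failure at 7.
        System.out.println(lastCheckpoint(7, 5));  // prints 5
        System.out.println(lostSupersteps(7, 5));  // prints 2 (supersteps 6 and 7)
    }
}
```

The two lost supersteps are why the thread leans toward either checkpointing every superstep or doing a global rollback when a checkpoint gap exists.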
>>>>>>
>>>>>> Thanks,
>>>>>> Suraj
>>>>>>
>>>>>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut
>>>>>> <thomas.jungb...@googlemail.com> wrote:
>>>>>>> I guess we have to slice out some issues needed for checkpoint
>>>>>>> recovery.
>>>>>>>
>>>>>>> In my opinion we have two types of recovery:
>>>>>>> - single-task recovery
>>>>>>> - global recovery of all tasks
>>>>>>>
>>>>>>> And I guess we can simply make a rule:
>>>>>>> if a task fails inside our barrier sync method (since we have a
>>>>>>> double barrier: after enterBarrier() and before leaveBarrier()),
>>>>>>> we have to do a global recovery; otherwise we can just do a
>>>>>>> single-task rollback.
>>>>>>>
>>>>>>> For those asking why we can't always just do a global rollback:
>>>>>>> it is too costly, and we really do not need it in every case.
>>>>>>> But we do need it in the case where a task fails inside the
>>>>>>> barrier (between enter and leave), simply because a single
>>>>>>> rolled-back task can't trip the enterBarrier barrier.
>>>>>>>
>>>>>>> Anything I have forgotten?
>>>>>>>
>>>>>>> --
>>>>>>> Thomas Jungblut
>>>>>>> Berlin <thomas.jungb...@gmail.com>
>>>>>
>>>>> --
>>>>> Thomas Jungblut
>>>>> Berlin <thomas.jungb...@gmail.com>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>
> --
> Thomas Jungblut
> Berlin <thomas.jungb...@gmail.com>
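Thomas's single-vs-global rule from the Mar 10 mail can be written down directly. The sketch below is illustrative only; the enum and `decide` method are assumed names, not Hama's actual scheduler code.

```java
// Illustrative encoding of the recovery rule: failure between
// enterBarrier() and leaveBarrier() forces a global recovery,
// any other failure needs only a single-task rollback.
public class RecoveryRule {
    enum Phase { COMPUTE, IN_BARRIER }  // IN_BARRIER = between enter and leave
    enum Recovery { SINGLE_TASK, GLOBAL }

    static Recovery decide(Phase failedIn) {
        // A single rolled-back task can no longer trip the enter-barrier,
        // so a barrier-time failure drags every task into the rollback.
        return failedIn == Phase.IN_BARRIER ? Recovery.GLOBAL
                                            : Recovery.SINGLE_TASK;
    }

    public static void main(String[] args) {
        System.out.println(decide(Phase.IN_BARRIER)); // prints GLOBAL
        System.out.println(decide(Phase.COMPUTE));    // prints SINGLE_TASK
    }
}
```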
--
Best Regards, Edward J. Yoon
@eddieyoon