Nice plan.

On Thu, Mar 15, 2012 at 3:58 AM, Thomas Jungblut
<thomas.jungb...@googlemail.com> wrote:
> Great, we started a nice discussion about it.
>
>> If users do something with communication APIs in a while read.next()
>> loop or in memory, recovery is not simple. So in my opinion, first of
>> all, we should separate (or hide) the data handlers away from the
>> bsp() function, like M/R or Pregel. For example,
>
> Yes, but you know what? We can simply add another layer (a retry layer)
> onto the MessageService instead of inventing fancy new methods and
> classes. Hiding is the key.
>
> Otherwise I totally agree, Chia-Hung. I will take a deeper look into
> incremental checkpointing over the next few days.
>
>> a) If a user's bsp job fails or throws a runtime exception while
>> running, the odds are that if we recover his task state again, the
>> task would fail again, because we would be recovering his error state.
>> Hadoop has a feature where we can tolerate failure of a certain
>> configurable percentage of failed tasks during computation. I think we
>> should be open to the idea of users wanting to be oblivious to certain
>> data.
>
> Not necessarily. There are cases where user code may fail due to
> unlucky timing, e.g. a downed service behind an API. That is not a
> runtime exception like ClassNotFound or the like, which is very
> unlikely to be fixed over several attempts.
> I'm +1 for making it smarter than Hadoop, deciding on the exception
> type whether something can be recovered or not. Apart from that, Hadoop
> is doing it right.
>
>> 2. BSPPeerImpl should have a new flavor of initialization for a
>> recovery task, where:
>> - it has a non-negative superstep to start with
>
> Yes, this must trigger a read of the last checkpointed messages.
>
>> We should provide a Map to the users to store their global state that
>> should be recovered on failure.
>
> This is really, really tricky, and I would put my effort into doing the
> simplest thing first (sorry for the users).
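The retry layer Thomas proposes on top of the MessageService could look roughly like the decorator below. This is a minimal sketch under assumptions: `MessageTransfer` and `RetryLayer` are illustrative names, not Hama's actual message-service API, and a real implementation would distinguish transient from permanent exception types as discussed above.

```java
// Illustrative sketch of a "retry layer" wrapping message transfer.
// MessageTransfer and RetryLayer are hypothetical names, not Hama API.
public class RetryLayer {
    interface MessageTransfer {
        void transfer(String peer, String msg) throws Exception;
    }

    private final MessageTransfer inner;
    private final int maxAttempts;

    RetryLayer(MessageTransfer inner, int maxAttempts) {
        this.inner = inner;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the number of attempts used; rethrows if all attempts fail. */
    int transferWithRetry(String peer, String msg) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                inner.transfer(peer, msg);
                return attempt;   // success on this attempt
            } catch (Exception e) {
                last = e;         // assume transient (e.g. downed service): retry
            }
        }
        throw last;               // attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        // A transfer that fails twice, then succeeds on the third attempt.
        final int[] calls = {0};
        RetryLayer r = new RetryLayer((peer, msg) -> {
            if (++calls[0] < 3) throw new RuntimeException("downed service");
        }, 5);
        System.out.println(r.transferWithRetry("peer1", "hello")); // prints 3
    }
}
```

The point of the decorator shape is exactly the "hiding" Thomas mentions: user code never sees the retries, so no new methods or classes leak into the BSP API.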
> We can add recovery for user internals later on.
> In my opinion, internal state is not how BSP should be coded:
> everything can be stateless, really!
>
> I know it is difficult the first time, but I have observed that the
> less state you store, the simpler the code gets, and that's what
> frameworks are for.
> So let's add this bad-style edge case in one of the upcoming releases
> if there is really demand for it.
>
> I'll take a bit of time over the weekend to write the JIRA issues for
> the things we discussed here.
>
> I think we can really start implementing it right away; the simplest
> case is very straightforward and we can improve later on.
>
> On 14 March 2012 at 19:20, Suraj Menon <menonsur...@gmail.com> wrote:
>
>> Hello,
>>
>> +1 on Chia-Hung's comment on getting things right up to the beginning
>> of a superstep on failure.
>> I think we are all over the place (or at least I am) in our thinking
>> on fault tolerance. I want to take a step backward.
>> First we should decide the nature of the faults we have to recover
>> from. I am putting these in points below:
>>
>> 1. Hardware failure. - In my opinion, this is the most important
>> failure cause we should be working on.
>> Why? - Most other kinds of faults would happen because of errors in
>> user program logic or Hama framework bugs, if any. I will get to these
>> scenarios in the next point.
>> Picking up from Chia-Hung's mail, say we have 15 bsp tasks running in
>> parallel in the 12th superstep. Of these, the nodes running tasks 5,
>> 6, and 7 failed during execution. We would have to roll back those bsp
>> tasks from the checkpointed data of superstep 11, based on the task
>> id, attempt id, and job number. This would mean that the BSPChild
>> executor in GroomServer has to be told that the BSP peer should be
>> started in recovery mode and not assume superstep -1. It should then
>> load the input message queue with the checkpointed messages.
>> It would be provided with the partition that holds the checkpointed
>> data for the task id. All other tasks would be waiting at the sync
>> barrier until the recovered tasks 5, 6, and 7 enter the sync of the
>> 12th superstep to continue.
>>
>> 2. Software failure.
>>
>> a) If a user's bsp job fails or throws a runtime exception while
>> running, the odds are that if we recover his task state again, the
>> task would fail again, because we would be recovering his error state.
>> Hadoop has a feature where we can tolerate failure of a certain
>> configurable percentage of failed tasks during computation. I think we
>> should be open to the idea of users wanting to be oblivious to certain
>> data.
>>
>> b) If we have a JVM error on the GroomServer or in the BSPPeer
>> process, we can retry the execution of tasks as in the case of
>> hardware failure.
>>
>> I agree with Chia-Hung on focusing on the recovery logic and how to
>> start it. With the current design, it could be handled with the
>> following changes:
>>
>> 1. GroomServer should get a new directive for a recovery task
>> (different from a new task).
>> 2. BSPPeerImpl should have a new flavor of initialization for a
>> recovery task, where:
>> - it has a non-negative superstep to start with
>> - the partition and input file come from the checkpointed data
>> - the messages are read from the partition and the input queue is
>> filled with the messages if superstep > 0
>>
>> We should provide a Map to the users to store their global state that
>> should be recovered on failure.
>> I will get some time from tomorrow evening. I shall get things more
>> organized.
>>
>> Thanks,
>> Suraj
>>
>> On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <cli...@googlemail.com>
>> wrote:
>>>
>>> It would be simpler for the first version of fault tolerance to focus
>>> on rolling back to the state at the beginning of a superstep. For
>>> example, suppose a job executes to the middle of the 12th superstep
>>> and then a task fails.
>>> The system should restart that task (assuming a checkpoint on every
>>> superstep, and that fortunately the system has the latest snapshot
>>> for that task, e.g. of the 11th superstep) and put the necessary
>>> messages (checkpointed at the 11th superstep and to be transferred to
>>> the 12th superstep) back to the task, so that the failed task can
>>> restart from the 12th superstep.
>>>
>>> If later on the community wants checkpointing at a finer level of
>>> detail within a task, we can probably exploit something like
>>> incremental checkpointing [1] or the Memento pattern, which requires
>>> users' assistance, to restart a task from a local checkpoint.
>>>
>>> [1] Efficient Incremental Checkpointing of Java Programs.
>>> hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
>>>
>>> On 14 March 2012 15:29, Edward J. Yoon <edwardy...@apache.org> wrote:
>>>> If users do something with communication APIs in a while read.next()
>>>> loop or in memory, recovery is not simple. So in my opinion, first
>>>> of all, we should separate (or hide) the data handlers away from the
>>>> bsp() function, like M/R or Pregel. For example,
>>>>
>>>> ftbsp(Communicator comm);
>>>> setup(DataInput in);
>>>> close(DataOutput out);
>>>>
>>>> And then, maybe we can design the flow of checkpoint-based (task
>>>> failure) recovery like this:
>>>>
>>>> 1. If some task fails to execute the setup() or close() functions,
>>>> just re-attempt; finally return a "job failed" message.
>>>>
>>>> 2. If some task fails in the middle of processing and has to be
>>>> re-launched, the statuses of JobInProgress and TaskInProgress should
>>>> be changed.
>>>>
>>>> 3. In every step, all tasks should check whether to roll back to an
>>>> earlier checkpoint or keep running (or keep waiting to leave the
>>>> barrier).
>>>>
>>>> 4. Re-launch a failed task.
>>>>
>>>> 5. Change the status from ROLLBACK or RECOVERY to RUNNING.
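Edward's five-step flow can be read as a small task state machine. The sketch below is one illustrative reading of it: the state names mirror the ones he mentions (RUNNING, ROLLBACK, RECOVERY), but the enum and the transition methods are hypothetical, not Hama's actual TaskInProgress code.

```java
// Illustrative state machine for the recovery flow sketched above.
// State and method names are assumptions, not Hama's real classes.
public class TaskStates {
    enum State { SETUP, RUNNING, ROLLBACK, RECOVERY, FAILED, FINISHED }

    /** Next state after a failure is observed in the given state. */
    static State onFailure(State s, int attempt, int maxAttempts) {
        if (s == State.SETUP) {
            // Step 1: setup()/close() failures are simply re-attempted;
            // when attempts run out, the job is declared failed.
            return attempt < maxAttempts ? State.SETUP : State.FAILED;
        }
        // Step 2: a task failing mid-processing is re-launched in
        // recovery mode (JobInProgress/TaskInProgress statuses change).
        return State.RECOVERY;
    }

    /** Step 5: after a successful re-launch, go back to RUNNING. */
    static State onRecovered(State s) {
        return (s == State.RECOVERY || s == State.ROLLBACK)
                ? State.RUNNING : s;
    }

    public static void main(String[] args) {
        System.out.println(onFailure(State.RUNNING, 1, 3)); // prints RECOVERY
        System.out.println(onRecovered(State.RECOVERY));    // prints RUNNING
    }
}
```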
>>>>
>>>> On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
>>>> <thomas.jungb...@googlemail.com> wrote:
>>>>> Ah yes, good points.
>>>>>
>>>>> If we don't have a checkpoint from the current superstep, we have
>>>>> to do a global rollback to the last known messages.
>>>>> So we shouldn't offer this configurability through the BSPJob API;
>>>>> this is for specialized users only.
>>>>>
>>>>>> One more issue that I have in mind is how we would be able to
>>>>>> recover the values of static variables that someone would be
>>>>>> holding in each bsp job. This scenario is a problem if a user is
>>>>>> maintaining some static variable state whose lifecycle spans
>>>>>> multiple supersteps.
>>>>>
>>>>> Ideally you would transfer your shared state through the messages.
>>>>> I thought of making a backup function available in the BSP class
>>>>> where someone can back up their internal state, but I guess this is
>>>>> not how BSP should be written.
>>>>>
>>>>> Which does not mean that we don't want to provide this in future
>>>>> releases.
>>>>>
>>>>> On 12 March 2012 at 09:01, Suraj Menon <menonsur...@gmail.com>
>>>>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I want to understand single-task rollback. So consider a scenario
>>>>>> where all tasks checkpoint every 5 supersteps. Now when one of the
>>>>>> tasks fails at superstep 7, it would have to recover from the
>>>>>> checkpointed data at superstep 5. How would it get the messages
>>>>>> from the peer BSPs at supersteps 6 and 7?
>>>>>>
>>>>>> One more issue that I have in mind is how we would be able to
>>>>>> recover the values of static variables that someone would be
>>>>>> holding in each bsp job. This scenario is a problem if a user is
>>>>>> maintaining some static variable state whose lifecycle spans
>>>>>> multiple supersteps.
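Suraj's scenario can be made concrete with a little arithmetic: if tasks checkpoint every `interval` supersteps and one fails at superstep `failed`, it must roll back to the last completed checkpoint, and the supersteps in between are exactly the ones whose peer messages are missing. The helper below is purely illustrative (not Hama code) and assumes Suraj's model, where checkpoints are taken after supersteps `interval`, `2*interval`, and so on.

```java
// Illustrative rollback arithmetic for interval-based checkpointing.
public class RollbackMath {
    /** Last superstep at which a completed checkpoint exists. */
    static int lastCheckpoint(int failedSuperstep, int interval) {
        // Checkpoints happen after supersteps interval, 2*interval, ...
        return ((failedSuperstep - 1) / interval) * interval;
    }

    /** Supersteps whose peer messages were never checkpointed. */
    static int lostSupersteps(int failedSuperstep, int interval) {
        return failedSuperstep - lastCheckpoint(failedSuperstep, interval);
    }

    public static void main(String[] args) {
        // Suraj's example: checkpoint every 5 supersteps, failure at 7.
        System.out.println(lastCheckpoint(7, 5));  // prints 5
        System.out.println(lostSupersteps(7, 5));  // prints 2 (supersteps 6 and 7)
    }
}
```

The two lost supersteps are why the thread leans toward either checkpointing every superstep or doing a global rollback when a checkpoint gap exists.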
>>>>>>
>>>>>> Thanks,
>>>>>> Suraj
>>>>>>
>>>>>> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut
>>>>>> <thomas.jungb...@googlemail.com> wrote:
>>>>>>> I guess we have to slice out some issues needed for checkpoint
>>>>>>> recovery.
>>>>>>>
>>>>>>> In my opinion we have two types of recovery:
>>>>>>> - single-task recovery
>>>>>>> - global recovery of all tasks
>>>>>>>
>>>>>>> And I guess we can simply make a rule:
>>>>>>> if a task fails inside our barrier sync method (since we have a
>>>>>>> double barrier: after enterBarrier() and before leaveBarrier()),
>>>>>>> we have to do a global recovery; otherwise we can just do a
>>>>>>> single-task rollback.
>>>>>>>
>>>>>>> For those asking why we can't always just do a global rollback:
>>>>>>> it is too costly, and we really do not need it in every case.
>>>>>>> But we do need it in the case where a task fails inside the
>>>>>>> barrier (between enter and leave), simply because a single
>>>>>>> rolled-back task can't trip the enterBarrier barrier.
>>>>>>>
>>>>>>> Anything I have forgotten?
>>>>>>>
>>>>>>> --
>>>>>>> Thomas Jungblut
>>>>>>> Berlin <thomas.jungb...@gmail.com>
>>>>>
>>>>> --
>>>>> Thomas Jungblut
>>>>> Berlin <thomas.jungb...@gmail.com>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>
> --
> Thomas Jungblut
> Berlin <thomas.jungb...@gmail.com>
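Thomas's single-vs-global rule from the Mar 10 mail can be written down directly. The sketch below is illustrative only; the enum and `decide` method are assumed names, not Hama's actual scheduler code.

```java
// Illustrative encoding of the recovery rule: failure between
// enterBarrier() and leaveBarrier() forces a global recovery,
// any other failure needs only a single-task rollback.
public class RecoveryRule {
    enum Phase { COMPUTE, IN_BARRIER }  // IN_BARRIER = between enter and leave
    enum Recovery { SINGLE_TASK, GLOBAL }

    static Recovery decide(Phase failedIn) {
        // A single rolled-back task can no longer trip the enter-barrier,
        // so a barrier-time failure drags every task into the rollback.
        return failedIn == Phase.IN_BARRIER ? Recovery.GLOBAL
                                            : Recovery.SINGLE_TASK;
    }

    public static void main(String[] args) {
        System.out.println(decide(Phase.IN_BARRIER)); // prints GLOBAL
        System.out.println(decide(Phase.COMPUTE));    // prints SINGLE_TASK
    }
}
```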
--
Best Regards, Edward J. Yoon
@eddieyoon