Currently we have bsp() where users can code for performing thier tasks. For
instance,
... bsp() ...{
... // some computation
sync();
... // some other computation
sync();
...
}
However, this is difficult for recovery because 1st, it requires checkpointed
messages to be recovered so that the computation can be resumed from where it
fails; 2nd, the recovery procedure needs to know from which super step to
restart. With the current bsp(), it seems a common choice is preprocessing; but
this may not be good because when internally something goes wrong it, it is not
easy to find out the problem.
I come up with an alternative method but this would have change to the way of
our current procedure. So I think it would be good to discuss it first. It is
proposed as below:
1. we divide bsp() into smaller computation unit called e.g. step() or
superstep(), within which user still write their own logic.
2. in main, user composes the order of supersteps.
... class Superstep1 extends BSPSuperstep {
... superstep() ... {...}
}
... class Superstep2 extends BSPSuperstep {
... superstep() ... {...}
}
BSPJob bsp = new BSP(...);
bsp.compose(Superstep1.class).compose(Superstep2.class)...;
Therefore, when recovery, in BSPTask run() we can have
List<BSPSuperstep> steps = BSPJob.supersteps();
for(BSPSuperstep step: steps) {
if(checkpointed) {
// restore checkpointed messages e.g. adding checkpointed msg (in hdfs)
back to queues
}
step.superstep(...);
step.sync();
}
The advantage is easier for recovery procedure.
The disadvantage may be the client programme need to explicitly tell the order
of superstep.
Any thought?
--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan