Currently we have bsp() where users can code for performing thier tasks. For 
instance, 

... bsp() ...{
   ... // some computation
   sync();
   ... // some other computation
   sync();
   ...   
}

However, this is difficult for recovery because 1st, it requires checkpointed 
messages to be recovered so that the computation can be resumed from where it 
fails; 2nd, the recovery procedure needs to know from which super step to 
restart. With the current bsp(), it seems a common choice is preprocessing; but 
this may not be good because when internally something goes wrong it, it is not 
easy to find out the problem. 

I come up with an alternative method but this would have change to the way of 
our current procedure. So I think it would be good to discuss it first. It is 
proposed as below:

1. we divide bsp() into smaller computation unit called e.g. step() or 
superstep(), within which user still write their own logic. 

2. in main, user composes the order of supersteps. 

... class Superstep1 extends BSPSuperstep {
   ... superstep() ... {...}
}
... class Superstep2 extends BSPSuperstep {
   ... superstep() ... {...}
}

BSPJob bsp = new BSP(...);
bsp.compose(Superstep1.class).compose(Superstep2.class)...;

Therefore, when recovery, in BSPTask run() we can have 

List<BSPSuperstep> steps = BSPJob.supersteps();

for(BSPSuperstep step: steps) {
   if(checkpointed) { 
     // restore checkpointed messages e.g. adding checkpointed msg (in hdfs) 
back to queues
   }
   step.superstep(...);
   step.sync();
}

The advantage is easier for recovery procedure.
The disadvantage may be the client programme need to explicitly tell the order 
of superstep.  

Any thought?

--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan

Reply via email to