Hi guys,  

In some instances, we want to do some kind of iterative processing in Crunch, 
and run the same (or a similar) DoFn on the same PCollection multiple times.

For example, let's say we've got a PCollection of "grid" objects, and we want 
to iteratively divide each of these grids into four sub-grids, leading to 
exponential growth of the data. The naive way to do this would be to do the 
following:

PCollection<Grid> grids = …;
for (…){
   grids = grids.parallelDo(new SubdivideFn());
}

However, the above code would be optimized into a single string of DoFns, and 
not increasing the number of mappers we've got per iteration, which of course 
wouldn't work well with the exponential growth of data.

The current way of getting around this is to add a call to 
materialize().iterator() on the PCollection in each iteration (this is also 
done in the PageRankIT integration test).

What I propose is adding a "checkpoint" method to PCollection to signify that 
this should be an actual step in processing. This could work as follows:

PCollection<Grid> grids = …;
for (…){
   grids = grids.parallelDo(new SubdivideFn()).checkpoint();
}


In the short term this could even be implemented as just a call to 
materialize().iterator(), but putting encapsulating it in a method like this 
would allow us to work more efficiently with it in the future, especially once 
CRUNCH-34 is merged.

Any thoughts on this? The actual name of the method is my biggest concern, I'm 
not sure if "checkpoint" is the best name for it, but I can't think of anything 
better at the moment.

- Gabriel  

Reply via email to