Hi guys,
In some instances, we want to do some kind of iterative processing in Crunch,
and run the same (or a similar) DoFn on the same PCollection multiple times.
For example, let's say we've got a PCollection of "grid" objects, and we want
to iteratively divide each of these grids into four sub-grids, leading to
exponential growth of the data. The naive way to do this would be to do the
following:
PCollection<Grid> grids = …;
for (…){
grids = grids.parallelDo(new SubdivideFn());
}
However, the optimizer would fuse the above code into a single chain of DoFns,
so the number of mappers would not increase from one iteration to the next,
which of course wouldn't work well with the exponential growth of the data.
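To put rough numbers on that growth, here is a small back-of-the-envelope
sketch in plain Java. It isn't Crunch code, and the class and method names
(GridGrowth, gridsAfter, gridsPerMapper) are made up for illustration. It only
shows how the load per mapper grows if the mapper count stays fixed while each
iteration quadruples the number of grids.

```java
/** Illustrative arithmetic only; not part of Crunch. */
final class GridGrowth {
    private GridGrowth() {}

    /** Number of grids after each grid has been split in four `iterations` times. */
    static long gridsAfter(long initialGrids, int iterations) {
        long grids = initialGrids;
        for (int i = 0; i < iterations; i++) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            grids = Math.multiplyExact(grids, 4L);
        }
        return grids;
    }

    /** Grids each mapper ends up producing if the mapper count never changes (rounded up). */
    static long gridsPerMapper(long initialGrids, int iterations, int mappers) {
        long grids = gridsAfter(initialGrids, iterations);
        return (grids + mappers - 1) / mappers;
    }
}
```

With 1000 starting grids and 10 mappers, for example, five iterations leave
each mapper responsible for 102400 grids instead of the original 100.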
The current way of getting around this is to add a call to
materialize().iterator() on the PCollection in each iteration (this is also
done in the PageRankIT integration test).
What I propose is adding a "checkpoint" method to PCollection to signify that
this should be an actual step in processing. This could work as follows:
PCollection<Grid> grids = …;
for (…){
grids = grids.parallelDo(new SubdivideFn()).checkpoint();
}
In the short term this could even be implemented as just a call to
materialize().iterator(), but encapsulating it in a method like this would
allow us to handle it more efficiently in the future, especially once
CRUNCH-34 is merged.
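To make the intended semantics concrete, here is a toy model in plain Java.
It is not the Crunch API: ToyCollection and its methods are hypothetical
stand-ins I made up for this mail. Chained parallelDo calls are fused and run
as a single "job", while checkpoint() forces the pending chain to run, so the
next stage starts from the already-grown data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy stand-in for a lazily planned collection; NOT the Crunch API. */
final class ToyCollection<T> {
    /** Input size of every "job" that has run so far, in execution order. */
    static final List<Integer> JOB_INPUT_SIZES = new ArrayList<>();

    private final List<T> data;
    private final List<Function<T, List<T>>> pending;

    private ToyCollection(List<T> data, List<Function<T, List<T>>> pending) {
        this.data = data;
        this.pending = pending;
    }

    static <T> ToyCollection<T> of(List<T> data) {
        return new ToyCollection<>(new ArrayList<>(data), new ArrayList<>());
    }

    /** Lazily appends fn to the pending chain; nothing is executed here. */
    ToyCollection<T> parallelDo(Function<T, List<T>> fn) {
        List<Function<T, List<T>>> chain = new ArrayList<>(pending);
        chain.add(fn);
        return new ToyCollection<>(data, chain);
    }

    /** Runs the pending chain now, so later stages start from its output. */
    ToyCollection<T> checkpoint() {
        return new ToyCollection<>(run(), new ArrayList<>());
    }

    /** Runs whatever is still pending and returns the resulting elements. */
    List<T> materialize() {
        return run();
    }

    /** Executes all pending functions as one fused job over the current data. */
    private List<T> run() {
        if (pending.isEmpty()) {
            return data; // nothing to do, so no job is recorded
        }
        JOB_INPUT_SIZES.add(data.size());
        List<T> current = data;
        for (Function<T, List<T>> fn : pending) {
            List<T> next = new ArrayList<>();
            for (T item : current) {
                next.addAll(fn.apply(item));
            }
            current = next;
        }
        return current;
    }
}
```

Starting from a single element and applying a "split in four" function three
times, the fused version records one job with input size 1, while the
checkpointed version records three jobs with input sizes 1, 4 and 16. Both end
up with 64 elements. In the real thing, each of those jobs would be a point
where the number of mappers could follow the input size.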
Any thoughts on this? The actual name of the method is my biggest concern: I'm
not sure "checkpoint" is the best name for it, but I can't think of anything
better at the moment.
- Gabriel