I certainly understand the issue; do you prefer the two-function solution to one in which we added a method in DoFn to indicate which phase of the MR job a particular DoFn was being executed in? We might have options like MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a solution would work for all cases, so someone please call me out on that if there's something I'm missing.)
J On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit <[email protected]>wrote: > Hi, > > I encountered a situation where I need different behaviour of my CombineFn > during combine & reduce phase. > Basically, I have a collection of avro records that I need to combine. > For some of these, I have so many records with same key that I need to > combine them first to make my job work (memory & timing constraints) > For others, I can't combine them, because I need all records together. > So, basically I would want to know in my function if it's combining or > reducing. > The only way to solve my problem in crunch right now seems to be to first > split my collection in 2 different collections, combine them separately & > union them again. > But this give a lot of overhead for something that would be supported by > native M/R. > > I looked in the code and it seems that crunch internally has a NodeContext > object to indicate COMBINE or REDUCE, but this context is not accessible in > the DoFn. > As the (RT)Node object is an internal crunch object, it's also not a clean > solution to expose the NodeContext. > So, as a better solution, it would be possible to create a new method: > combineValues(combineFn, reduceFn) on PGroupedTable. The existing > combineValues(combineFn) is in that case just a convenience method for must > use cases, where the combineFn & reduceFn is the same function. > With this new method, I would be able to just create my combineFn twice & > pass a boolean in the constructor to indicate if it's combine or reduce. > > I already made a patch to add this function, but as the procedure > indicates to discuss the change first, I'll write this mail first to check > what you think. (I also didn't test my patch yet, although all unit & IT > still pass) > > Thanks > Stefan > >
