On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <[email protected]> wrote: > I certainly understand the issue; do you prefer the two-function solution > to one in which we added a method in DoFn to indicate which phase of the MR > job a particular DoFn was being executed in? We might have options like > MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a > solution would work for all cases, so someone please call me out on that if > there's something I'm missing.) >
Do you mean a method in DoFn so that you could say "execute this function in the reduce phase", or do you mean a method where you could say "tell me which phase I'm operating in"? If you're talking about the second case, I think that Stefan's suggested solution will be perfectly compatible with that (i.e. pass in a single CombineFn that is aware of which context it's running in), while it would also be possible to explicitly pass in two different implementations, which I think would be really interesting to have. - Gabriel > J > > > On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit > <[email protected]>wrote: > >> Hi, >> >> I encountered a situation where I need different behaviour of my CombineFn >> during combine & reduce phase. >> Basically, I have a collection of avro records that I need to combine. >> For some of these, I have so many records with same key that I need to >> combine them first to make my job work (memory & timing constraints) >> For others, I can't combine them, because I need all records together. >> So, basically I would want to know in my function if it's combining or >> reducing. >> The only way to solve my problem in crunch right now seems to be to first >> split my collection in 2 different collections, combine them separately & >> union them again. >> But this give a lot of overhead for something that would be supported by >> native M/R. >> >> I looked in the code and it seems that crunch internally has a NodeContext >> object to indicate COMBINE or REDUCE, but this context is not accessible in >> the DoFn. >> As the (RT)Node object is an internal crunch object, it's also not a clean >> solution to expose the NodeContext. >> So, as a better solution, it would be possible to create a new method: >> combineValues(combineFn, reduceFn) on PGroupedTable. The existing >> combineValues(combineFn) is in that case just a convenience method for must >> use cases, where the combineFn & reduceFn is the same function. >> With this new method, I would be able to just create my combineFn twice & >> pass a boolean in the constructor to indicate if it's combine or reduce. >> >> I already made a patch to add this function, but as the procedure >> indicates to discuss the change first, I'll write this mail first to check >> what you think. (I also didn't test my patch yet, although all unit & IT >> still pass) >> >> Thanks >> Stefan >> >>
