Stefan, could you put this in Jira (https://issues.apache.org/jira/browse/CRUNCH) and add the patch there?
- Gabriel On Wed, Oct 23, 2013 at 2:55 PM, Josh Wills <[email protected]> wrote: > On Wed, Oct 23, 2013 at 5:28 AM, Gabriel Reid <[email protected]>wrote: > >> On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <[email protected]> wrote: >> > I certainly understand the issue; do you prefer the two-function solution >> > to one in which we added a method in DoFn to indicate which phase of the >> MR >> > job a particular DoFn was being executed in? We might have options like >> > MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a >> > solution would work for all cases, so someone please call me out on that >> if >> > there's something I'm missing.) >> > >> >> Do you mean a method in DoFn so that you could say "execute this >> function in the reduce phase", or do you mean a method where you could >> say "tell me which phase I'm operating in"? >> > > The latter. > > >> >> If you're talking about the second case, I think that Stefan's >> suggested solution will be perfectly compatible with that (i.e. pass >> in a single CombineFn that is aware of which context it's running in), >> while it would also be possible to explicitly pass in two different >> implementations, which I think would be really interesting to have. >> > > Indeed-- my thought was that the functions would be pretty similar in most > cases, and that you would usually just want a switch for determining which > context you were in to reduce the code duplication. Of course, you could do > it yourself w/like a boolean flag on the constructor of the DoFn, which > might also be a good option-- wouldn't require a DoFn change. > > >> >> - Gabriel >> >> >> >> >> > J >> > >> > >> > On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit < >> [email protected]>wrote: >> > >> >> Hi, >> >> >> >> I encountered a situation where I need different behaviour of my >> CombineFn >> >> during combine & reduce phase. >> >> Basically, I have a collection of avro records that I need to combine. >> >> For some of these, I have so many records with same key that I need to >> >> combine them first to make my job work (memory & timing constraints) >> >> For others, I can't combine them, because I need all records together. >> >> So, basically I would want to know in my function if it's combining or >> >> reducing. >> >> The only way to solve my problem in crunch right now seems to be to >> first >> >> split my collection in 2 different collections, combine them separately >> & >> >> union them again. >> >> But this give a lot of overhead for something that would be supported by >> >> native M/R. >> >> >> >> I looked in the code and it seems that crunch internally has a >> NodeContext >> >> object to indicate COMBINE or REDUCE, but this context is not >> accessible in >> >> the DoFn. >> >> As the (RT)Node object is an internal crunch object, it's also not a >> clean >> >> solution to expose the NodeContext. >> >> So, as a better solution, it would be possible to create a new method: >> >> combineValues(combineFn, reduceFn) on PGroupedTable. The existing >> >> combineValues(combineFn) is in that case just a convenience method for >> must >> >> use cases, where the combineFn & reduceFn is the same function. >> >> With this new method, I would be able to just create my combineFn twice >> & >> >> pass a boolean in the constructor to indicate if it's combine or reduce. >> >> >> >> I already made a patch to add this function, but as the procedure >> >> indicates to discuss the change first, I'll write this mail first to >> check >> >> what you think. (I also didn't test my patch yet, although all unit & IT >> >> still pass) >> >> >> >> Thanks >> >> Stefan >> >> >> >> >> > > > > -- > Director of Data Science > Cloudera <http://www.cloudera.com> > Twitter: @josh_wills <http://twitter.com/josh_wills>
