On Wed, Oct 23, 2013 at 5:28 AM, Gabriel Reid <[email protected]>wrote:
> On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <[email protected]> wrote: > > I certainly understand the issue; do you prefer the two-function solution > > to one in which we added a method in DoFn to indicate which phase of the > MR > > job a particular DoFn was being executed in? We might have options like > > MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a > > solution would work for all cases, so someone please call me out on that > if > > there's something I'm missing.) > > > > Do you mean a method in DoFn so that you could say "execute this > function in the reduce phase", or do you mean a method where you could > say "tell me which phase I'm operating in"? > The latter. > > If you're talking about the second case, I think that Stefan's > suggested solution will be perfectly compatible with that (i.e. pass > in a single CombineFn that is aware of which context it's running in), > while it would also be possible to explicitly pass in two different > implementations, which I think would be really interesting to have. > Indeed-- my thought was that the functions would be pretty similar in most cases, and that you would usually just want a switch for determining which context you were in to reduce the code duplication. Of course, you could do it yourself w/like a boolean flag on the constructor of the DoFn, which might also be a good option-- wouldn't require a DoFn change. > > - Gabriel > > > > > > J > > > > > > On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit < > [email protected]>wrote: > > > >> Hi, > >> > >> I encountered a situation where I need different behaviour of my > CombineFn > >> during combine & reduce phase. > >> Basically, I have a collection of avro records that I need to combine. > >> For some of these, I have so many records with same key that I need to > >> combine them first to make my job work (memory & timing constraints) > >> For others, I can't combine them, because I need all records together. > >> So, basically I would want to know in my function if it's combining or > >> reducing. > >> The only way to solve my problem in crunch right now seems to be to > first > >> split my collection in 2 different collections, combine them separately > & > >> union them again. > >> But this give a lot of overhead for something that would be supported by > >> native M/R. > >> > >> I looked in the code and it seems that crunch internally has a > NodeContext > >> object to indicate COMBINE or REDUCE, but this context is not > accessible in > >> the DoFn. > >> As the (RT)Node object is an internal crunch object, it's also not a > clean > >> solution to expose the NodeContext. > >> So, as a better solution, it would be possible to create a new method: > >> combineValues(combineFn, reduceFn) on PGroupedTable. The existing > >> combineValues(combineFn) is in that case just a convenience method for > must > >> use cases, where the combineFn & reduceFn is the same function. > >> With this new method, I would be able to just create my combineFn twice > & > >> pass a boolean in the constructor to indicate if it's combine or reduce. > >> > >> I already made a patch to add this function, but as the procedure > >> indicates to discuss the change first, I'll write this mail first to > check > >> what you think. (I also didn't test my patch yet, although all unit & IT > >> still pass) > >> > >> Thanks > >> Stefan > >> > >> > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
