One of the ideas that Gabriel mentioned on our last epic architecture thread has stuck with me: adding support for using pre-existing Mapper and Reducer classes in the Crunch APIs, so that you could do something like:

    pipeline.read(From.tableSource(...))
        .parallelDo(new SomeDoFn(), ...)
        .parallelDo(mapperFn(Mapper.class), ...)
        .groupByKey()
        .parallelDo(reducerFn(Reducer.class), ...)
        .parallelDo(new OtherDoFn(), ...)
        .write(To.tableTarget(...));

This turns out to be kind of tricky to do no matter how we approach the problem, because for this to work, we'll need to (at a minimum) subclass the Mapper.Context and Reducer.Context classes that are passed to the Mapper and Reducer instances, and they have different implementations (most importantly for our purposes, different constructors) under Hadoop 1 and 2.

It feels to me that what I need to do is create a separate subproject that has to do some crazy stuff (e.g., use different source directories depending on the value of the crunch.platform variable) in order to be able to create the appropriate kind of subclass of Mapper.Context or Reducer.Context. But this sort of thing seems like such a bad idea that there must be some sort of less-bad option available to me, and I wanted to solicit input before I start tilting at this particular windmill.

Thanks!
Josh

--
Director of Data Science
Cloudera
Twitter: @josh_wills
