On Thu, Jan 30, 2014 at 7:09 AM, Jinal Shah <[email protected]> wrote:
> Hi everyone,
>
> This is Jinal Shah, I'm new to the group. I had a question about Execution
> Control in Crunch. Is there any way we can force Crunch to do certain
> operations in parallel or certain operations in sequential ways? For
> example, let's say we want the pipeline to execute a particular DoFn
> function in the Map phase instead of the Reduce phase or vice versa. Or
> execute a particular flow only after a particular flow is completed, as
> opposed to running it in parallel.
>

Forcing a DoFn to operate in a map or reduce phase is tough for the planner
to do right now; we sort of rely on the developer to have a mental model of
how the jobs will proceed. The place where you usually want to force a DoFn
to execute in the reduce vs. the map phase is when you have dependent
groupByKey operations. In that case you can call cache() or materialize() on
the intermediate output that you want to split on, and the planner will
respect that.

On the latter question, the thing to look for is
org.apache.crunch.ParallelDoOptions, which isn't something I've documented in
the user guide yet (it's on the to-do list, I promise). You can give a
parallelDo call an additional argument that specifies one or more
SourceTargets that have to exist before a particular DoFn is allowed to run.
In this way, you can force aspects of the pipeline to be sequential instead
of parallel. We make use of ParallelDoOptions inside of the
MapsideJoinStrategy code, to ensure that the data set that we'll be loading
in-memory actually exists in the file system before we run the code that
reads it into memory.

>
> Maybe this has been asked before, so sorry if it came up again. If you guys
> have further questions on the details, do let me know.
>
>
> Thanks everyone and have a great day.
>
> Thanks
> Jinal
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
