Hi Lu,

The answer is a little bit of both: all of the MR jobs that are created from a single Crunch Pipeline instance use the same Configuration object as a base, but there are a number of ways to change the configuration settings for a specific job. If you know that all of the jobs in your pipeline will have roughly the same memory requirements, you can specify those settings once on the top-level Configuration object.
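As a concrete sketch of that pattern (the property name below is the MR1-era setting; on YARN clusters you'd use properties like mapreduce.map.memory.mb instead, and MyApp is just a stand-in for your own driver class):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class MyApp {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Every MR job planned from this pipeline inherits the larger heap.
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        Pipeline pipeline = new MRPipeline(MyApp.class, conf);
        // ... build your reads/DoFns/writes here, then:
        pipeline.done();
      }
    }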
If you know that a particular bit of processing will require more memory, there are two main ways to add settings to the Configuration for just that job:

1) Each DoFn has a configure(Configuration conf) method that is called right before the job containing that DoFn is launched, so if you know that job will have additional memory requirements, you can specify those settings by overriding configure(Configuration conf) on that instance (see the first sketch below).

2) One of the most common places to change settings is during the shuffle, which is represented in Crunch by the groupByKey operation on the PTable interface. groupByKey can take an optional GroupingOptions instance, which carries per-job configuration settings that are set either explicitly (such as the number of reducers, the grouping comparator, or the sorting comparator) or via the conf(String key, String value) method on GroupingOptions.Builder, which lets you specify additional key-value pairs to set in the Configuration object associated with that job (see the second sketch below).
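Here's a minimal sketch of option 1; ExpensiveFn is a made-up name and the heap size is arbitrary:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.hadoop.conf.Configuration;

    public class ExpensiveFn extends DoFn<String, String> {
      @Override
      public void configure(Configuration conf) {
        // Called just before the job containing this DoFn is launched;
        // only that job sees the larger heap.
        conf.set("mapred.child.java.opts", "-Xmx4096m");
      }

      @Override
      public void process(String input, Emitter<String> emitter) {
        // ... memory-hungry work goes here ...
        emitter.emit(input);
      }
    }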
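And a sketch of option 2, assuming a PTable<String, Long> named counts that was built earlier in the pipeline; the reducer count and heap setting are arbitrary:

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.PGroupedTable;
    import org.apache.crunch.PTable;

    // 'counts' is assumed to be a PTable<String, Long> built earlier.
    GroupingOptions opts = GroupingOptions.builder()
        .numReducers(20)                                     // explicit setting
        .conf("mapred.reduce.child.java.opts", "-Xmx3072m")  // extra key-value pair
        .build();
    // The extra settings apply only to the job that performs this shuffle.
    PGroupedTable<String, Long> grouped = counts.groupByKey(opts);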
I hope that helps. Please feel free to send along any other questions you have.

Josh

On Tue, Oct 15, 2013 at 8:52 PM, heng lu <[email protected]> wrote:
> Hello:
>
> I am a new user of Crunch, and plan to deploy Crunch at my company.
> Based on my understanding, Crunch acts like a pre-processor that
> ultimately transforms the user's code into Map-Reduce jobs, so one
> pipeline can be transformed into a number of map-reduce jobs. I just
> want to know how Crunch allocates resources for these jobs. As we can
> configure memory usage for map-reduce jobs in Hadoop via the
> configuration file, how does Crunch do this for the generated jobs? Do
> they all share a single configuration, or does each have its own
> configuration based on the job load?
> Looking forward to your help!
> Thank you
>
> Best Regards
> Lu Heng

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>