Hi Lu,

The answer is a little bit of both: all of the MR jobs that are created from a single Crunch Pipeline instance use the same Configuration object as a base, but there are a number of ways to change the configuration settings for a specific job. If you know that all of the jobs in your pipeline will have roughly the same memory requirements, you can specify those settings once on the top-level Configuration object.
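As a concrete sketch of that pattern (the property name below is the MR1-era setting; on YARN clusters you'd use properties like mapreduce.map.memory.mb instead, and MyApp is just a stand-in for your own driver class):

    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.hadoop.conf.Configuration;

    public class MyApp {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Every MR job planned from this pipeline inherits the larger heap.
        conf.set("mapred.child.java.opts", "-Xmx2048m");
        Pipeline pipeline = new MRPipeline(MyApp.class, conf);
        // ... build your reads/DoFns/writes here, then:
        pipeline.done();
      }
    }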
If you know that a particular bit of processing will require more memory, there are two main ways to add settings to the Configuration for just that job:

1) Each DoFn has a configure(Configuration conf) method that is called right before the job containing that DoFn is launched, so if you know that job will have additional memory requirements, you can specify those settings by overriding configure(Configuration conf) on that instance (see the first sketch below).

2) One of the most common places to change settings is during the shuffle, which is represented in Crunch by the groupByKey operation on the PTable interface. groupByKey can take an optional GroupingOptions instance, which carries per-job configuration settings that are set either explicitly (such as the number of reducers, the grouping comparator, or the sorting comparator) or via the conf(String key, String value) method on GroupingOptions.Builder, which lets you specify additional key-value pairs to set in the Configuration object associated with that job (see the second sketch below).
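Here's a minimal sketch of option 1; ExpensiveFn is a made-up name and the heap size is arbitrary:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.hadoop.conf.Configuration;

    public class ExpensiveFn extends DoFn<String, String> {
      @Override
      public void configure(Configuration conf) {
        // Called just before the job containing this DoFn is launched;
        // only that job sees the larger heap.
        conf.set("mapred.child.java.opts", "-Xmx4096m");
      }

      @Override
      public void process(String input, Emitter<String> emitter) {
        // ... memory-hungry work goes here ...
        emitter.emit(input);
      }
    }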
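And a sketch of option 2, assuming a PTable<String, Long> named counts that was built earlier in the pipeline; the reducer count and heap setting are arbitrary:

    import org.apache.crunch.GroupingOptions;
    import org.apache.crunch.PGroupedTable;
    import org.apache.crunch.PTable;

    // 'counts' is assumed to be a PTable<String, Long> built earlier.
    GroupingOptions opts = GroupingOptions.builder()
        .numReducers(20)                                     // explicit setting
        .conf("mapred.reduce.child.java.opts", "-Xmx3072m")  // extra key-value pair
        .build();
    // The extra settings apply only to the job that performs this shuffle.
    PGroupedTable<String, Long> grouped = counts.groupByKey(opts);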
I hope that helps. Please feel free to send along any other questions you have.

Josh

On Tue, Oct 15, 2013 at 8:52 PM, heng lu <[email protected]> wrote:
> Hello:
>
> I am a new user of Crunch, and plan to deploy Crunch at my company.
> Based on my understanding, Crunch acts like a pre-processor that
> ultimately transforms the user's code into Map-Reduce jobs, so one
> pipeline can be transformed into a number of map-reduce jobs. I just
> want to know how Crunch allocates resources for these jobs. As we can
> configure memory usage for map-reduce jobs in Hadoop via the
> configuration file, how does Crunch do this for the generated jobs? Do
> they all share a single configuration, or does each have its own
> configuration based on the job load?
> Looking forward to your help!
> Thank you
>
> Best Regards
> Lu Heng

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>