On 11/26/07 11:06, "Jason Venner" <[EMAIL PROTECTED]> wrote:
> We have a number of tasks that we want to accomplish with hadoop, and
> would like to keep each of the hadoop steps very simple.

By this I take it that you wish to use Hadoop to perform a series of simple
transforms on an initial input set of data?

> To our current limited understanding this means that we need to set up N
> hadoop jobs, and run them manually one after the other, using the output
> of one as the input of the next.

That's correct, unless you are willing to perform multiple transforms per
record within a single step.
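
For what it's worth, here is a rough sketch of what chaining two steps can
look like with the classic org.apache.hadoop.mapred API. The paths are made
up, IdentityMapper just stands in for your real per-step transforms, and
exact method names can differ between releases:

    // A rough sketch of chaining two map-only steps with the classic
    // org.apache.hadoop.mapred API. IdentityMapper stands in for the real
    // transforms, and the paths are illustrative only.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        Path input        = new Path("/data/input");
        Path intermediate = new Path("/data/step1-out");
        Path finalOutput  = new Path("/data/step2-out");

        // Step 1: first simple transform (map-only here).
        JobConf step1 = new JobConf(ChainDriver.class);
        step1.setJobName("step1");
        step1.setMapperClass(IdentityMapper.class); // swap in your mapper
        step1.setNumReduceTasks(0);                 // map-only step
        step1.setOutputKeyClass(LongWritable.class);
        step1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(step1, input);
        FileOutputFormat.setOutputPath(step1, intermediate);
        JobClient.runJob(step1);          // blocks until step 1 finishes

        // Step 2: reads whatever step 1 wrote.
        JobConf step2 = new JobConf(ChainDriver.class);
        step2.setJobName("step2");
        step2.setMapperClass(IdentityMapper.class); // swap in your mapper
        step2.setNumReduceTasks(0);
        step2.setOutputKeyClass(LongWritable.class);
        step2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(step2, intermediate);
        FileOutputFormat.setOutputPath(step2, finalOutput);
        JobClient.runJob(step2);
      }
    }

Since JobClient.runJob() blocks until a job completes, the second step can
simply point its input at the first step's output directory.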

> Is there a best practices way of accomplishing this? We are hoping to
> avoid gigantic map tasks.

Gigantic map tasks are typically avoided by splitting the data so that each
map task only processes a bounded number of records. Hadoop's defaults are
geared to handle "large" data sets: by default a file is split into 128MB
blocks, and when the file is submitted for processing, each block represents
a unit of records to be processed by a single map task. More blocks, more
map tasks.

There are more advanced ways to coax the system into assigning multiple map
tasks per block. Even if your data is not huge, don't simply dial down the
block size to generate more map tasks.
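
As a rough, release-dependent sketch of the kind of knobs involved
(continuing the step1 JobConf from the example above); these are only hints,
and the InputFormat has the final say on how the input is split:

    // Only a hint: the old-API FileInputFormat considers this when
    // computing splits, so it can yield more than one map per block.
    step1.setNumMapTasks(32);
    // A floor on split size, in bytes; raising it yields fewer, larger
    // map tasks. The property name may differ between releases.
    step1.set("mapred.min.split.size", "1048576");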

I think the InputFormat docs are a good place to learn more about this
(people, please correct me if there is a better place to start):

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/InputFormat.html

-- 
   Marco Nicosia - Grid Services Ops
   Systems, Tools, and Services Group

