Hello,

On Tue, Mar 29, 2011 at 11:48 PM, W.P. McNeill <[email protected]> wrote:
>   2. Decrease the block size of my input files. Do I have to do a
> distcp with a non-default block size?  (I think the answer is that I
> have to do a distcp, but I'm making sure.)

A distcp or even a plain "-cp" with a proper -Ddfs.blocksize=<size>
parameter passed along should do the trick.
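A rough sketch of what that could look like (paths here are made up,
and note that older Hadoop releases spell the property dfs.block.size
rather than dfs.blocksize; the value must be a multiple of the
checksum chunk size, 512 bytes by default):

```shell
# Hypothetical paths. 33554432 bytes = 32 MiB; on older Hadoop
# versions, pass -Ddfs.block.size instead of -Ddfs.blocksize.
hadoop distcp -Ddfs.blocksize=33554432 /user/wp/input /user/wp/input-32m

# For a smaller dataset, a plain fs -cp accepts the same -D override:
hadoop fs -Ddfs.blocksize=33554432 -cp /user/wp/input /user/wp/input-32m
```

Either way the data is physically rewritten; you can't change the
block size of files already in HDFS in place.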

> Are there other approaches?

You can have a look at schedulers that guarantee resources to a
submitted job, perhaps?

> Are there other gotchas that come with trying
> to increase mapper granularity?

One thing that comes to mind is that the more splits you have (i.e.,
the more tasks), the more metadata the JobTracker has to hold and
maintain in its memory. Second, your NameNode also needs to hold more
bytes in memory for every such finely-granulated set of files (since
lowering block sizes leads to a LOT more block info and replica
locations to keep track of).
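To put some rough numbers on that (my own back-of-the-envelope sketch,
not measured figures, assuming the default of one input split per HDFS
block and a replication factor of 3):

```python
def num_splits(file_size, block_size):
    """Input splits (~= map tasks) for one file, assuming the
    default FileInputFormat behavior of one split per HDFS block."""
    return -(-file_size // block_size)  # ceiling division

MiB = 1024 ** 2
GiB = 1024 ** 3

file_size = 10 * GiB  # hypothetical input file
replication = 3       # default HDFS replication factor

for block_size in (128 * MiB, 64 * MiB, 8 * MiB):
    splits = num_splits(file_size, block_size)
    # The NameNode tracks each block plus one replica location per
    # copy, so block metadata grows linearly as block size shrinks.
    replica_locations = splits * replication
    print(block_size // MiB, splits, replica_locations)
```

Halving the block size doubles both the task count the JobTracker
tracks and the block/replica records the NameNode keeps in memory, so
there is a real floor below which shrinking blocks stops being worth
it.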

-- 
Harsh J
http://harshj.com
