Hi all,

When I use Hadoop to run Map/Reduce jobs over a large dataset (many thousands of large input files), the client seems to take quite a long time to initialize the job before it actually starts running. I suspect it is stuck fetching the metadata of thousands of files from the NameNode and computing their splits, one file at a time.
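
To make the idea concrete, here is a rough, untested sketch of what I have in mind: fetching each file's block metadata from the NameNode with a thread pool instead of serially. The class name, pool size, and command-line handling are placeholders I made up for illustration, not anything that exists in Hadoop today.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelSplitLister {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);

    List<Path> inputs = new ArrayList<Path>();
    for (String arg : args) {
      inputs.add(new Path(arg));
    }

    // Each file costs at least one getFileStatus + getFileBlockLocations
    // round trip to the NameNode; issuing these concurrently hides the
    // per-file RPC latency instead of paying it serially.
    ExecutorService pool = Executors.newFixedThreadPool(16); // pool size picked arbitrarily
    List<Future<BlockLocation[]>> futures = new ArrayList<Future<BlockLocation[]>>();
    for (final Path p : inputs) {
      futures.add(pool.submit(new Callable<BlockLocation[]>() {
        public BlockLocation[] call() throws Exception {
          FileStatus status = fs.getFileStatus(p);
          return fs.getFileBlockLocations(status, 0, status.getLen());
        }
      }));
    }

    long totalBlocks = 0;
    for (Future<BlockLocation[]> f : futures) {
      totalBlocks += f.get().length;
    }
    pool.shutdown();

    System.out.println("Located " + totalBlocks + " blocks across "
        + inputs.size() + " files");
  }
}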
Is there any way to evaluate the size of the input and construct the split information in parallel, along these lines? Could we even run a lightweight map/reduce job that builds the split information before the real job is initialized? I think it would be worth constructing a job's split information in parallel for jobs with many thousands of input files.

Hoping for a reply.

Regards,
Samuel
