Hi all,

When I use Hadoop to run Map/Reduce jobs over a large dataset (many thousands of large input files), the client seems to take quite a long time to initialize the job before it actually starts running. I suspect it is stuck fetching the metadata of thousands of files from the NameNode and computing their splits, one file at a time.
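
To make the idea concrete, here is a rough, untested sketch of what I have in mind: fetching each file's block metadata from the NameNode with a thread pool instead of serially. The class name, pool size, and command-line handling are placeholders I made up for illustration, not anything that exists in Hadoop today.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelSplitLister {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);

    List<Path> inputs = new ArrayList<Path>();
    for (String arg : args) {
      inputs.add(new Path(arg));
    }

    // Each file costs at least one getFileStatus + getFileBlockLocations
    // round trip to the NameNode; issuing these concurrently hides the
    // per-file RPC latency instead of paying it serially.
    ExecutorService pool = Executors.newFixedThreadPool(16); // pool size picked arbitrarily
    List<Future<BlockLocation[]>> futures = new ArrayList<Future<BlockLocation[]>>();
    for (final Path p : inputs) {
      futures.add(pool.submit(new Callable<BlockLocation[]>() {
        public BlockLocation[] call() throws Exception {
          FileStatus status = fs.getFileStatus(p);
          return fs.getFileBlockLocations(status, 0, status.getLen());
        }
      }));
    }

    long totalBlocks = 0;
    for (Future<BlockLocation[]> f : futures) {
      totalBlocks += f.get().length;
    }
    pool.shutdown();

    System.out.println("Located " + totalBlocks + " blocks across "
        + inputs.size() + " files");
  }
}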
Is there any way to evaluate the size of the input and construct the split information in parallel, along these lines? Could we even run a lightweight map/reduce job that builds the split information before the real job is initialized? I think it would be worth constructing a job's split information in parallel for jobs with many thousands of input files.

Hoping for a reply.

Regards,
Samuel
