The split will depend entirely on the input format that you use and the files that you have. In your case, you have lots of very small files so the limiting factor will almost certainly be the number of files. Thus, you will have 1000 splits (one per file).
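To make the arithmetic concrete, here is a rough sketch of how a file-based input format might arrive at the split count. It is a simplified model, not Hadoop's actual code: the function name `count_splits`, the 64 MB block size, the minimum split size of 1 byte, and the 1.1 slop factor are all assumptions made for illustration. The key point it shows is that a split never spans more than one file, so the requested number of map tasks only acts as a hint.

```python
def count_splits(file_sizes, requested_maps,
                 block_size=64 * 1024 * 1024,  # assumed HDFS block size
                 min_split_size=1,             # assumed minimum split size
                 slop=1.1):                    # assumed tolerance for the last split
    """Estimate how many splits a file-based input format would produce.

    Simplified model: the target split size is derived from the requested
    number of maps, capped by the block size, and each file is cut up
    independently of the others.
    """
    total = sum(file_sizes)
    goal_size = total // max(requested_maps, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = 0
    for size in file_sizes:
        remaining = size
        # Carve full-size splits off the file while clearly more than one is left.
        while remaining / split_size > slop:
            splits += 1
            remaining -= split_size
        # Whatever is left over (if anything) becomes the file's final split.
        if remaining > 0:
            splits += 1
    return splits


MB = 1024 * 1024

# 1000 small files totalling roughly 100 MB, 10 map tasks requested.
many_small = count_splits([100 * MB // 1000] * 1000, requested_maps=10)

# 10 files of 10 MB each, 10 map tasks requested.
few_large = count_splits([10 * MB] * 10, requested_maps=10)

print(many_small, few_large)
```

Under these assumptions the 1000-file case yields one split per file, while the 10-file case yields 10 splits of about 10 MB each, so consolidating the files is what actually changes the split size.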
Your performance, by the way, will likely be pretty poor with so many small files. Can you consolidate them? 100 MB of data should probably be in no more than a few files if you want good performance. Even then, most kinds of processing will be completely dominated by job startup time. If your jobs are I/O bound, they will be able to read 100 MB of data in just a few seconds at most. Startup time for a Hadoop job is typically 10 seconds or more.

On 4/4/08 12:58 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

> I have a question on how input files are split before they are given out to
> Map functions.
> Say I have an input directory containing 1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
>
> 1. With this configuration, do we have a way to know what size each split
> will be of?
> 2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in input directory, but the total size
> of all these files is still 100 MB? Will it affect split size?
>
> Thanks.
