The number of reducers should be a function of the amount of data that needs reducing, not of the number of mappers.
For example, your mappers might delete 90% of the input data, in which case you should only need 1/10 as many reducers as mappers.

Miles

On 16/01/2008, Jim the Standing Bear <[EMAIL PROTECTED]> wrote:
>
> Hmm.. interesting... these are supposed to be the output from mappers
> (and default reducers, since I didn't specify any for those jobs)...
> but shouldn't the number of reducers match the number of mappers? If
> there was only one reducer, it would mean I only had one mapper task
> running?? That is why I asked my question in the first place, because
> I suspect my jobs were not being run in parallel.
>
> On Jan 16, 2008 11:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > The part nomenclature does not refer to splits. It refers to how many
> > reduce processes were involved in actually writing the output file.
> > Files are split at read time as necessary.
> >
> > You will get more of them if you have more reducers.
> >
> > On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks Ted. I just didn't ask it right. Here is a stupid 101
> > > question, whose answer I am sure lies in the documentation
> > > somewhere; I was just having some difficulty finding it...
> > >
> > > When I do an "ls" on the dfs, I see this:
> > > /user/bear/output/part-00000 <r 4>
> > >
> > > I probably got confused about what part-##### means... I thought
> > > part-##### tells how many splits a file has... so far, I have only
> > > seen part-00000. When will it have part-00001, 00002, etc.?
> > >
> > > On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >>
> > >> Parallelizing the processing of data occurs at two steps. The first
> > >> is during the map phase, where the input data file is (hopefully)
> > >> split across multiple tasks. This should happen transparently most
> > >> of the time, unless you have a perverse data format or use
> > >> unsplittable compression on your file.
> > >>
> > >> This parallelism can occur whether you have one input file or many.
> > >>
> > >> The second level of parallelism is at the reduce phase. You set this
> > >> by setting the number of reducers. This will also determine the
> > >> number of output files that you get.
> > >>
> > >> Depending on your algorithm, it may help or hurt to have one or many
> > >> reducers. The recent example of a program to find the 10 largest
> > >> elements is one that pretty much requires a single reducer. Other
> > >> programs, where the mapper produces huge amounts of output, would be
> > >> better served by having many reducers.
> > >>
> > >> This is a general answer, since the question is kind of non-specific.
> > >>
> > >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> How do I make Hadoop split its output? The program I am writing
> > >>> crawls a catalog tree from a single URL, so initially the input
> > >>> contains only one entry. After a few iterations, it will have tens
> > >>> of thousands of URLs. But what I noticed is that the file is always
> > >>> in one block (part-00000). What I would like is for the job to
> > >>> parallelize once the number of entries increases. Currently that
> > >>> doesn't seem to be the case.
>
> --
> --------------------------------------
> Standing Bear Has Spoken
> --------------------------------------
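To make the reducer-count point above concrete, here is a minimal sketch using the old org.apache.hadoop.mapred API. The input/output paths, the identity mapper and reducer, and the choice of four reducers are illustrative assumptions, and exact method names can vary a bit between Hadoop releases:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setJobName("reducer-count-example");

        // Pass-through mapper/reducer; paths are placeholders for this sketch.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/bear/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/bear/output"));

        // The number of reduce tasks is chosen by the job, not derived from
        // the number of map tasks. Four reducers here means the output
        // directory will contain part-00000 through part-00003.
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
    }
}

With setNumReduceTasks(1) you get the single part-00000 file described above; with 4 you get part-00000 through part-00003, each written by its own reduce task running in parallel.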