hmm.. interesting... these are supposed to be the output from mappers
(and default reducers since I didn't specify any for those jobs)...
but shouldn't the number of reducers match the number of mappers?  If
there was only one reducer, it would mean I only had one mapper task
running??  That is why I asked my question in the first place, because
I suspect my jobs were not running in parallel.

On Jan 16, 2008 11:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> The part nomenclature does not refer to splits.  It refers to how many
> reduce processes were involved in actually writing the output file.  Files
> are split at read-time as necessary.
>
> You will get more of them if you have more reducers.
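>
> For example, a run of the same job with four reducers would leave four
> output files (this listing is illustrative, not from a real run):
>
>     /user/bear/output/part-00000
>     /user/bear/output/part-00001
>     /user/bear/output/part-00002
>     /user/bear/output/part-00003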
>
>
>
> On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
>
> > Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> > question; I am sure the answer lies in the documentation somewhere,
> > but I was having some difficulty finding it...
> >
> > when I do an "ls" on the dfs,  I would see this:
> > /user/bear/output/part-00000 <r 4>
> >
> > I probably got confused about what part-##### means... I thought
> > part-##### tells how many splits a file has... so far, I have only
> > seen part-00000.  When will there be part-00001, 00002, etc.?
> >
> >
> >
> > On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >> Parallelizing the processing of data occurs at two stages.  The first is
> >> during the map phase where the input data file is (hopefully) split across
> >> multiple tasks.  This should happen transparently most of the time unless
> >> you have a perverse data format or use unsplittable compression on your
> >> file.
> >>
> >> This parallelism can occur whether you have one input file or many.
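> >>
> >> For illustration, the split decision belongs to the input format.  A
> >> minimal sketch against the old mapred API (the class name is made up)
> >> of the degenerate case, an input format that refuses to split, so
> >> every file gets exactly one map task:
> >>
> >>   import org.apache.hadoop.fs.FileSystem;
> >>   import org.apache.hadoop.fs.Path;
> >>   import org.apache.hadoop.mapred.TextInputFormat;
> >>
> >>   public class UnsplittableTextInputFormat extends TextInputFormat {
> >>     // Returning false forces one split (and thus one mapper) per file,
> >>     // which is effectively what unsplittable compression does to you.
> >>     protected boolean isSplitable(FileSystem fs, Path file) {
> >>       return false;
> >>     }
> >>   }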
> >>
> >> The second level of parallelism is at the reduce phase.  You set this
> >> by setting the number of reducers.  This will also determine the
> >> number of output files that you get.
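> >>
> >> For instance, a minimal driver sketch with the old JobConf API (MyJob
> >> and the job name are made up; mapper, reducer, and path setup elided):
> >>
> >>   import org.apache.hadoop.mapred.JobClient;
> >>   import org.apache.hadoop.mapred.JobConf;
> >>
> >>   public class MyJob {
> >>     public static void main(String[] args) throws Exception {
> >>       JobConf conf = new JobConf(MyJob.class);
> >>       conf.setJobName("example");
> >>       // Four reduce tasks => four output files, part-00000..part-00003.
> >>       conf.setNumReduceTasks(4);
> >>       // ... set mapper/reducer classes and input/output paths here ...
> >>       JobClient.runJob(conf);
> >>     }
> >>   }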
> >>
> >> Depending on your algorithm, it may help or hurt to have one or many
> >> reducers.  The recent example of a program to find the 10 largest
> >> elements pretty much requires a single reducer.  Other programs where
> >> the mapper produces huge amounts of output would be better served by
> >> having many reducers.
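> >>
> >> A rough sketch of such a top-10 job's reducer, against the old mapred
> >> API (assuming Text keys and LongWritable counts; it is only correct
> >> when the job runs a single reduce task, since each reducer sees only
> >> its own partition, and note that ties on the count overwrite each
> >> other in this sketch):
> >>
> >>   import java.io.IOException;
> >>   import java.util.Iterator;
> >>   import java.util.Map;
> >>   import java.util.TreeMap;
> >>   import org.apache.hadoop.io.LongWritable;
> >>   import org.apache.hadoop.io.Text;
> >>   import org.apache.hadoop.mapred.MapReduceBase;
> >>   import org.apache.hadoop.mapred.OutputCollector;
> >>   import org.apache.hadoop.mapred.Reducer;
> >>   import org.apache.hadoop.mapred.Reporter;
> >>
> >>   public class TopTenReducer extends MapReduceBase
> >>       implements Reducer<Text, LongWritable, Text, LongWritable> {
> >>
> >>     // Running set of the 10 largest totals seen so far, keyed by count.
> >>     private final TreeMap<Long, String> top = new TreeMap<Long, String>();
> >>     private OutputCollector<Text, LongWritable> out;
> >>
> >>     public void reduce(Text key, Iterator<LongWritable> values,
> >>                        OutputCollector<Text, LongWritable> output,
> >>                        Reporter reporter) throws IOException {
> >>       out = output;
> >>       long sum = 0;
> >>       while (values.hasNext()) {
> >>         sum += values.next().get();
> >>       }
> >>       top.put(sum, key.toString());
> >>       if (top.size() > 10) {
> >>         top.remove(top.firstKey());  // evict the current smallest
> >>       }
> >>     }
> >>
> >>     public void close() throws IOException {
> >>       // All keys have been reduced; emit the 10 survivors.
> >>       if (out != null) {
> >>         for (Map.Entry<Long, String> e : top.entrySet()) {
> >>           out.collect(new Text(e.getValue()),
> >>                       new LongWritable(e.getKey()));
> >>         }
> >>       }
> >>     }
> >>   }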
> >>
> >> This is a general answer since the question is kind of non-specific.
> >>
> >>
> >>
> >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi,
> >>>
> >>> How do I make Hadoop split its output?  The program I am writing
> >>> crawls a catalog tree from a single URL, so initially the input
> >>> contains only one entry.  After a few iterations, it will have tens
> >>> of thousands of URLs.  But what I noticed is that the file is always
> >>> in one block (part-00000).  What I would like is for the job to
> >>> parallelize once the number of entries increases.  Currently that
> >>> doesn't seem to be the case.
> >>
> >>
> >
> >
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
