The number of reducers should be a function of the amount of data that needs reducing, not of the number of mappers.
For example, your mappers might delete 90% of the input data, in which case you should only need 1/10 as many reducers as mappers.

Miles

On 16/01/2008, Jim the Standing Bear <[EMAIL PROTECTED]> wrote:
>
> Hmm.. interesting... these are supposed to be the output from mappers
> (and default reducers, since I didn't specify any for those jobs)...
> but shouldn't the number of reducers match the number of mappers? If
> there was only one reducer, it would mean I only had one mapper task
> running?? That is why I asked my question in the first place, because
> I suspect my jobs were not being run in parallel.
>
> On Jan 16, 2008 11:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > The part nomenclature does not refer to splits. It refers to how many
> > reduce processes were involved in actually writing the output file.
> > Files are split at read time as necessary.
> >
> > You will get more of them if you have more reducers.
> >
> > On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks Ted. I just didn't ask it right. Here is a stupid 101
> > > question, whose answer I am sure lies in the documentation
> > > somewhere; I was just having some difficulty finding it...
> > >
> > > When I do an "ls" on the dfs, I see this:
> > > /user/bear/output/part-00000 <r 4>
> > >
> > > I probably got confused about what part-##### means... I thought
> > > part-##### tells how many splits a file has... so far, I have only
> > > seen part-00000. When will it have part-00001, 00002, etc.?
> > >
> > > On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >>
> > >> Parallelizing the processing of data occurs at two steps. The first
> > >> is during the map phase, where the input data file is (hopefully)
> > >> split across multiple tasks. This should happen transparently most
> > >> of the time, unless you have a perverse data format or use
> > >> unsplittable compression on your file.
> > >>
> > >> This parallelism can occur whether you have one input file or many.
> > >>
> > >> The second level of parallelism is at the reduce phase. You set this
> > >> by setting the number of reducers. This will also determine the
> > >> number of output files that you get.
> > >>
> > >> Depending on your algorithm, it may help or hurt to have one or many
> > >> reducers. The recent example of a program to find the 10 largest
> > >> elements is one that pretty much requires a single reducer. Other
> > >> programs, where the mapper produces huge amounts of output, would be
> > >> better served by having many reducers.
> > >>
> > >> This is a general answer, since the question is kind of non-specific.
> > >>
> > >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> How do I make Hadoop split its output? The program I am writing
> > >>> crawls a catalog tree from a single URL, so initially the input
> > >>> contains only one entry. After a few iterations, it will have tens
> > >>> of thousands of URLs. But what I noticed is that the file is always
> > >>> in one block (part-00000). What I would like is for the job to
> > >>> parallelize once the number of entries increases. Currently that
> > >>> doesn't seem to be the case.
>
> --
> --------------------------------------
> Standing Bear Has Spoken
> --------------------------------------
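To make the reducer-count point above concrete, here is a minimal sketch using the old org.apache.hadoop.mapred API. The input/output paths, the identity mapper and reducer, and the choice of four reducers are illustrative assumptions, and exact method names can vary a bit between Hadoop releases:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setJobName("reducer-count-example");

        // Pass-through mapper/reducer; paths are placeholders for this sketch.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/bear/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/bear/output"));

        // The number of reduce tasks is chosen by the job, not derived from
        // the number of map tasks. Four reducers here means the output
        // directory will contain part-00000 through part-00003.
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
    }
}

With setNumReduceTasks(1) you get the single part-00000 file described above; with 4 you get part-00000 through part-00003, each written by its own reduce task running in parallel.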