Thanks, Miles.

On Jan 16, 2008 11:51 AM, Miles Osborne <[EMAIL PROTECTED]> wrote:
> The number of reducers should be a function of the amount of data needing
> reducing, not the number of mappers.
>
> For example, your mappers might delete 90% of the input data, in which
> case you should only need a tenth as many reducers as mappers.
>
> Miles
>
>
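(For reference, here is a minimal sketch of what Miles is describing: sizing the
reducer count from an estimate of how much data survives the map phase rather than
from the mapper count. The class name, the 10 GB / 10% figures, and the 128 MB-per-reducer
target are all made up for illustration; only the calls are from the old
org.apache.hadoop.mapred API.)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerSizing {
      public static void main(String[] args) throws Exception {
        // Made-up numbers: say the mappers keep ~10% of a 10 GB input.
        long estimatedReduceInputBytes = (10L * 1024 * 1024 * 1024) / 10;
        long targetBytesPerReducer = 128L * 1024 * 1024;   // assumed per-reducer target
        int numReducers = (int) Math.max(1, estimatedReduceInputBytes / targetBytesPerReducer);

        JobConf conf = new JobConf(ReducerSizing.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setNumReduceTasks(numReducers);   // sized from data volume, not mapper count
        JobClient.runJob(conf);
      }
    }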
> On 16/01/2008, Jim the Standing Bear <[EMAIL PROTECTED]> wrote:
> >
> > hmm.. interesting... these are supposed to be the output from mappers
> > (and default reducers since I didn't specify any for those jobs)...
> > but shouldn't the number of reducers match the number of mappers?  If
> > there was only one reducer, it would mean I only had one mapper task
> > running??  That is why I asked my question in the first place, because
> > I suspect my jobs were not being run in parallel.
> >
> > On Jan 16, 2008 11:42 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >
> > > The part nomenclature does not refer to splits.  It refers to how many
> > > reduce processes were involved in actually writing the output file.
> > > Files are split at read-time as necessary.
> > >
> > > You will get more of them if you have more reducers.
> > >
> > >
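(A quick sketch of Ted's point that the part-NNNNN files track reducers, not input
splits. The paths and the choice of four reducers are just for illustration; with this
configuration the output directory would contain part-00000 through part-00003
regardless of how many map tasks ran.)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PartFileDemo {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PartFileDemo.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/bear/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/bear/output"));

        // Four reduce tasks -> four output files, no matter how many map
        // tasks read the input: part-00000, part-00001, part-00002, part-00003.
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
      }
    }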
> > >
> > > On 1/16/08 8:25 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> > > > question whose answer I'm sure lies in the documentation
> > > > somewhere; I just had some difficulty finding it...
> > > >
> > > > when I do an "ls" on the dfs,  I would see this:
> > > > /user/bear/output/part-00000 <r 4>
> > > >
> > > > I probably got confused about what part-##### means... I thought
> > > > part-##### told how many splits a file has... so far, I have only
> > > > seen part-00000.  When will there be part-00001, 00002, etc.?
> > > >
> > > >
> > > >
> > > > On Jan 16, 2008 11:04 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >>
> > > >> Parallelizing the processing of data occurs in two steps.  The first is
> > > >> during the map phase, where the input data file is (hopefully) split across
> > > >> multiple tasks.  This should happen transparently most of the time unless
> > > >> you have a perverse data format or use unsplittable compression on your
> > > >> file.
> > > >>
> > > >> This parallelism can occur whether you have one input file or many.
> > > >>
> > > >> The second level of parallelism is at the reduce phase.  You set this by
> > > >> setting the number of reducers.  This will also determine the number of
> > > >> output files that you get.
> > > >>
> > > >> Depending on your algorithm, it may help or hurt to have one or many
> > > >> reducers.  The recent example of a program to find the 10 largest
> > > >> elements pretty much requires a single reducer.  Other programs
> > > >> where the mapper produces huge amounts of output would be better
> > > >> served by having many reducers.
> > > >>
> > > >> This is a general answer since the question is kind of non-specific.
> > > >>
> > > >>
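(To make the two levels concrete, here is a rough driver sketch under the same old-API
assumptions. TopNDriver and the commented-out mapper/reducer classes are placeholders,
not code from this thread; the point is that map parallelism comes from input splits,
while reduce parallelism is whatever setNumReduceTasks says, a single reducer here for
a global top-10.)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TopNDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TopNDriver.class);

        // Map-side parallelism comes from input splits, which the framework
        // derives from the input files themselves; there is nothing to set
        // here unless the data uses unsplittable compression (e.g. gzip).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Placeholders for the "10 largest elements" logic Ted mentions;
        // the actual mapper and reducer are not part of this thread.
        // conf.setMapperClass(TopNMapper.class);
        // conf.setReducerClass(TopNReducer.class);

        // A global top-10 needs every candidate to pass through one reducer,
        // so reduce-side parallelism is deliberately limited to a single task.
        conf.setNumReduceTasks(1);

        JobClient.runJob(conf);
      }
    }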
> > > >>
> > > >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> How do I make Hadoop split its output?  The program I am writing
> > > >>> crawls a catalog tree from a single URL, so initially the input
> > > >>> contains only one entry.  After a few iterations, it will have tens of
> > > >>> thousands of URLs.  But what I noticed is that the file is always in
> > > >>> one block (part-00000).  What I would like is for the job to parallelize
> > > >>> once the number of entries increases.  Currently that doesn't seem to be
> > > >>> the case.
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> >
> >
> >
> > --
> > --------------------------------------
> > Standing Bear Has Spoken
> > --------------------------------------
> >
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
