Re: What do people use Hadoop for?

Bryan A. P. Pendleton Thu, 25 Jan 2007 13:04:35 -0800

If the output is already sorted, the sort pass *should* be able to run in
linear time, so it is perhaps not worth optimizing it out for cases of sorted
output. Given the scatter/reassemble nature of the default map/reduce
(scatter by Partition, which by default is by hash), inputs that are sorted
may not be written as such to output. So, if you're counting on sorted data,
maybe it's best to leave the sort in (and verify that the current
infrastructure performs well given sorted input). Otherwise, if there is
no implication of, or need for, sorted output, then the sort can be totally
disabled.
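
To illustrate the linear-time point, here is a rough sketch that counts
comparisons on sorted versus shuffled input. It assumes an adaptive sort:
it uses Python's built-in sort, not Hadoop's sort code, and the helper name
count_comparisons is made up for the example. Whether Hadoop's own sort pass
behaves this way on sorted input is exactly what would need verifying.

```python
import functools
import random


def count_comparisons(items):
    """Sort a copy of items with the built-in sort; return comparisons made."""
    calls = 0

    def cmp(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    # cmp_to_key wraps each item so every ordering test goes through cmp().
    sorted(items, key=functools.cmp_to_key(cmp))
    return calls


n = 1000
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

sorted_cost = count_comparisons(already_sorted)
shuffled_cost = count_comparisons(shuffled)
print("comparisons on sorted input:  ", sorted_cost)
print("comparisons on shuffled input:", shuffled_cost)
```

With an adaptive sort the first number grows roughly with n, while the second
grows roughly with n log n.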


I just think it's a bad idea to disable sort in the case where sorts are
expected, especially given that sorted input may not be preserved in the
current distribution system.
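
A minimal sketch of why sorted input may not survive the scatter step. The
partition function below is a stand-in (CRC32 of the key modulo the number of
partitions), chosen only because it is deterministic; it is not Hadoop's
partitioner, and the key names are invented for the example.

```python
import zlib


def partition(key, num_partitions):
    # Stand-in hash partitioner (an assumption, not Hadoop's code).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def scatter(records, num_partitions):
    """Distribute (key, value) records into per-partition lists, in arrival order."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition(key, num_partitions)].append((key, value))
    return parts


# Input that is already sorted by key.
records = [("key%03d" % i, i) for i in range(20)]
parts = scatter(records, 4)

# Reassemble by reading the partitions one after another, with no sort.
concatenated = [rec for part in parts for rec in part]

each_sorted = all(part == sorted(part) for part in parts)
globally_sorted = concatenated == sorted(concatenated)
print("each partition still in order:", each_sorted)
print("concatenated output in order: ", globally_sorted)
```

Each partition keeps the relative order it received, but the reassembled
output as a whole is no longer in key order.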

On 1/25/07, Doug Judd <[EMAIL PROTECTED]> wrote:


Hi Arkady,

Long time no see.  :)  I actually would like to see support for both map()
and reduce().  For example, if you know that the input is already sorted,
then it might be useful to have reduce() called on the aggregated
intermediate results.  A "classic" usage would be to implement a database
join.  You sort the first table on the foreign key and then do a
merging-only MapReduce with the second table (whose primary key is the
first table's foreign key).  Now you can see a joined view of the rows
inside the reduce() function, whereas the map() function won't have all of
the pieces.  However, having map() is good too for projecting the input in
different ways.
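
The join described above can be sketched roughly as follows. This is plain
Python rather than Hadoop code; the tables, the tags and the names tagged()
and reduce_join() are all hypothetical, and both inputs are assumed to be
already sorted on the join key so that only a merge is needed.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

# Hypothetical tables, both already sorted on the join key.
customers = [(1, "alice"), (2, "bob"), (3, "carol")]        # (primary key, name)
orders = [(1, "order-a"), (1, "order-b"), (3, "order-c")]   # (foreign key, order)


def tagged(rows, tag):
    """Mark each row with its source table so the reducer can tell them apart."""
    for key, value in rows:
        yield key, tag, value


def reduce_join(key, values):
    """Plays the role of reduce(): sees every row for one key, from both tables."""
    names = [v for tag, v in values if tag == "customer"]
    items = [v for tag, v in values if tag == "order"]
    return [(key, name, item) for name in names for item in items]


# Merge only: no sort is performed, the inputs are just interleaved by key.
merged = merge(tagged(customers, "customer"), tagged(orders, "order"),
               key=itemgetter(0))

joined = []
for key, group in groupby(merged, key=itemgetter(0)):
    joined.extend(reduce_join(key, [(tag, v) for _, tag, v in group]))

for row in joined:
    print(row)
```

The point of the sketch is that reduce_join() sees rows from both tables for
a key at once, which a per-record map() cannot.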

I'll go ahead and file a JIRA enhancement request.

- Doug

On 1/25/07, Arkady Borkovsky <[EMAIL PROTECTED]> wrote:
>
> "Disabling the sort"  == "map without reduce" == "map writes the output
> into DFS"
> is indeed a very useful and desirable feature.
> File a JIRA issue.
>
>
> On Jan 24, 2007, at 5:32 PM, Doug Judd wrote:
>
> > After digging into this a bit, it looks like the use of IdentityReducer
> > does not disable the sort.  I wrote a simple Map/Reduce program that uses
> > /usr/share/dict/words as input and generates keys that are a Text
> > representation of the CRC of the word modulo 65536 and values that are
> > the word itself.  I set the reducer to be the IdentityReducer and the
> > output came out sorted:
> >
> > 0       apperceptively
> > 0       Connarus
> > 1       overfold
> > 1       derationalization
> > 1       gymnasium
> > 10      respecting
> > 10      supperwards
> > 100     cellulofibrous
> > 100     drogherman
> > 100     heteroptics
> > 1000    bacao
> > 1000    Cumaean
> > 1000    didymate
> > 1000    disbelieving
> > 1001    polymer
> > 1001    salveline
> > 1001    workwomanly
> > 1002    sporty
> > 1002    bakal
> > 1003    preferentialist
> >
> > Also, after reviewing the Google paper, they make no mention of the sort
> > being disabled by the Identity reducer.  In fact, they describe their
> > Sort implementation as using the identity reducer.
> >
> > Unless I'm missing something, I retract my previous statement.
> > Map-Reduce is really just distributed sort.  I do think that being able
> > to disable the sort is a much needed enhancement, especially since quite
> > a few applications don't need it.
> >
> > - Doug
> >
> > On 1/24/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> >>
> >> Doug Judd wrote:
> >> > Part of the problem is that calling the paradigm "Map-Reduce" is
> >> > somewhat misleading.  It is really just a distributed sort.  The sort
> >> > is where all of the complexity comes from.  Invoking map() over the
> >> > input is O(n), invoking reduce() over the intermediate results is O(n)
> >> > as well.  The sort is O(n log n).  A more appropriate name for this
> >> > algorithm would be "Distributed Sort with a Pre-map Phase and a
> >> > Post-reduce Phase".  Calling it Map-Reduce and leaving out the word
> >> > "sort" (the most important part) is a source of confusion.
> >> >
> >> > If you think of it in these terms, I think it's easier to see where
> >> > and how it applies.
> >>
> >> :) Sure, that's one point of view on this - however, in quite a few
> >> applications sort is definitely less important than the ability to
> >> split the processing load in map() and reduce() over many machines.
> >> Sometimes I don't care about the sorting at all (in all cases where
> >> IdentityReducer is used).
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >> ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >>
>
>



--
Bryan A. P. Pendleton
Ph: (877) geek-1-bp
