I wouldn't be so surprised. It takes time, energy and money to solve problems and build solutions that are production-ready. A few people would consider the namenode/secondary-namenode SPOF a limit for Hadoop itself when going into a critical production environment. (I am only quoting it and don't want to start a discussion about it.)
One paper that I heard about (but didn't have the time to read as of now) might be related to your problem space: http://arxiv.org/abs/1110.4198

But a research paper does not mean production-ready for tomorrow. http://research.google.com/archive/mapreduce.html is from 2004, and http://research.google.com/pubs/pub36632.html (Dremel) is from 2010.

Regards

Bertrand

On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
> jay,
>
> thanks. i just needed a sanity check. i hope and expect that one day,
> hadoop will mature towards supporting a "shared-something" approach.
> the web service call is not a bad idea at all. that way, we can
> abstract away what that ultimate data store really is.
>
> i'm just a little surprised that we are still in the same state with
> hadoop in regards to this issue (there are probably higher priorities),
> and that no research (that i know of) has come out of academia to
> mitigate some of these limitations of hadoop. (where has all the funding
> for hadoop/mapreduce research gone if this framework is the
> fundamental building block of a vast amount of knowledge-mining
> activity?)
>
> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit...@gmail.com> wrote:
> > The reason this is so rare is that the nature of map/reduce tasks is that
> > they are orthogonal: word count, batch image recognition, terasort --
> > all the things hadoop is famous for are largely orthogonal tasks.
> > it's much rarer (i think) to see people using hadoop for traffic
> > simulations or protein-folding problems, because those tasks
> > require continuous signal integration.
> >
> > 1) First, try to rewrite it so that all communication is replaced
> > by state variables in a reducer, and choose your keys wisely, so that all
> > "communication" between machines is obviated by the fact that a single
> > reducer receives all the information relevant for it to do its task.
> >
> > 2) If a small amount of state needs to be preserved or cached in real time
> > to optimize the situation where two machines might otherwise redo the
> > same task (i.e. invoke a web service to get a piece of data, or some other
> > task that needs to be rate-limited and not duplicated), then you can use a
> > fast key-value store (like you suggested), such as the ones provided by
> > basho (http://basho.com/) or amazon (Dynamo).
> >
> > 3) If you really need a lot of message passing, then you might be
> > better off using an inherently more integrated tool like GridGain, which
> > allows for sophisticated message passing between asynchronously running
> > processes, i.e.
> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
> >
> >
> > It seems like there might not be a reliable way to implement a
> > sophisticated message-passing architecture in hadoop, because the system is
> > inherently so dynamic, and is built for rapid streaming reads/writes, which
> > would be stifled by significant communication overhead.

-- Bertrand Dechoux
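For what it's worth, suggestion (1) above -- replacing inter-machine communication with a well-chosen map output key -- can be sketched in plain Java with no Hadoop dependency. This only simulates the shuffle phase; the "user-N" keys and event values are made up for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of suggestion (1): instead of mappers talking to each other,
// pick the map output key so that every record a computation depends on
// lands at the same reducer. We simulate the framework's shuffle here.
public class KeyGroupingSketch {

    // A mapper would emit (key, value) pairs; the shuffle groups them by
    // key so a single reduce() call sees all values for that key.
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        return mapOutput.stream().collect(Collectors.groupingBy(
                kv -> kv[0],
                Collectors.mapping(kv -> kv[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
                new String[]{"user-1", "login"},
                new String[]{"user-2", "click"},
                new String[]{"user-1", "purchase"});
        Map<String, List<String>> byKey = shuffle(emitted);
        // All of user-1's events arrive together at one reducer -- no
        // cross-machine "communication" was needed to co-locate them.
        System.out.println(byKey.get("user-1")); // [login, purchase]
    }
}
```

The design point is that the partitioner hashes the key, so two records with the same key always reach the same reducer regardless of which mapper produced them.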