thanks. the issues pointed out do cover the pain points i'm experiencing.
On Wed, Sep 26, 2012 at 3:11 PM, Harsh J <ha...@cloudera.com> wrote:
> Also read: http://arxiv.org/abs/1209.2191 ;-)
>
> On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>> I wouldn't be so surprised. It takes time, energy, and money to solve
>> problems and build solutions that are production-ready. Some people would
>> consider the namenode/secondary-namenode SPOF a limit on Hadoop itself
>> for critical production environments. (I am only quoting that view and
>> don't want to start a discussion about it.)
>>
>> One paper I heard about (but haven't had time to read yet) might be
>> related to your problem space:
>> http://arxiv.org/abs/1110.4198
>> But a research paper does not mean production-ready tomorrow.
>>
>> http://research.google.com/archive/mapreduce.html is from 2004,
>> and http://research.google.com/pubs/pub36632.html (Dremel) is from 2010.
>>
>> Regards
>>
>> Bertrand
>>
>> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
>>
>>> jay,
>>>
>>> thanks. i just needed a sanity check. i hope and expect that one day
>>> hadoop will mature towards supporting a "shared-something" approach.
>>> the web service call is not a bad idea at all; that way, we can
>>> abstract away what the underlying data store really is.
>>>
>>> i'm just a little surprised that we are still in the same state with
>>> hadoop on this issue (there are probably higher priorities), and that
>>> no research (that i know of) has come out of academia to mitigate
>>> some of these limitations (where has all the funding for
>>> hadoop/mapreduce research gone, if this framework is the fundamental
>>> building block of so much knowledge-mining activity?).
>>>
>>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>>> > The reason this is so rare is that map/reduce tasks are, by their
>>> > nature, orthogonal: word count, batch image recognition, terasort
>>> > -- all the things hadoop is famous for are largely orthogonal
>>> > tasks. it's much rarer (i think) to see people using hadoop for
>>> > traffic simulations or protein-folding problems, because those
>>> > tasks require continuous signal integration.
>>> >
>>> > 1) First, try to rewrite the job so that all communication is
>>> > replaced by state variables in a reducer, and choose your keys
>>> > wisely, so that any "communication" between machines is obviated by
>>> > the fact that a single reducer receives all the information
>>> > relevant to its task.
>>> >
>>> > 2) If a small amount of state needs to be preserved or cached in
>>> > real time to avoid two machines redoing the same work (i.e.
>>> > invoking a web service to fetch a piece of data, or some other task
>>> > that needs to be rate-limited and not duplicated), then you can use
>>> > a fast key-value store (like you suggested), such as the ones
>>> > provided by Basho (http://basho.com/) or Amazon (Dynamo).
>>> >
>>> > 3) If you really need a lot of message passing, then you might be
>>> > better off using an inherently more integrated tool like GridGain,
>>> > which allows sophisticated message passing between asynchronously
>>> > running processes, i.e.
>>> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
>>> >
>>> > It seems there might not be a reliable way to implement a
>>> > sophisticated message-passing architecture in hadoop, because the
>>> > system is inherently so dynamic and is built for rapid streaming
>>> > reads/writes, which would be stifled by significant communication
>>> > overhead.
>>
>> --
>> Bertrand Dechoux
>
> --
> Harsh J
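Jay's point (1) above -- replacing inter-machine communication with well-chosen keys, so that a single reducer receives every record relevant to its state -- can be sketched in plain Java. This is a hypothetical, Hadoop-free simulation of the map/shuffle/reduce flow (the class and method names are made up for illustration, not Hadoop API):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the "state lives in the reducer" idea from point (1):
// the shuffle's grouping-by-key is what stands in for communication,
// because all values for a key arrive at one reduce call together.
public class KeyedStateSketch {

    // "Map" phase: emit (word, 1) pairs, word-count style.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Shuffle" phase: group pairs by key. In real Hadoop the
    // framework does this between the map and reduce phases.
    static Map<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue,
                                   Collectors.toList())));
    }

    // "Reduce" phase: all state for one key is local to one call,
    // so no cross-machine messages are needed.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((k, v) -> result.put(k, reduce(v)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a")));
    }
}
```

The design point is that the key, not a message channel, routes data: anything two "machines" would have had to tell each other is instead emitted under a shared key and aggregated in one place.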