Also read: http://arxiv.org/abs/1209.2191 ;-)
On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
> I wouldn't be so surprised. It takes time, energy, and money to solve
> problems and build solutions that are production-ready. Some people would
> consider the namenode/secondary-namenode SPOF a limit on Hadoop itself
> when it comes to critical production environments. (I am only quoting
> that view and don't want to start a discussion about it.)
>
> One paper that I heard about (but haven't had the time to read as of now)
> might be related to your problem space:
> http://arxiv.org/abs/1110.4198
> But a research paper does not mean production-ready tomorrow.
>
> http://research.google.com/archive/mapreduce.html is from 2004,
> and http://research.google.com/pubs/pub36632.html (Dremel) is from 2010.
>
> Regards
>
> Bertrand
>
> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
>
>> jay,
>>
>> thanks. i just needed a sanity check. i hope and expect that one day,
>> hadoop will mature toward supporting a "shared-something" approach.
>> the web service call is not a bad idea at all. that way, we can
>> abstract away what that ultimate data store really is.
>>
>> i'm just a little surprised that we are still in the same state with
>> hadoop in regards to this issue (there are probably higher priorities),
>> and that no research (that i know of) has come out of academia to
>> mitigate some of these limitations of hadoop. (where has all the
>> funding for hadoop/mapreduce research gone if this framework is the
>> fundamental building block of a vast amount of knowledge-mining
>> activity?)
>>
>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>> > The reason this is so rare is that map/reduce tasks are, by nature,
>> > orthogonal: word count, batch image recognition, terasort -- all the
>> > things hadoop is famous for are largely orthogonal tasks.
>> > It's much rarer (i think) to see people using hadoop for traffic
>> > simulations or protein-folding problems, because those tasks require
>> > continuous signal integration.
>> >
>> > 1) First, consider rewriting the job so that all communication is
>> > replaced by state variables in a reducer, and choose your keys wisely,
>> > so that all "communication" between machines is obviated by the fact
>> > that a single reducer receives all the information relevant to its
>> > task.
>> >
>> > 2) If a small amount of state needs to be preserved or cached in real
>> > time to avoid two machines redoing the same work (i.e. invoking a web
>> > service to get a piece of data, or some other task that needs to be
>> > rate-limited and not duplicated), then you can use a fast key-value
>> > store (like you suggested), such as the ones provided by Basho
>> > (http://basho.com/) or Amazon (Dynamo).
>> >
>> > 3) If you really need a lot of message passing, then you might be
>> > better off using an inherently more integrated tool like GridGain,
>> > which allows sophisticated message passing between asynchronously
>> > running processes, i.e.
>> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
>> >
>> > It seems like there might not be a reliable way to implement a
>> > sophisticated message-passing architecture in hadoop, because the
>> > system is inherently so dynamic, and is built for rapid streaming
>> > reads/writes, which would be stifled by significant communication
>> > overhead.
>
>
> --
> Bertrand Dechoux

--
Harsh J
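Jay's point 1 (replace inter-task communication with key choice, so one reducer sees all the state it needs) can be sketched with a toy in-memory map/shuffle/reduce. This is only an illustration of the idea, not Hadoop code; the record data is made up.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs; the key is the "address" that routes
    # all related records to a single reducer.
    for user, amount in records:
        yield user, amount

def shuffle(pairs):
    # The framework groups by key -- no explicit messages between tasks.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Each reducer sees the complete state for its key, so no
    # cross-machine communication is needed to finish its task.
    return {k: sum(vs) for k, vs in groups.items()}

records = [("alice", 3), ("bob", 1), ("alice", 2)]
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)  # {'alice': 5, 'bob': 1}
```

The whole trick is in the key: anything that two tasks would otherwise have to exchange messages about is given the same key, and the shuffle delivers it to one reducer.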
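Jay's point 2 (a shared fast key-value store so two machines don't redo the same rate-limited work) can be sketched as a check-then-fetch wrapper. A plain dict stands in for Riak/Dynamo here; a real shared store would need an atomic put-if-absent instead of this naive check.

```python
kv_store = {}    # stand-in for a shared key-value store (e.g. Riak/Dynamo)
call_count = 0   # counts how often the expensive "web service" is hit

def fetch_remote(key):
    # Placeholder for the expensive, rate-limited call.
    global call_count
    call_count += 1
    return "payload-for-" + key

def cached_fetch(key):
    # Check the shared store first so duplicate work is skipped.
    # NOTE: in a real distributed store this check-then-set must be
    # an atomic operation, or two machines can still race.
    if key not in kv_store:
        kv_store[key] = fetch_remote(key)
    return kv_store[key]

for k in ["a", "b", "a", "a"]:
    cached_fetch(k)
print(call_count)  # 2 -- the duplicate "a" lookups hit the store
```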
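Jay's point 3 (actor-style message passing between asynchronously running processes, as in the GridGain link) can be illustrated with a minimal mailbox actor built on stdlib threads and queues. This is a generic sketch of the pattern, not GridGain's API.

```python
import threading
import queue

class Actor:
    """An actor owns a mailbox and reacts to messages asynchronously,
    instead of sharing mutable state with other tasks."""

    def __init__(self, name):
        self.name = name
        self.mailbox = queue.Queue()
        self.received = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:          # poison pill: shut down cleanly
                break
            self.received.append(msg)  # "handle" the message

    def send(self, msg):
        self.mailbox.put(msg)        # asynchronous: returns immediately

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

worker = Actor("worker")
for i in range(3):
    worker.send("task-%d" % i)
worker.stop()
print(worker.received)  # ['task-0', 'task-1', 'task-2']
```

As the thread above notes, this kind of continuous message passing is exactly what vanilla MapReduce is not built for, which is why a tool designed around it is the better fit.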