I wouldn't be so surprised. It takes time, energy and money to solve problems and build solutions that are production-ready. A few people would consider the namenode/secondary-namenode SPOF a limit for Hadoop itself when going into a critical production environment. (I am only quoting it and don't want to start a discussion about it.)
One paper that I heard about (but didn't have the time to read as of now) might be related to your problem space: http://arxiv.org/abs/1110.4198

But a research paper does not mean production-ready for tomorrow. http://research.google.com/archive/mapreduce.html is from 2004, and http://research.google.com/pubs/pub36632.html (Dremel) is from 2010.

Regards

Bertrand

On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
> jay,
>
> thanks. i just needed a sanity check. i hope and expect that one day,
> hadoop will mature towards supporting a "shared-something" approach.
> the web service call is not a bad idea at all. that way, we can
> abstract away what that ultimate data store really is.
>
> i'm just a little surprised that we are still in the same state with
> hadoop in regards to this issue (there are probably higher priorities),
> and that no research (that i know of) has come out of academia to
> mitigate some of these limitations of hadoop. (where has all the funding
> for hadoop/mapreduce research gone if this framework is the
> fundamental building block of a vast amount of knowledge-mining
> activity?)
>
> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit...@gmail.com> wrote:
> > The reason this is so rare is that the nature of map/reduce tasks is that
> > they are orthogonal: word count, batch image recognition, terasort --
> > all the things hadoop is famous for are largely orthogonal tasks.
> > it's much rarer (i think) to see people using hadoop for traffic
> > simulations or protein-folding problems, because those tasks
> > require continuous signal integration.
> >
> > 1) First, try to rewrite it so that all communication is replaced
> > by state variables in a reducer, and choose your keys wisely, so that all
> > "communication" between machines is obviated by the fact that a single
> > reducer receives all the information relevant for it to do its task.
> >
> > 2) If a small amount of state needs to be preserved or cached in real time
> > to optimize the situation where two machines might otherwise redo the
> > same task (i.e. invoke a web service to get a piece of data, or some other
> > task that needs to be rate-limited and not duplicated), then you can use a
> > fast key-value store (like you suggested), such as the ones provided by
> > basho (http://basho.com/) or amazon (Dynamo).
> >
> > 3) If you really need a lot of message passing, then you might be
> > better off using an inherently more integrated tool like GridGain, which
> > allows for sophisticated message passing between asynchronously running
> > processes, i.e.
> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
> >
> >
> > It seems like there might not be a reliable way to implement a
> > sophisticated message-passing architecture in hadoop, because the system is
> > inherently so dynamic, and is built for rapid streaming reads/writes, which
> > would be stifled by significant communication overhead.

-- Bertrand Dechoux
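For what it's worth, suggestion (1) above -- replacing inter-machine communication with a well-chosen map output key -- can be sketched in plain Java with no Hadoop dependency. This only simulates the shuffle phase; the "user-N" keys and event values are made up for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of suggestion (1): instead of mappers talking to each other,
// pick the map output key so that every record a computation depends on
// lands at the same reducer. We simulate the framework's shuffle here.
public class KeyGroupingSketch {

    // A mapper would emit (key, value) pairs; the shuffle groups them by
    // key so a single reduce() call sees all values for that key.
    static Map<String, List<String>> shuffle(List<String[]> mapOutput) {
        return mapOutput.stream().collect(Collectors.groupingBy(
                kv -> kv[0],
                Collectors.mapping(kv -> kv[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
                new String[]{"user-1", "login"},
                new String[]{"user-2", "click"},
                new String[]{"user-1", "purchase"});
        Map<String, List<String>> byKey = shuffle(emitted);
        // All of user-1's events arrive together at one reducer -- no
        // cross-machine "communication" was needed to co-locate them.
        System.out.println(byKey.get("user-1")); // [login, purchase]
    }
}
```

The design point is that the partitioner hashes the key, so two records with the same key always reach the same reducer regardless of which mapper produced them.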