thanks. the issues pointed out do cover the pain points i'm experiencing.
On Wed, Sep 26, 2012 at 3:11 PM, Harsh J <ha...@cloudera.com> wrote:
> Also read: http://arxiv.org/abs/1209.2191 ;-)
>
> On Thu, Sep 27, 2012 at 12:24 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>> I wouldn't be so surprised. It takes time, energy, and money to solve
>> problems and build solutions that are production-ready. Some people would
>> consider the namenode/secondary-namenode SPOF a limit on Hadoop itself
>> for critical production environments. (I am only quoting that view and
>> don't want to start a discussion about it.)
>>
>> One paper I heard about (but haven't had time to read yet) might be
>> related to your problem space:
>> http://arxiv.org/abs/1110.4198
>> But a research paper does not mean production-ready tomorrow.
>>
>> http://research.google.com/archive/mapreduce.html is from 2004,
>> and http://research.google.com/pubs/pub36632.html (Dremel) is from 2010.
>>
>> Regards
>>
>> Bertrand
>>
>> On Wed, Sep 26, 2012 at 8:18 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
>>
>>> jay,
>>>
>>> thanks. i just needed a sanity check. i hope and expect that one day
>>> hadoop will mature towards supporting a "shared-something" approach.
>>> the web service call is not a bad idea at all; that way, we can
>>> abstract away what the underlying data store really is.
>>>
>>> i'm just a little surprised that we are still in the same state with
>>> hadoop on this issue (there are probably higher priorities), and that
>>> no research (that i know of) has come out of academia to mitigate
>>> some of these limitations (where has all the funding for
>>> hadoop/mapreduce research gone, if this framework is the fundamental
>>> building block of so much knowledge-mining activity?).
>>>
>>> On Wed, Sep 26, 2012 at 12:40 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>>> > The reason this is so rare is that map/reduce tasks are, by their
>>> > nature, orthogonal: word count, batch image recognition, terasort
>>> > -- all the things hadoop is famous for are largely orthogonal
>>> > tasks. it's much rarer (i think) to see people using hadoop for
>>> > traffic simulations or protein-folding problems, because those
>>> > tasks require continuous signal integration.
>>> >
>>> > 1) First, try to rewrite the job so that all communication is
>>> > replaced by state variables in a reducer, and choose your keys
>>> > wisely, so that any "communication" between machines is obviated by
>>> > the fact that a single reducer receives all the information
>>> > relevant to its task.
>>> >
>>> > 2) If a small amount of state needs to be preserved or cached in
>>> > real time to avoid two machines redoing the same work (i.e.
>>> > invoking a web service to fetch a piece of data, or some other task
>>> > that needs to be rate-limited and not duplicated), then you can use
>>> > a fast key-value store (like you suggested), such as the ones
>>> > provided by Basho (http://basho.com/) or Amazon (Dynamo).
>>> >
>>> > 3) If you really need a lot of message passing, then you might be
>>> > better off using an inherently more integrated tool like GridGain,
>>> > which allows sophisticated message passing between asynchronously
>>> > running processes, i.e.
>>> > http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/
>>> >
>>> > It seems there might not be a reliable way to implement a
>>> > sophisticated message-passing architecture in hadoop, because the
>>> > system is inherently so dynamic and is built for rapid streaming
>>> > reads/writes, which would be stifled by significant communication
>>> > overhead.
>>
>> --
>> Bertrand Dechoux
>
> --
> Harsh J
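Jay's point (1) above -- replacing inter-machine communication with well-chosen keys, so that a single reducer receives every record relevant to its state -- can be sketched in plain Java. This is a hypothetical, Hadoop-free simulation of the map/shuffle/reduce flow (the class and method names are made up for illustration, not Hadoop API):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the "state lives in the reducer" idea from point (1):
// the shuffle's grouping-by-key is what stands in for communication,
// because all values for a key arrive at one reduce call together.
public class KeyedStateSketch {

    // "Map" phase: emit (word, 1) pairs, word-count style.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Shuffle" phase: group pairs by key. In real Hadoop the
    // framework does this between the map and reduce phases.
    static Map<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue,
                                   Collectors.toList())));
    }

    // "Reduce" phase: all state for one key is local to one call,
    // so no cross-machine messages are needed.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.toList());
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((k, v) -> result.put(k, reduce(v)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a")));
    }
}
```

The design point is that the key, not a message channel, routes data: anything two "machines" would have had to tell each other is instead emitted under a shared key and aggregated in one place.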