Just to be clear, Spark actually *does* support general task graphs, similar to Dryad (though a bit simpler in that there's a notion of "stages" and a fixed set of connection patterns between them). However, MBrace goes a step beyond that, in that the graphs can be modified dynamically based on user code. It's also not clear what the granularity of task spawns in MBrace is -- can you spawn stuff that runs for 1 millisecond, or 1 second, or 1 hour? The choice there greatly affects system design.
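As a rough illustration of that distinction, here is a minimal Python sketch. It is not Spark or MBrace code: `run_static_stages`, `run_dynamic`, `make_sum_task`, and the `threshold` parameter are hypothetical names made up for this example. The first function assumes the whole stage list is known before anything runs; the second lets each task's result decide which tasks get spawned next.

```python
from collections import deque


def run_static_stages(data, stages):
    """Static plan: the list of stages is fixed before execution starts."""
    for stage in stages:
        # Each element is an independent task within the stage; the next
        # stage only starts once the whole current stage has finished.
        data = [stage(x) for x in data]
    return data


def run_dynamic(root_task):
    """Dynamic plan: a task returns (value, new_tasks), so the graph
    grows at run time based on what user code decides."""
    pending = deque([root_task])
    results = []
    while pending:
        task = pending.popleft()
        value, spawned = task()
        results.append(value)
        pending.extend(spawned)  # tasks discovered only while running
    return results


def make_sum_task(lo, hi, threshold):
    """Hypothetical task: sum range(lo, hi), splitting itself in two
    whenever the range is larger than `threshold` (the task granularity)."""
    def task():
        if hi - lo <= threshold:
            return sum(range(lo, hi)), []
        mid = (lo + hi) // 2
        return 0, [make_sum_task(lo, mid, threshold),
                   make_sum_task(mid, hi, threshold)]
    return task


if __name__ == "__main__":
    print(run_static_stages([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))
    # prints [4, 6, 8]
    print(sum(run_dynamic(make_sum_task(0, 100, 10))))
    # prints 4950
```

In this sketch, `threshold` stands in for the granularity question: a small value yields many short tasks and a large value yields a few long ones, and a real scheduler's per-task overhead would determine which end of that range is practical.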
Matei

On Oct 23, 2013, at 6:54 PM, Christopher Nguyen <c...@adatao.com> wrote:

> Re MBrace: very interesting work. I'm a bit surprised though that the paper
> makes no mention of DryadLINQ
> (http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf).
>
> Architecturally it's a lot easier to see an MBrace implementation
> specialized to a MapReduce (or more generically, a BSP) computation, than
> to have a Spark implement the fully async DAG model of an MBrace/Dryad
> engine.
>
> More practically, as interesting as it might be as a side effort, I think
> for the core Spark effort to attempt something like that would be "off
> mission". Spark's success to date has been more due to beautiful
> implementation of a known architecture, than beautiful new architecture.
> Basically, Spark does MapReduce 10-100x faster than Hadoop, and more people
> by now understand how to get MapReduce to solve their problems than any
> other parallel model. Spark sits natively on HDFS so that makes adoption a
> lot easier to swallow. So at present, for Spark to mature quickly along
> that successful trajectory, the key problems to address are more practical
> "user interface" or "productivity" things like manageability,
> deployability, fault-tolerance improvements, multi-user access, a bigger
> library of pre-packaged algorithms, etc.
>
> Whether MapReduce's own success is an accident of history or something more
> fundamental is subject to interesting debate. I remember being constantly
> amazed by the number of problems that when squinted at the right way
> becomes an MR-soluble problem at Google (starting ironically with PageRank
> itself). Yes, apparently sometimes it does pay to see many things as a nail
> when you have invested in a powerful hammer.
>
> Along those lines, here are some interesting perspectives on the beauty of
> Dryad/DryadLINQ, and at least one practical reason why it didn't succeed as
> an implementation.
>
>    - http://blogs.msdn.com/b/dryad/archive/2010/02/15/some-dryad-and-dryadlinq-history.aspx
>    - http://geekswithblogs.net/johnsPerfBlog/archive/2011/12/12/rip-dryadlinq-or-long-live-linq-to-hadoop.aspx
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen
>
> On Wed, Oct 23, 2013 at 2:33 PM, Alex Boisvert <alex.boisv...@gmail.com> wrote:
>
>> (Resending to @apache list instead of old google-group)
>>
>> A bit of a random question but I was wondering if there were efforts
>> underway to generalize / expand the Spark API towards something that would
>> be similar to the MBrace [1] model ... there's certainly an overlap between
>> the features of the systems already ... so I guess I'm thinking about an
>> API that's less centered around RDDs (as a collection) and more towards
>> distributed dataflow that would feel more like composing Promises/Futures
>> ... or even generalizing to support various sorts of container/context
>> monads.
>>
>> [1] "MBrace: Cloud Computing with Monads"
>> http://plosworkshop.org/2013/preprint/dzik.pdf