I guess my question is: is it a better framework, and for what kind of problem? It says it gains speed for iterative algorithms by keeping data in memory. That's a fine tradeoff to make, but it sounds like a point on the same "efficient frontier" of tradeoffs that any good system lives on. It's also a tradeoff you can already kind of make on Hadoop, and that some of the implementations here already do: loading via the distributed cache or side-loading from HDFS. At first glance I'd guess it's "a bit better" for some kinds of problems.
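For what it's worth, the pattern they're selling looks roughly like the sketch below (minimal Scala against the basic Spark API as I read their docs; the HDFS path and the toy update rule are made up for illustration):

  import spark.SparkContext
  import spark.SparkContext._

  object IterativeSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local", "IterativeSketch")
      // Parse once and pin the result in memory; later passes reuse
      // the cached data instead of rescanning HDFS, which is where
      // the claimed win for iterative algorithms comes from.
      val values = sc.textFile("hdfs://.../data.txt")  // path made up
                     .map(_.toDouble)
                     .cache()
      var estimate = 0.0
      for (i <- 1 to 10) {
        // Toy fixed-point update: each iteration is a full pass
        // over the cached dataset, not a new MapReduce job.
        val delta = values.map(v => v - estimate).reduce(_ + _) / values.count()
        estimate += delta
      }
      println("estimate: " + estimate)
    }
  }

The equivalent on plain Hadoop would be one MapReduce job per iteration, each rereading the input from HDFS, which is the overhead they're claiming to avoid.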
So what's the "cost" of using this? You certainly wouldn't want to replace the Hadoop-based version, as the audience for that is much greater. Having an additional implementation doesn't hurt anyone. But it's another dependency and item to support (unless it's not going to get supported), and there is some small harm in having one isolated orphan implementation in the project: we've been trying to kill rather than feed such orphans lately.

I personally don't think that this project needs more algorithms, and am directing all my time to what I view as more essential infrastructure tasks. Or to put it another way: how about tidying up Hadoop before moving on? That's just me. My gut says it would be cool to implement the SVD on something like this to see how it goes. I don't yet see that this is anything to move to.

On Thu, Jun 16, 2011 at 6:34 PM, Hector Yee <[email protected]> wrote:
> What do people think of using Spark for iterative jobs:
>
> http://www.spark-project.org/
>
> Or is there a new version of hadoop that supports this kind of computation?
>
> --
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)
>
