On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[email protected]> wrote:
> Seems like the bigger thing I see us discussing/needing is a distributed
> memory layer.  Do each of these tools invent their own or is there a good,
> open (ASL compatible) implementation out there somewhere that we could use?
> Given such a layer, wouldn't it be fairly straightforward to implement both
> graph based and matrix based approaches?  Thinking aloud (and perhaps a bit
> crazy), I wonder if one could simply implement a Hadoop filesystem that was
> based on distributed memory (and persistable to disk, perhaps) thereby
> allowing existing code to simply work.
>

The problem with raw Hadoop jobs which are iterative is that they launch
multiple jobs, which can get executed on whatever machines the JobTracker
sends them to, wherever there are open mapper slots.  An in-memory HDFS
would still have files living at various locations, not necessarily the
same nodes the mappers land on, which means the chunks still need to get
moved over to the local disk of the mapper nodes.

Now if the entire HDFS-accessible filesystem is on a memory-mapped
filesystem, it would still go to memory, I guess, but this doesn't seem
like a very efficient process: Hadoop is optimized for streaming over big
files, and the map/reduce shuffle requires a lot of disk (in this case,
memory!) to do what it does as well.

As for "matrix-based" vs. "graph-based": since every graph has an
adjacency matrix which describes it, and every matrix can describe a
(possibly bipartite) graph, there's an isomorphism hiding here.  While
I've always thought of "everything as being a matrix", calling everything
a graph probably works just as well, and the translation shouldn't be too
terribly hard (famous last words).

A big "distributed memory layer" does indeed sound great, however.
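To make that matrix/graph equivalence concrete, here's a minimal sketch of the
translation in plain Python (helper names are illustrative only, not from any
of the projects discussed):

```python
# Sketch: the same data viewed as an adjacency matrix and as an edge list.
# Names here are illustrative, not from Mahout, Giraph, or Spark.

def matrix_to_edges(matrix):
    """Adjacency matrix (list of rows) -> list of (i, j, weight) edges."""
    return [(i, j, w)
            for i, row in enumerate(matrix)
            for j, w in enumerate(row)
            if w != 0]

def edges_to_matrix(edges, n):
    """List of (i, j, weight) edges -> n x n adjacency matrix."""
    matrix = [[0] * n for _ in range(n)]
    for i, j, w in edges:
        matrix[i][j] = w
    return matrix

# Round-trip: both representations carry exactly the same information.
m = [[0, 2, 0],
     [0, 0, 3],
     [1, 0, 0]]
edges = matrix_to_edges(m)        # [(0, 1, 2), (1, 2, 3), (2, 0, 1)]
assert edges_to_matrix(edges, 3) == m
```

The asymmetric case works too: an m x n matrix becomes a bipartite graph with
m "row" vertices and n "column" vertices, which is exactly the users-by-items
view that recommenders take.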
Spark and Giraph both provide their own, although the former seems to lean
more toward "read-only, with allowed side-effects", and very general
purpose, while the latter is couched in the language of graphs, and
computation is specifically BSP (currently), but allows for fairly
arbitrary mutation (and persisting final results back to HDFS).

  -jake


--Grant

>
> On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote:
>
> > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[email protected]>
> > wrote:
> >
> >> I've since reached the conclusion that the thing I'm trying to compare
> >> it to is a 'data grid', e.g. gigaspaces.
> >>
> >> We want a large, evolving, data structure, which is essentially cached
> >> in memory split over nodes.
> >>
> >
> > I should mention that Giraph certainly allows for the graph to change
> > (both in edge values, and in actual graph structure).  But it's
> > currently a very BSP-specific paradigm: run _this_ algorithm, via BSP,
> > over _this_ initial data set, until _this_ many iterations have run,
> > then exit.  You could hack it to do other things, but it wasn't the
> > original intent, from what I can tell.
> >
> > -jake
>
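[Editor's note: for readers unfamiliar with the BSP model mentioned above,
here is a toy single-machine sketch of the superstep pattern -- vertices
compute from incoming messages, send messages, and synchronize at a barrier
between supersteps.  This is illustrative Python only; Giraph itself is a
distributed Java system.]

```python
# Toy BSP loop propagating the maximum vertex value along directed edges.
# Each superstep: every vertex with mail reads its messages, updates its
# value if a larger one arrived, and forwards the new value to neighbors.
# When no messages are in flight, every vertex halts (as in Pregel/Giraph).

def bsp_max_value(edges, values, max_supersteps=20):
    # Superstep 0: every vertex sends its own value to its out-neighbors.
    inbox = {}
    for v in values:
        for dst in edges.get(v, []):
            inbox.setdefault(dst, []).append(values[v])

    for _ in range(max_supersteps):
        outbox = {}
        for v, msgs in inbox.items():
            new_val = max(msgs)
            if new_val > values[v]:
                values[v] = new_val
                for dst in edges.get(v, []):
                    outbox.setdefault(dst, []).append(new_val)
        if not outbox:        # no messages in flight: all vertices halt
            break
        inbox = outbox        # barrier: next superstep begins
    return values

edges = {0: [1], 1: [2], 2: [0]}
print(bsp_max_value(edges, {0: 5, 1: 9, 2: 1}))   # {0: 9, 1: 9, 2: 9}
```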
