The distinction is in how they are used: matrix algorithms are generally macroscopic, operating over the entire graph at once, while graph algorithms are generally microscopic, operating locally within the matrix.
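That contrast can be made concrete with a toy sketch (the graph, vector, and names below are illustrative, not from this thread): one step of value propagation computed first as a whole-matrix product over the entire graph, then as per-edge local updates in the BSP/Pregel style.

```python
# Hypothetical 4-node directed graph (illustrative data only).
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Adjacency-matrix view: A[i][j] = 1 if there is an edge i -> j.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1

# "Macroscopic" matrix step: one whole-graph operation, x' = x A
# (a row vector times the adjacency matrix).
x = [1.0, 0.0, 0.0, 0.0]
x_matrix = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]

# "Microscopic" graph step: each vertex pushes its value along its
# out-edges, one local update at a time.
x_graph = [0.0] * n
for i, j in edges:
    x_graph[j] += x[i]

# Both views compute the same result.
assert x_matrix == x_graph
print(x_matrix)
```

The two loops do identical arithmetic; the difference is only in which unit of work (the whole matrix, or one vertex/edge) the runtime schedules.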
(Now that is a Big-Endian/Little-Endian dispute I had not considered.)
http://www.ietf.org/rfc/ien/ien137.txt

On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[email protected]> wrote:

> On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[email protected]> wrote:
>
>> Seems like the bigger thing I see us discussing/needing is a distributed
>> memory layer. Does each of these tools invent its own, or is there a good,
>> open (ASL-compatible) implementation out there somewhere that we could use?
>> Given such a layer, wouldn't it be fairly straightforward to implement both
>> graph-based and matrix-based approaches? Thinking aloud (and perhaps a bit
>> crazy), I wonder if one could simply implement a Hadoop filesystem that was
>> based on distributed memory (and persistable to disk, perhaps), thereby
>> allowing existing code to simply work.
>
> The problem with raw Hadoop jobs which are iterative is that they launch
> multiple jobs, which can get executed on whatever machines the JobTracker
> sends them to with open mapper slots. An in-memory HDFS would still have
> files living at various locations, not necessarily the same as where all of
> the mappers go, which means the chunks need to get moved over to the local
> disk of the mapper nodes. Now if the entire HDFS-accessible filesystem is on
> a memory-mapped filesystem, it would still go to memory, I guess, but this
> doesn't look like a very efficient process: Hadoop is optimized for
> streaming over big files, and the map/reduce shuffle requires a lot of disk
> (in this case, memory!) to do what it does as well.
>
> As for "matrix-based" vs. "graph-based": since every graph has an adjacency
> matrix which describes it, and every matrix can describe a (possibly
> bipartite) graph, there's an isomorphism hiding here. While I've always
> thought of everything as being a matrix, calling everything a graph
> probably works just as well, and the translation shouldn't be too terribly
> hard (famous last words).
>
> A big "distributed memory layer" does indeed sound great, however. Spark
> and Giraph both provide their own, although the former seems to lean more
> toward "read-only, with allowed side effects" and very general-purpose,
> while the latter is couched in the language of graphs, and computation is
> specifically BSP (currently), but allows for fairly arbitrary mutation (and
> persisting final results back to HDFS).
>
> -jake
>
>> --Grant
>>
>> On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote:
>>
>>> On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[email protected]> wrote:
>>>
>>>> I've since reached the conclusion that the thing I'm trying to compare
>>>> it to is a 'data grid', e.g. GigaSpaces.
>>>>
>>>> We want a large, evolving data structure, which is essentially cached
>>>> in memory, split over nodes.
>>>
>>> I should mention that Giraph certainly allows for the graph to change
>>> (both in edge values and in actual graph structure). But it's currently
>>> a very BSP-specific paradigm: run _this_ algorithm, via BSP, over _this_
>>> initial data set, until _this_ many iterations have run, then exit. You
>>> could hack it to do other things, but it wasn't the original intent,
>>> from what I can tell.
>>>
>>> -jake

--
Lance Norskog
[email protected]
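The matrix/graph isomorphism Jake describes can be sketched in a few lines (the representation here is illustrative, not from any particular library): any m x n matrix is the biadjacency matrix of a bipartite graph whose left vertices are rows and right vertices are columns, and the translation round-trips losslessly.

```python
def matrix_to_bipartite(M):
    """Nonzero entries become weighted edges (row_i, col_j, weight)."""
    return [(i, j, v)
            for i, row in enumerate(M)
            for j, v in enumerate(row)
            if v != 0]

def bipartite_to_matrix(edges, m, n):
    """Rebuild the dense m x n matrix from the bipartite edge list."""
    M = [[0] * n for _ in range(m)]
    for i, j, v in edges:
        M[i][j] = v
    return M

# Illustrative 2 x 3 matrix.
M = [[0, 2, 0],
     [1, 0, 3]]
edges = matrix_to_bipartite(M)

# Round-tripping recovers the original matrix exactly.
assert bipartite_to_matrix(edges, 2, 3) == M
print(edges)
```

A square matrix works the same way with a single (non-bipartite) vertex set, which is why "everything is a matrix" and "everything is a graph" are interchangeable framings, as the thread suggests.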
