On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[email protected]> wrote:

> Seems like the bigger thing I see us discussing/needing is a distributed
> memory layer.  Do each of these tools invent their own or is there a good,
> open (ASL compatible) implementation out there somewhere that we could use?
>  Given such a layer, wouldn't it be fairly straightforward to implement both
> graph based and matrix based approaches?  Thinking aloud (and perhaps a bit
> crazy), I wonder if one could simply implement a Hadoop filesystem that was
> based on distributed memory (and persistable to disk, perhaps) thereby
> allowing existing code to simply work.
>

The problem with iterative raw Hadoop jobs is that they launch multiple
jobs, each of which can get executed on whatever machines with open mapper
slots the JobTracker sends them to.  An in-memory HDFS would still have
files living at various locations, not necessarily the same nodes the
mappers land on, which means the chunks need to get moved over to the local
disk of the mapper nodes.  Now if the entire HDFS-accessible filesystem is
on a memory-mapped filesystem, it would still go to memory, I guess, but
this doesn't seem like a very efficient process: Hadoop is optimized for
streaming over big files, and the map/reduce shuffle requires a lot of disk
(in this case, memory!) to do what it does as well.

As for "matrix-based" vs. "graph-based": since every graph has an adjacency
matrix which describes it, and every matrix describes a (possibly
bipartite) graph, there's an isomorphism hiding here.  While I've always
thought of "everything as being a matrix", calling everything a graph
probably works just as well, and the translation shouldn't be too terribly
hard (famous last words).
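To make that isomorphism concrete, here's a minimal sketch (names and setup
are mine, not from any of the projects discussed) showing that multiplying
by the adjacency matrix is the same computation as one round of per-edge
message passing over the graph:

```python
# A small directed graph as an edge list: (src, dst)
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3
values = [1.0, 2.0, 3.0]  # one value per vertex

# Matrix view: y[j] = sum_i A[i][j] * x[i], where A[i][j] = 1 iff edge (i, j)
A = [[0.0] * n for _ in range(n)]
for src, dst in edges:
    A[src][dst] = 1.0
y_matrix = [sum(A[i][j] * values[i] for i in range(n)) for j in range(n)]

# Graph view: each vertex sends its value along its out-edges;
# each destination sums the messages it receives
y_graph = [0.0] * n
for src, dst in edges:
    y_graph[dst] += values[src]

assert y_matrix == y_graph  # same answer, two vocabularies
```

The translation between the two views really is just a change of
vocabulary: rows/columns become vertices, nonzero entries become edges.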

A big "distributed memory layer" does indeed sound great, however.  Spark
and Giraph both provide their own, although the former seems to lean more
toward "read-only, with allowed side-effects", and very general purpose,
while the latter is couched in the language of graphs, and computation is
specifically BSP (currently), but allows for fairly arbitrary mutation (and
persisting final results back to HDFS).
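For anyone who hasn't seen BSP up close, here's a single-machine sketch of
the superstep pattern Giraph uses (compute, exchange messages, barrier).
The function names and signatures are hypothetical, purely illustrative of
the model; this is not Giraph's actual API:

```python
def bsp_run(vertices, out_edges, compute, max_supersteps):
    """vertices: dict vertex -> value; out_edges: dict vertex -> [neighbors];
    compute(value, inbox) -> (new_value, message_or_None)."""
    inboxes = {v: [] for v in vertices}
    for _ in range(max_supersteps):
        next_inboxes = {v: [] for v in vertices}
        for v, value in vertices.items():
            new_value, msg = compute(value, inboxes[v])
            vertices[v] = new_value
            if msg is not None:
                for dst in out_edges.get(v, []):
                    next_inboxes[dst].append(msg)
        # barrier: messages sent this superstep become visible next superstep
        inboxes = next_inboxes
    return vertices

# Example: propagate the maximum value around a 3-cycle 0 -> 1 -> 2 -> 0
out_edges = {0: [1], 1: [2], 2: [0]}
verts = {0: 5.0, 1: 1.0, 2: 3.0}
result = bsp_run(verts, out_edges,
                 lambda val, inbox: (max([val] + inbox),) * 2,
                 max_supersteps=3)
# after enough supersteps, every vertex converges to the global max
```

Note how the barrier is the whole point: vertex values can mutate freely
within a superstep, but messages only become visible at the next one.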

  -jake

> --Grant
>
> On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote:
>
> > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[email protected]
> >wrote:
> >
> >> I've since reached the conclusion that the thing I'm trying to compare
> >> it to is a 'data grid', e.g. gigaspaces.
> >>
> >> We want a large, evolving, data structure, which is essentially cached
> >> in memory split over nodes.
> >>
> >
> > I should mention that Giraph certainly allows for the graph to change
> > (both in edge values, and in actual graph structure).  But it's
> > currently a very BSP-specific paradigm: run _this_ algorithm, via BSP,
> > over _this_ initial data set, until _this_ many iterations have run,
> > then exit.  You could hack it to do other things, but it wasn't the
> > original intent, from what I can tell.
> >
> >  -jake
>
>
>
