The distinction is in how they are used: matrix algorithms are generally macroscopic, operating over the entire graph at once, while graph algorithms are generally microscopic, operating locally within the matrix.
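That contrast can be made concrete with a toy sketch (the graph, vector, and names below are illustrative, not from this thread): one step of value propagation computed first as a whole-matrix product over the entire graph, then as per-edge local updates in the BSP/Pregel style.

```python
# Hypothetical 4-node directed graph (illustrative data only).
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Adjacency-matrix view: A[i][j] = 1 if there is an edge i -> j.
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1

# "Macroscopic" matrix step: one whole-graph operation, x' = x A
# (a row vector times the adjacency matrix).
x = [1.0, 0.0, 0.0, 0.0]
x_matrix = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]

# "Microscopic" graph step: each vertex pushes its value along its
# out-edges, one local update at a time.
x_graph = [0.0] * n
for i, j in edges:
    x_graph[j] += x[i]

# Both views compute the same result.
assert x_matrix == x_graph
print(x_matrix)
```

The two loops do identical arithmetic; the difference is only in which unit of work (the whole matrix, or one vertex/edge) the runtime schedules.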
(Now that is a Big-Endian/Little-Endian dispute I had not considered.)
http://www.ietf.org/rfc/ien/ien137.txt

On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[email protected]> wrote:

> On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[email protected]> wrote:
>
>> Seems like the bigger thing I see us discussing/needing is a distributed
>> memory layer. Does each of these tools invent its own, or is there a good,
>> open (ASL-compatible) implementation out there somewhere that we could use?
>> Given such a layer, wouldn't it be fairly straightforward to implement both
>> graph-based and matrix-based approaches? Thinking aloud (and perhaps a bit
>> crazy), I wonder if one could simply implement a Hadoop filesystem that was
>> based on distributed memory (and persistable to disk, perhaps), thereby
>> allowing existing code to simply work.
>
> The problem with raw Hadoop jobs which are iterative is that they launch
> multiple jobs, which can get executed on whatever machines the JobTracker
> sends them to with open mapper slots. An in-memory HDFS would still have
> files living at various locations, not necessarily the same as where all of
> the mappers go, which means the chunks need to get moved over to the local
> disk of the mapper nodes. Now if the entire HDFS-accessible filesystem is on
> a memory-mapped filesystem, it would still go to memory, I guess, but this
> doesn't look like a very efficient process: Hadoop is optimized for
> streaming over big files, and the map/reduce shuffle requires a lot of disk
> (in this case, memory!) to do what it does as well.
>
> As for "matrix-based" vs. "graph-based": since every graph has an adjacency
> matrix which describes it, and every matrix can describe a (possibly
> bipartite) graph, there's an isomorphism hiding here. While I've always
> thought of everything as being a matrix, calling everything a graph
> probably works just as well, and the translation shouldn't be too terribly
> hard (famous last words).
>
> A big "distributed memory layer" does indeed sound great, however. Spark
> and Giraph both provide their own, although the former seems to lean more
> toward "read-only, with allowed side effects" and very general-purpose,
> while the latter is couched in the language of graphs, and computation is
> specifically BSP (currently), but allows for fairly arbitrary mutation (and
> persisting final results back to HDFS).
>
> -jake
>
>> --Grant
>>
>> On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote:
>>
>>> On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[email protected]> wrote:
>>>
>>>> I've since reached the conclusion that the thing I'm trying to compare
>>>> it to is a 'data grid', e.g. GigaSpaces.
>>>>
>>>> We want a large, evolving data structure, which is essentially cached
>>>> in memory, split over nodes.
>>>
>>> I should mention that Giraph certainly allows for the graph to change
>>> (both in edge values and in actual graph structure). But it's currently
>>> a very BSP-specific paradigm: run _this_ algorithm, via BSP, over _this_
>>> initial data set, until _this_ many iterations have run, then exit. You
>>> could hack it to do other things, but it wasn't the original intent,
>>> from what I can tell.
>>>
>>> -jake

--
Lance Norskog
[email protected]
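The matrix/graph isomorphism Jake describes can be sketched in a few lines (the representation here is illustrative, not from any particular library): any m x n matrix is the biadjacency matrix of a bipartite graph whose left vertices are rows and right vertices are columns, and the translation round-trips losslessly.

```python
def matrix_to_bipartite(M):
    """Nonzero entries become weighted edges (row_i, col_j, weight)."""
    return [(i, j, v)
            for i, row in enumerate(M)
            for j, v in enumerate(row)
            if v != 0]

def bipartite_to_matrix(edges, m, n):
    """Rebuild the dense m x n matrix from the bipartite edge list."""
    M = [[0] * n for _ in range(m)]
    for i, j, v in edges:
        M[i][j] = v
    return M

# Illustrative 2 x 3 matrix.
M = [[0, 2, 0],
     [1, 0, 3]]
edges = matrix_to_bipartite(M)

# Round-tripping recovers the original matrix exactly.
assert bipartite_to_matrix(edges, 2, 3) == M
print(edges)
```

A square matrix works the same way with a single (non-bipartite) vertex set, which is why "everything is a matrix" and "everything is a graph" are interchangeable framings, as the thread suggests.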
