On Thu, Sep 15, 2011 at 8:25 PM, Lance Norskog <[email protected]> wrote:
> The distinction is in uses: matrix algorithms are generally macroscopic
> over the entire graph, while graph algorithms are generally microscopic
> within the entire matrix.
>
I'm not sure I've seen that distinction. All the work I've done on Mahout's
DistributedRowMatrix is built around the idea that the rows of the matrix
do no direct communication with each other (they may be handled by
different Mapper instances, possibly on different machines). Similarly,
BSP calculations on a graph assume that each vertex, with its attendant
edges (read: a matrix row or column), communicates only with its direct
neighbors. PageRank is a good example of a global calculation, typically
done iteratively over the entire graph / matrix as a series of operations
that each aggregate lots of local ("microscopic") graph / matrix sub-parts.
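
To make that concrete, here's a rough single-machine sketch of PageRank as
repeated row-local aggregation (plain Java; the data layout and the damping
constant are just illustrative):

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Minimal sketch: each "row" (a vertex and its out-edges) only touches its
// direct neighbors; the global ranking emerges from iterating these local
// updates.
public class PageRankSketch {
  public static double[] pageRank(Map<Integer, List<Integer>> outLinks,
                                  int numVertices, int iterations,
                                  double damping) {
    double[] rank = new double[numVertices];
    Arrays.fill(rank, 1.0 / numVertices);
    for (int iter = 0; iter < iterations; iter++) {
      double[] next = new double[numVertices];
      Arrays.fill(next, (1.0 - damping) / numVertices);
      // "Microscopic" step: each vertex spreads its rank over its out-edges.
      for (Map.Entry<Integer, List<Integer>> row : outLinks.entrySet()) {
        List<Integer> targets = row.getValue();
        if (targets.isEmpty()) {
          continue;  // dangling vertices ignored in this rough sketch
        }
        double share = damping * rank[row.getKey()] / targets.size();
        for (int target : targets) {
          next[target] += share;
        }
      }
      // "Macroscopic" result: just the aggregate of all the local updates.
      rank = next;
    }
    return rank;
  }
}

The distributed versions only partition the rows / vertices and turn the
inner loop into message passing or a matrix-vector job; the shape of the
computation stays the same.
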
-jake
>
> (Now that is a Big-Endian/Little-Endian dispute I had not considered.)
>
> http://www.ietf.org/rfc/ien/ien137.txt
>
> On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[email protected]>
> wrote:
>
> > On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > > Seems like the bigger thing I see us discussing/needing is a
> > > distributed memory layer. Does each of these tools invent its own, or
> > > is there a good, open (ASL-compatible) implementation out there
> > > somewhere that we could use? Given such a layer, wouldn't it be fairly
> > > straightforward to implement both graph-based and matrix-based
> > > approaches? Thinking aloud (and perhaps a bit crazy), I wonder if one
> > > could simply implement a Hadoop filesystem that was based on
> > > distributed memory (and persistable to disk, perhaps), thereby
> > > allowing existing code to simply work.
> > >
> >
> > The problem with iterative raw Hadoop jobs is that they launch multiple
> > jobs, which can get executed on whichever machines with open mapper
> > slots the JobTracker sends them to. An in-memory HDFS would still have
> > files living at various locations, not necessarily the same as where
> > all of the mappers go, which means the chunks need to get moved over to
> > the local disk of the mapper nodes. Now if the entire HDFS-accessible
> > filesystem is on a memory-mapped filesystem, it would still go to
> > memory, I guess, but this doesn't seem like a very efficient process:
> > Hadoop is optimized for streaming over big files, and the map/reduce
> > shuffle requires a lot of disk (in this case, memory!) to do what it
> > does as well.
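
To spell out why iteration hurts there: each pass is a brand-new Job
submission, with its own scheduling, its own input splits, and its own trip
through the shuffle and out to HDFS. A driver for such a computation looks
roughly like this (paths and naming are placeholders; the algorithm-specific
Mapper/Reducer setup is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of an iterative driver: one full MapReduce job per iteration, and
// nothing stays in memory between passes.
public class IterativeDriver {
  public static void run(Configuration conf, Path input, int iterations)
      throws Exception {
    Path current = input;
    for (int i = 0; i < iterations; i++) {
      Path next = new Path(input.getParent(), "iteration-" + i);  // placeholder layout
      Job job = new Job(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // job.setMapperClass(...); job.setReducerClass(...);  // algorithm-specific
      FileInputFormat.addInputPath(job, current);
      FileOutputFormat.setOutputPath(job, next);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("iteration " + i + " failed");
      }
      current = next;  // the next pass re-reads everything from storage
    }
  }
}

Every iteration re-reads its input from wherever the previous one wrote it,
regardless of whether that storage is backed by disk or memory.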
> >
> > As for "matrix-based" vs. "graph-based", since every graph has an
> > adjacency matrix which describes it, and every matrix can describe a
> > (possibly bipartite) graph, there's an isomorphism hiding here, and
> > while I've always thought of "everything as being a matrix", calling
> > everything a graph probably works just as well, and the translation
> > shouldn't be too terribly hard (famous last words).
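
The translation really is that mechanical; for example, turning an edge
list into sparse adjacency rows (plain Java maps here, standing in for
whatever sparse vector class you'd actually use) is just:

import java.util.HashMap;
import java.util.Map;

// Sketch of the graph <-> matrix correspondence: an edge (from, to, weight)
// is exactly a nonzero entry A[from][to] of the adjacency matrix, and each
// vertex's out-edges form one sparse row.
public class EdgesToRows {
  public static Map<Integer, Map<Integer, Double>> toSparseRows(
      int[][] edges, double[] weights) {
    Map<Integer, Map<Integer, Double>> rows =
        new HashMap<Integer, Map<Integer, Double>>();
    for (int e = 0; e < edges.length; e++) {
      int from = edges[e][0];
      int to = edges[e][1];
      Map<Integer, Double> row = rows.get(from);
      if (row == null) {
        row = new HashMap<Integer, Double>();
        rows.put(from, row);
      }
      row.put(to, weights[e]);  // A[from][to] = weight
    }
    return rows;
  }
}

For a rectangular matrix, the rows and columns index the two vertex sets of
a bipartite graph, and the same loop applies.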
> >
> > A big "distributed memory layer" does indeed sound great, however. Spark
> > and Giraph both provide their own, although the former seems to lean
> > more toward "read-only, with allowed side-effects" and is very
> > general-purpose, while the latter is couched in the language of graphs;
> > its computation is specifically BSP (currently), but it allows for
> > fairly arbitrary mutation (and persisting final results back to HDFS).
> >
> > -jake
> >
> > > --Grant
> > >
> > > On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote:
> > >
> > > > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[email protected]> wrote:
> > > >
> > > >> I've since reached the conclusion that the thing I'm trying to
> > > >> compare it to is a 'data grid', e.g. gigaspaces.
> > > >>
> > > >> We want a large, evolving data structure, which is essentially
> > > >> cached in memory, split over nodes.
> > > >>
> > > >
> > > > I should mention that Giraph certainly allows the graph to change
> > > > (both in edge values and in actual graph structure). But it's
> > > > currently a very BSP-specific paradigm: run _this_ algorithm, via
> > > > BSP, over _this_ initial data set, until _this_ many iterations have
> > > > run, then exit. You could hack it to do other things, but that
> > > > wasn't the original intent, from what I can tell.
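
To make the BSP shape concrete, a vertex program ends up looking roughly
like this (a generic sketch, not Giraph's actual API; the superstep loop,
message delivery, and halting are the framework's job):

// Illustrative BSP vertex, PageRank-flavored. Every superstep the framework
// hands each vertex the messages sent to it in the previous superstep; the
// vertex updates its value, sends new messages along its out-edges (which it
// could also mutate), and eventually votes to halt.
public abstract class PageRankishVertex {
  private static final int MAX_SUPERSTEPS = 30;

  protected double value;           // the vertex's current rank
  protected long[] targetVertices;  // out-edges

  public void compute(Iterable<Double> messages, int superstep) {
    if (superstep > 0) {
      double sum = 0.0;
      for (double m : messages) {
        sum += m;                   // aggregate what neighbors sent last superstep
      }
      value = 0.15 + 0.85 * sum;    // damping constants are just illustrative
    }
    if (superstep < MAX_SUPERSTEPS && targetVertices.length > 0) {
      for (long target : targetVertices) {
        sendMessage(target, value / targetVertices.length);
      }
    } else {
      voteToHalt();  // "until _this_ many iterations have run, then exit"
    }
  }

  // Hooks the real framework would provide:
  protected abstract void sendMessage(long target, double message);
  protected abstract void voteToHalt();
}

Running a different algorithm, or changing the stopping rule, means writing
a different compute() and re-launching the whole thing, which is the "hack
it to do other things" part.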
> > > >
> > > > -jake
> > >
> > >
> > >
> >
>
>
>
> --
> Lance Norskog
> [email protected]
>