The interface you propose sounds very reasonable. +1 for that.

For floats vs double, they consume much less memory, O(|E|) reduction, and
there is basically never any need to have a double precision on edges
because they are immutable.
Anyway it's not a strong constraint, double edges would work as well.

Cheers,

--
Gianmarco


On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> wrote:

> Maja and I have thought of a way we could include this type of
> functionality without it being burdensome of the user.
>
> We could add the following interfaces:
>
> interface StrictRandomAccesVertexEdges extends VertexEdges {
>   E getEdgeValue(I)
> }
>
> interface MultiRandomAccessVertexEdges extends VertexEdges {
>   Iterable<E> getAllEdgeValues(I)
> }
>
> We could add the following to the Vertex interface:
>
> // Returns the first edge value for a given target vertex id (or null if
> none)
> E getEdgeValue(I targetVertexId) {
>   // if we have an instance of StrictRandomAccessVertexEdges, we delegate;
>   // otherwise, we iterate over the edges to find the first one
> }
>
> // Returns an Iterable over all edge values for a given target vertex id
> (only useful for multigraphs)
> Iterable<E> getAllEdgeValues(I targetVertexId) {
>   // if we have an instance of MultiRandomAccessVertexEdges, we delegate;
>   // otherwise, we wrap the edges into an Iterable that filters for the
> matching ones
> }
>
> This way, only VertexEdges implementations where it makes sense would
> define the random-access methods, and a default (slow!) implementation
> would be available in all other cases.
>
> How does this sound? I could include it in GIRAPH-528.
>
> On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote:
>
> >I don't contest the graph-matrix duality, but I'm not yet convinced that
> >you need random access to the edges for the scenarios you mentioned.
> >PageRank and label propagation, two of the most common applications of our
> >framework, have indeed a linear algebraic formulation, the bulk of which
> >is always matrix multiplication. The most natural and efficient (if not
> >the only) implementations in our model involve simply iterating over the
> >edges.
> >As for your "dot products" example, maybe it would help if you could be a
> >bit more specific: what are your vertices and edges? How do we get the
> >"input vectors"?. Everything should be expressed in terms of vertex
> >values, edges, and messages, otherwise it doesn't fit the current Giraph
> >model anyway.
> >In any case, to compute a dot product efficiently you don't need random
> >access to both vectors: if you have the non-zero coordinates of vector A
> >(equivalent of the edge list), and random access to vector B, you iterate
> >over A and lookup in B. This is efficient.
> >
> >> I'm not sure what this would correspond to in graphland, but I can
> >> certainly see
> >> wanting to have a big in-memory matrix which you can compute dot
> >>products
> >> of vectors with each row of it, and Giraph would do this very
> >>efficiently
> >> (but some
> >> of the implementations of this would assume you had random access to the
> >> edges (meaning edgeId and edgeWeight of each vertex).
> >
> >
> >Here we are talking about a method with signature
> >
> >    EdgeValueType getEdgeValue(VertexIdType targetVertexId)
> >
> >or the multigraph version
> >
> >    Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId)
> >
> >I'm looking for examples of algorithms that can't be implemented in Giraph
> >because of the lack of this method, and would otherwise be a good fit.
> >
> >Having said that, I think there *is* still a case for hash-map-backed
> >edges, and that is when we need to guarantee a "strict graph" (i.e. no
> >parallel edges) or when our algorithm uses remote edge removal requests.
> >So I'm not against having an optimized LongDoubleHashMapEdges (and
> >LongNullHashSetEdges for the unweighted case) included, provided that it's
> >correct and efficient.
> >I will make sure I include it. Now it's a matter of picking a primitive
> >collections library among HPPC, fastutil, Mahout, Trove, etc...
> >
> >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote:
> >
> >>Sorry to have missed out on some of this conversation - had some work
> >>stuff
> >>interrupt (how dare it!)
> >>
> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
> >><[email protected]>wrote:
> >>
> >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote:
> >>>
> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
> >>> ><[email protected]>wrote:
> >>> >
> >>> >> It's not like it causes problems, it's just that it's a pretty big
> >>> >> dependency to justify for a small example.
> >>> >>
> >>> >> As for the motivation, if your point is to prove the framework's
> >>> >> superiority in some context, then you can use the simplest possible
> >>> >> implementation (ArrayList).
> >>> >>
> >>> >
> >>> >This takes up LOTS of memory.  Primitives rule, objects drool
> >>>(memory).
> >>>
> >>> Ok, but having to copy the keys and values to external arrays (which is
> >>> what you have to do with Mahout's hash map) is even worse. A good
> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex) is
> >>> primitive arrays.
> >>>
> >>
> >>So it is certainly true that the *typical* usage of iteration over Mahout
> >>maps
> >>is by doing this copy of internal values (which is not efficient, no),
> >>it's
> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
> >>passes in a callback which operates directly on the underlying primitive
> >>arrays.
> >>
> >>
> >>>
> >>> >
> >>> >
> >>> >> The Giraph framework is all about iterating over edges, so an
> >>> >> implementation that doesn't support that with reasonable efficiency
> >>> >> doesn't make a lot of sense to me.
> >>> >>
> >>> >> It also follows that hash map-based implementations only make sense
> >>>for
> >>> >> algorithms that make use of mutations.
> >>> >>
> >>> >
> >>> >Agreed, maps aren't absolutely necessary for immutable graphs.  But
> >>>random
> >>> >access to collections of integers _is_ necessary for many algorithms.
> >>> >It's
> >>> >not all just iteration over lists.
> >>>
> >>> Can you give me some concrete examples of algorithms where random
> >>>access
> >>> to the edges is required?
> >>> I'm really interested in this, because I'm considering killing
> >>> getEdgeValue() if there are no use cases.
> >>>
> >>
> >>So I'm not a graph person, I think in terms of matrices, but since every
> >>graph
> >>has an adjecency matrix, and every matrix is the adjacency matrix of some
> >>(possibly
> >>bipartite) graph, lets pretend they're basically the same:
> >>
> >>If you load a graph into Giraph, and want to compute the matrix product
> >>of
> >>this
> >>graph with collection of input vectors, then you may want to take each of
> >>these
> >>input vectors and compute their dot products with the vertices of the
> >>graph
> >>(to
> >>mix my metaphors: each vertex is essentially a row or column of the
> >>matrix),
> >>which can be done by either iterating over the input vector and looking
> >>up
> >>randomly for nonzero entries in the vertex, or the reverse.
> >>
> >>I'm not sure what this would correspond to in graphland, but I can
> >>certainly see
> >>wanting to have a big in-memory matrix which you can compute dot products
> >>of vectors with each row of it, and Giraph would do this very efficiently
> >>(but some
> >>of the implementations of this would assume you had random access to the
> >>edges (meaning edgeId and edgeWeight of each vertex).
> >>
> >>
> >>>
> >>> >
> >>> >
> >>> >> That said, something like a Trove hash map would probably be more
> >>> >> appropriate (more efficient than the standard Java HashMap, at the
> >>> >>expense
> >>> >> of generality). That could be a good candidate for a
> >>> >> LongDoubleHashMapEdges implementation.
> >>> >> I can give that a shot if it sounds good.
> >>> >>
> >>> >
> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
> >>>
> >>> Whoops, totally missed that part.
> >>>
> >>> >
> >>> >What does Giraph depend on of Mahout now?  Just mahout-collections?
> >>> >That's
> >>> >not a very big dependency, and has all sorts of primitive-to-primitive
> >>> >collections,
> >>> >and from what I've seen benchmarked, just as good or better than
> >>>Trove.
> >>> > Carrot2's
> >>> >hppc may be better yet, but I'm not sure if that is stably released
> >>>yet.
> >>>
> >>> I'll take a look at HPPC and Mahout collections then. They all seem to
> >>> provide the same stuff, so I'll consider benchmarks and convenience of
> >>>the
> >>> API. Thanks for the pointers.
> >>>
> >>> >
> >>> >
> >>> >>
> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote:
> >>> >>
> >>> >> >Is the mahout dependency causing problems?
> >>> >> >
> >>> >> >It would be nice if we could actually implement some of the
> >>>algorithms
> >>> >> >that
> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off
> >>>how
> >>> >>it
> >>> >> >improves things.  Using the Mahout primitives can show that it's
> >>>not
> >>> >>about
> >>> >> >the inner loop implementation, but the framework itself...
> >>> >> >
> >>> >> >
> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
> >>> >> ><[email protected]>wrote:
> >>> >> >
> >>> >> >> I like the idea of refactoring it into something more appropriate
> >>> >>for us
> >>> >> >> and ditching the Mahout dep. Good looking out.
> >>> >> >>
> >>> >> >>
> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella <
> >>> >> >> [email protected]> wrote:
> >>> >> >>
> >>> >> >> > I agree, at this point we could have a RandomWalkVertex with
> >>>edge
> >>> >> >>values,
> >>> >> >> > and a "null-edged" vertex for the PR benchmarks.
> >>> >> >> > We make everybody happy and avoid code duplication.
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
> >>> >><[email protected]
> >>> >> >> > >wrote:
> >>> >> >> >
> >>> >> >> > > Hi Gianmarco,
> >>> >> >> > >
> >>> >> >> > > Yes, there will be more efficient implementations.
> >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be
> >>>only
> >>> >>one
> >>> >> >> > Vertex
> >>> >> >> > > class and edge storage is delegated to a VertexEdges class.
> >>> >> >> > > So far I'm adding some generic implementations
> >>>(ByteArrayEdges,
> >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and
> >>>some
> >>> >> >> optimized
> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
> >>> >> >> > >
> >>> >> >> > > Do you specifically need edge values to be float while the
> >>>other
> >>> >> >>types
> >>> >> >> > are
> >>> >> >> > > double?
> >>> >> >> > > It seems to me it would make sense to change RandomWalkVertex
> >>>to
> >>> >>use
> >>> >> >> > > double edge values instead, and avoid code duplication (i.e.
> >>> >>adding
> >>> >> >>a
> >>> >> >> > > LongFloatArrayEdges that's basically the same). We're not
> >>>Trove
> >>> >> >>after
> >>> >> >> > all.
> >>> >> >> > > Makes sense?
> >>> >> >> > >
> >>> >> >> > > Thanks for the feedback,
> >>> >> >> > >
> >>> >> >> > > Alessandro
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales"
> >>> >> >><[email protected]>
> >>> >> >> > > wrote:
> >>> >> >> > >
> >>> >> >> > > >Hi,
> >>> >> >> > > >
> >>> >> >> > > >Maybe the specific implementation can be thrown away, but
> >>> >> >>personally I
> >>> >> >> > > >feel
> >>> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble
> >>> >>vertex.
> >>> >> >> > > >It's the base for any serious random walk algorithm.
> >>> >> >> > > >
> >>> >> >> > > >I would call for a refactoring rather than a removal.
> >>> >> >> > > >
> >>> >> >> > > >Just my 2c.
> >>> >> >> > > >
> >>> >> >> > > >Cheers,
> >>> >> >> > > >
> >>> >> >> > > >--
> >>> >> >> > > >Gianmarco
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta
> >>> >> >> > > ><[email protected]>wrote:
> >>> >> >> > > >
> >>> >> >> > > >> Hi all,
> >>> >> >> > > >>
> >>> >> >> > > >> Does anyone feel strongly for LongDoubleFloatDoubleVertex?
> >>> >> >> > > >> Reasons why I think it should be removed:
> >>> >> >> > > >>
> >>> >> >> > > >>   1.  Right now it's incorrect (returns target vertex id
> >>>as
> >>> >>edge
> >>> >> >> > value).
> >>> >> >> > > >>   2.  Iteration will always be inefficient, since the
> >>> >>underlying
> >>> >> >> > Mahout
> >>> >> >> > > >> open-addressing hash map implementation doesn't provide
> >>> >> >>iterators.
> >>> >> >> It
> >>> >> >> > > >> provides a way to copy the keys and values to external
> >>> >> >>arrays/lists.
> >>> >> >> > > >>   3.  It's the only reason why we have Mahout as a
> >>>dependency.
> >>> >> >> > > >>
> >>> >> >> > > >> I think we should strive to provide model implementations
> >>>that
> >>> >> >>are
> >>> >> >> > > >>generic
> >>> >> >> > > >> and/or extremely efficient. This one satisfies neither.
> >>> >> >> > > >>
> >>> >> >> > > >> Thanks,
> >>> >> >> > > >>
> >>> >> >> > > >> Alessandro
> >>> >> >> > > >>
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > --
> >>> >> >> >    Claudio Martella
> >>> >> >> >    [email protected]
> >>> >> >> >
> >>> >> >>
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> >--
> >>> >> >
> >>> >> >  -jake
> >>> >>
> >>> >>
> >>> >
> >>> >
> >>> >--
> >>> >
> >>> >  -jake
> >>>
> >>>
> >>
> >>
> >>--
> >>
> >>  -jake
> >
>
>

Reply via email to