Yep, I'm also +1 on the approach.
On Mon, Mar 4, 2013 at 11:47 PM, Gianmarco De Francisci Morales < [email protected]> wrote: > The interface you propose sounds very reasonable. +1 for that. > > For floats vs double, they consume much less memory, O(|E|) reduction, and > there is basically never any need to have a double precision on edges > because they are immutable. > Anyway it's not a strong constraint, double edges would work as well. > > Cheers, > > -- > Gianmarco > > > On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> > wrote: > > > Maja and I have thought of a way we could include this type of > > functionality without it being burdensome of the user. > > > > We could add the following interfaces: > > > > interface StrictRandomAccesVertexEdges extends VertexEdges { > > E getEdgeValue(I) > > } > > > > interface MultiRandomAccessVertexEdges extends VertexEdges { > > Iterable<E> getAllEdgeValues(I) > > } > > > > We could add the following to the Vertex interface: > > > > // Returns the first edge value for a given target vertex id (or null if > > none) > > E getEdgeValue(I targetVertexId) { > > // if we have an instance of StrictRandomAccessVertexEdges, we > delegate; > > // otherwise, we iterate over the edges to find the first one > > } > > > > // Returns an Iterable over all edge values for a given target vertex id > > (only useful for multigraphs) > > Iterable<E> getAllEdgeValues(I targetVertexId) { > > // if we have an instance of MultiRandomAccessVertexEdges, we delegate; > > // otherwise, we wrap the edges into an Iterable that filters for the > > matching ones > > } > > > > This way, only VertexEdges implementations where it makes sense would > > define the random-access methods, and a default (slow!) implementation > > would be available in all other cases. > > > > How does this sound? I could include it in GIRAPH-528. > > > > On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote: > > > > >I don't contest the graph-matrix duality, but I'm not yet convinced that > > >you need random access to the edges for the scenarios you mentioned. > > >PageRank and label propagation, two of the most common applications of > our > > >framework, have indeed a linear algebraic formulation, the bulk of which > > >is always matrix multiplication. The most natural and efficient (if not > > >the only) implementations in our model involve simply iterating over the > > >edges. > > >As for your "dot products" example, maybe it would help if you could be > a > > >bit more specific: what are your vertices and edges? How do we get the > > >"input vectors"?. Everything should be expressed in terms of vertex > > >values, edges, and messages, otherwise it doesn't fit the current Giraph > > >model anyway. > > >In any case, to compute a dot product efficiently you don't need random > > >access to both vectors: if you have the non-zero coordinates of vector A > > >(equivalent of the edge list), and random access to vector B, you > iterate > > >over A and lookup in B. This is efficient. > > > > > >> I'm not sure what this would correspond to in graphland, but I can > > >> certainly see > > >> wanting to have a big in-memory matrix which you can compute dot > > >>products > > >> of vectors with each row of it, and Giraph would do this very > > >>efficiently > > >> (but some > > >> of the implementations of this would assume you had random access to > the > > >> edges (meaning edgeId and edgeWeight of each vertex). > > > > > > > > >Here we are talking about a method with signature > > > > > > EdgeValueType getEdgeValue(VertexIdType targetVertexId) > > > > > >or the multigraph version > > > > > > Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId) > > > > > >I'm looking for examples of algorithms that can't be implemented in > Giraph > > >because of the lack of this method, and would otherwise be a good fit. > > > > > >Having said that, I think there *is* still a case for hash-map-backed > > >edges, and that is when we need to guarantee a "strict graph" (i.e. no > > >parallel edges) or when our algorithm uses remote edge removal requests. > > >So I'm not against having an optimized LongDoubleHashMapEdges (and > > >LongNullHashSetEdges for the unweighted case) included, provided that > it's > > >correct and efficient. > > >I will make sure I include it. Now it's a matter of picking a primitive > > >collections library among HPPC, fastutil, Mahout, Trove, etc... > > > > > >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote: > > > > > >>Sorry to have missed out on some of this conversation - had some work > > >>stuff > > >>interrupt (how dare it!) > > >> > > >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta > > >><[email protected]>wrote: > > >> > > >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote: > > >>> > > >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta > > >>> ><[email protected]>wrote: > > >>> > > > >>> >> It's not like it causes problems, it's just that it's a pretty big > > >>> >> dependency to justify for a small example. > > >>> >> > > >>> >> As for the motivation, if your point is to prove the framework's > > >>> >> superiority in some context, then you can use the simplest > possible > > >>> >> implementation (ArrayList). > > >>> >> > > >>> > > > >>> >This takes up LOTS of memory. Primitives rule, objects drool > > >>>(memory). > > >>> > > >>> Ok, but having to copy the keys and values to external arrays (which > is > > >>> what you have to do with Mahout's hash map) is even worse. A good > > >>> implementation of (long, double) edges (e.g. for RandomWalkVertex) is > > >>> primitive arrays. > > >>> > > >> > > >>So it is certainly true that the *typical* usage of iteration over > Mahout > > >>maps > > >>is by doing this copy of internal values (which is not efficient, no), > > >>it's > > >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p) > > >>passes in a callback which operates directly on the underlying > primitive > > >>arrays. > > >> > > >> > > >>> > > >>> > > > >>> > > > >>> >> The Giraph framework is all about iterating over edges, so an > > >>> >> implementation that doesn't support that with reasonable > efficiency > > >>> >> doesn't make a lot of sense to me. > > >>> >> > > >>> >> It also follows that hash map-based implementations only make > sense > > >>>for > > >>> >> algorithms that make use of mutations. > > >>> >> > > >>> > > > >>> >Agreed, maps aren't absolutely necessary for immutable graphs. But > > >>>random > > >>> >access to collections of integers _is_ necessary for many > algorithms. > > >>> >It's > > >>> >not all just iteration over lists. > > >>> > > >>> Can you give me some concrete examples of algorithms where random > > >>>access > > >>> to the edges is required? > > >>> I'm really interested in this, because I'm considering killing > > >>> getEdgeValue() if there are no use cases. > > >>> > > >> > > >>So I'm not a graph person, I think in terms of matrices, but since > every > > >>graph > > >>has an adjecency matrix, and every matrix is the adjacency matrix of > some > > >>(possibly > > >>bipartite) graph, lets pretend they're basically the same: > > >> > > >>If you load a graph into Giraph, and want to compute the matrix product > > >>of > > >>this > > >>graph with collection of input vectors, then you may want to take each > of > > >>these > > >>input vectors and compute their dot products with the vertices of the > > >>graph > > >>(to > > >>mix my metaphors: each vertex is essentially a row or column of the > > >>matrix), > > >>which can be done by either iterating over the input vector and looking > > >>up > > >>randomly for nonzero entries in the vertex, or the reverse. > > >> > > >>I'm not sure what this would correspond to in graphland, but I can > > >>certainly see > > >>wanting to have a big in-memory matrix which you can compute dot > products > > >>of vectors with each row of it, and Giraph would do this very > efficiently > > >>(but some > > >>of the implementations of this would assume you had random access to > the > > >>edges (meaning edgeId and edgeWeight of each vertex). > > >> > > >> > > >>> > > >>> > > > >>> > > > >>> >> That said, something like a Trove hash map would probably be more > > >>> >> appropriate (more efficient than the standard Java HashMap, at the > > >>> >>expense > > >>> >> of generality). That could be a good candidate for a > > >>> >> LongDoubleHashMapEdges implementation. > > >>> >> I can give that a shot if it sounds good. > > >>> >> > > >>> > > > >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. > > >>> > > >>> Whoops, totally missed that part. > > >>> > > >>> > > > >>> >What does Giraph depend on of Mahout now? Just mahout-collections? > > >>> >That's > > >>> >not a very big dependency, and has all sorts of > primitive-to-primitive > > >>> >collections, > > >>> >and from what I've seen benchmarked, just as good or better than > > >>>Trove. > > >>> > Carrot2's > > >>> >hppc may be better yet, but I'm not sure if that is stably released > > >>>yet. > > >>> > > >>> I'll take a look at HPPC and Mahout collections then. They all seem > to > > >>> provide the same stuff, so I'll consider benchmarks and convenience > of > > >>>the > > >>> API. Thanks for the pointers. > > >>> > > >>> > > > >>> > > > >>> >> > > >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote: > > >>> >> > > >>> >> >Is the mahout dependency causing problems? > > >>> >> > > > >>> >> >It would be nice if we could actually implement some of the > > >>>algorithms > > >>> >> >that > > >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off > > >>>how > > >>> >>it > > >>> >> >improves things. Using the Mahout primitives can show that it's > > >>>not > > >>> >>about > > >>> >> >the inner loop implementation, but the framework itself... > > >>> >> > > > >>> >> > > > >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman > > >>> >> ><[email protected]>wrote: > > >>> >> > > > >>> >> >> I like the idea of refactoring it into something more > appropriate > > >>> >>for us > > >>> >> >> and ditching the Mahout dep. Good looking out. > > >>> >> >> > > >>> >> >> > > >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < > > >>> >> >> [email protected]> wrote: > > >>> >> >> > > >>> >> >> > I agree, at this point we could have a RandomWalkVertex with > > >>>edge > > >>> >> >>values, > > >>> >> >> > and a "null-edged" vertex for the PR benchmarks. > > >>> >> >> > We make everybody happy and avoid code duplication. > > >>> >> >> > > > >>> >> >> > > > >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta > > >>> >><[email protected] > > >>> >> >> > >wrote: > > >>> >> >> > > > >>> >> >> > > Hi Gianmarco, > > >>> >> >> > > > > >>> >> >> > > Yes, there will be more efficient implementations. > > >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be > > >>>only > > >>> >>one > > >>> >> >> > Vertex > > >>> >> >> > > class and edge storage is delegated to a VertexEdges class. > > >>> >> >> > > So far I'm adding some generic implementations > > >>>(ByteArrayEdges, > > >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and > > >>>some > > >>> >> >> optimized > > >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). > > >>> >> >> > > > > >>> >> >> > > Do you specifically need edge values to be float while the > > >>>other > > >>> >> >>types > > >>> >> >> > are > > >>> >> >> > > double? > > >>> >> >> > > It seems to me it would make sense to change > RandomWalkVertex > > >>>to > > >>> >>use > > >>> >> >> > > double edge values instead, and avoid code duplication > (i.e. > > >>> >>adding > > >>> >> >>a > > >>> >> >> > > LongFloatArrayEdges that's basically the same). We're not > > >>>Trove > > >>> >> >>after > > >>> >> >> > all. > > >>> >> >> > > Makes sense? > > >>> >> >> > > > > >>> >> >> > > Thanks for the feedback, > > >>> >> >> > > > > >>> >> >> > > Alessandro > > >>> >> >> > > > > >>> >> >> > > > > >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" > > >>> >> >><[email protected]> > > >>> >> >> > > wrote: > > >>> >> >> > > > > >>> >> >> > > >Hi, > > >>> >> >> > > > > > >>> >> >> > > >Maybe the specific implementation can be thrown away, but > > >>> >> >>personally I > > >>> >> >> > > >feel > > >>> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble > > >>> >>vertex. > > >>> >> >> > > >It's the base for any serious random walk algorithm. > > >>> >> >> > > > > > >>> >> >> > > >I would call for a refactoring rather than a removal. > > >>> >> >> > > > > > >>> >> >> > > >Just my 2c. > > >>> >> >> > > > > > >>> >> >> > > >Cheers, > > >>> >> >> > > > > > >>> >> >> > > >-- > > >>> >> >> > > >Gianmarco > > >>> >> >> > > > > > >>> >> >> > > > > > >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta > > >>> >> >> > > ><[email protected]>wrote: > > >>> >> >> > > > > > >>> >> >> > > >> Hi all, > > >>> >> >> > > >> > > >>> >> >> > > >> Does anyone feel strongly for > LongDoubleFloatDoubleVertex? > > >>> >> >> > > >> Reasons why I think it should be removed: > > >>> >> >> > > >> > > >>> >> >> > > >> 1. Right now it's incorrect (returns target vertex id > > >>>as > > >>> >>edge > > >>> >> >> > value). > > >>> >> >> > > >> 2. Iteration will always be inefficient, since the > > >>> >>underlying > > >>> >> >> > Mahout > > >>> >> >> > > >> open-addressing hash map implementation doesn't provide > > >>> >> >>iterators. > > >>> >> >> It > > >>> >> >> > > >> provides a way to copy the keys and values to external > > >>> >> >>arrays/lists. > > >>> >> >> > > >> 3. It's the only reason why we have Mahout as a > > >>>dependency. > > >>> >> >> > > >> > > >>> >> >> > > >> I think we should strive to provide model > implementations > > >>>that > > >>> >> >>are > > >>> >> >> > > >>generic > > >>> >> >> > > >> and/or extremely efficient. This one satisfies neither. > > >>> >> >> > > >> > > >>> >> >> > > >> Thanks, > > >>> >> >> > > >> > > >>> >> >> > > >> Alessandro > > >>> >> >> > > >> > > >>> >> >> > > > > >>> >> >> > > > > >>> >> >> > > > >>> >> >> > > > >>> >> >> > -- > > >>> >> >> > Claudio Martella > > >>> >> >> > [email protected] > > >>> >> >> > > > >>> >> >> > > >>> >> > > > >>> >> > > > >>> >> > > > >>> >> >-- > > >>> >> > > > >>> >> > -jake > > >>> >> > > >>> >> > > >>> > > > >>> > > > >>> >-- > > >>> > > > >>> > -jake > > >>> > > >>> > > >> > > >> > > >>-- > > >> > > >> -jake > > > > > > > > -- Claudio Martella [email protected]
