On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <[email protected]> wrote:
>The interface you propose sounds very reasonable. +1 for that. > >For floats vs double, they consume much less memory, O(|E|) reduction, and >there is basically never any need to have a double precision on edges >because they are immutable. >Anyway it's not a strong constraint, double edges would work as well. Sure, makes sense. The question here is whether we should be providing all possible combinations (basically duplicating a lot of code), or let the user copy our model implementations. If there is enough interest, we can always add all combinations of (Long/Int, Double/Float/Null, Array/HashMap), although it would be tiring and still not comprehensive (what about String ids?). > >Cheers, > >-- >Gianmarco > > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> >wrote: > >> Maja and I have thought of a way we could include this type of >> functionality without it being burdensome of the user. >> >> We could add the following interfaces: >> >> interface StrictRandomAccesVertexEdges extends VertexEdges { >> E getEdgeValue(I) >> } >> >> interface MultiRandomAccessVertexEdges extends VertexEdges { >> Iterable<E> getAllEdgeValues(I) >> } >> >> We could add the following to the Vertex interface: >> >> // Returns the first edge value for a given target vertex id (or null if >> none) >> E getEdgeValue(I targetVertexId) { >> // if we have an instance of StrictRandomAccessVertexEdges, we >>delegate; >> // otherwise, we iterate over the edges to find the first one >> } >> >> // Returns an Iterable over all edge values for a given target vertex id >> (only useful for multigraphs) >> Iterable<E> getAllEdgeValues(I targetVertexId) { >> // if we have an instance of MultiRandomAccessVertexEdges, we >>delegate; >> // otherwise, we wrap the edges into an Iterable that filters for the >> matching ones >> } >> >> This way, only VertexEdges implementations where it makes sense would >> define the random-access methods, and a default (slow!) implementation >> would be available in all other cases. >> >> How does this sound? I could include it in GIRAPH-528. >> >> On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote: >> >> >I don't contest the graph-matrix duality, but I'm not yet convinced >>that >> >you need random access to the edges for the scenarios you mentioned. >> >PageRank and label propagation, two of the most common applications of >>our >> >framework, have indeed a linear algebraic formulation, the bulk of >>which >> >is always matrix multiplication. The most natural and efficient (if not >> >the only) implementations in our model involve simply iterating over >>the >> >edges. >> >As for your "dot products" example, maybe it would help if you could >>be a >> >bit more specific: what are your vertices and edges? How do we get the >> >"input vectors"?. Everything should be expressed in terms of vertex >> >values, edges, and messages, otherwise it doesn't fit the current >>Giraph >> >model anyway. >> >In any case, to compute a dot product efficiently you don't need random >> >access to both vectors: if you have the non-zero coordinates of vector >>A >> >(equivalent of the edge list), and random access to vector B, you >>iterate >> >over A and lookup in B. This is efficient. >> > >> >> I'm not sure what this would correspond to in graphland, but I can >> >> certainly see >> >> wanting to have a big in-memory matrix which you can compute dot >> >>products >> >> of vectors with each row of it, and Giraph would do this very >> >>efficiently >> >> (but some >> >> of the implementations of this would assume you had random access to >>the >> >> edges (meaning edgeId and edgeWeight of each vertex). >> > >> > >> >Here we are talking about a method with signature >> > >> > EdgeValueType getEdgeValue(VertexIdType targetVertexId) >> > >> >or the multigraph version >> > >> > Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId) >> > >> >I'm looking for examples of algorithms that can't be implemented in >>Giraph >> >because of the lack of this method, and would otherwise be a good fit. >> > >> >Having said that, I think there *is* still a case for hash-map-backed >> >edges, and that is when we need to guarantee a "strict graph" (i.e. no >> >parallel edges) or when our algorithm uses remote edge removal >>requests. >> >So I'm not against having an optimized LongDoubleHashMapEdges (and >> >LongNullHashSetEdges for the unweighted case) included, provided that >>it's >> >correct and efficient. >> >I will make sure I include it. Now it's a matter of picking a primitive >> >collections library among HPPC, fastutil, Mahout, Trove, etc... >> > >> >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote: >> > >> >>Sorry to have missed out on some of this conversation - had some work >> >>stuff >> >>interrupt (how dare it!) >> >> >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta >> >><[email protected]>wrote: >> >> >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote: >> >>> >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta >> >>> ><[email protected]>wrote: >> >>> > >> >>> >> It's not like it causes problems, it's just that it's a pretty >>big >> >>> >> dependency to justify for a small example. >> >>> >> >> >>> >> As for the motivation, if your point is to prove the framework's >> >>> >> superiority in some context, then you can use the simplest >>possible >> >>> >> implementation (ArrayList). >> >>> >> >> >>> > >> >>> >This takes up LOTS of memory. Primitives rule, objects drool >> >>>(memory). >> >>> >> >>> Ok, but having to copy the keys and values to external arrays >>(which is >> >>> what you have to do with Mahout's hash map) is even worse. A good >> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex) >>is >> >>> primitive arrays. >> >>> >> >> >> >>So it is certainly true that the *typical* usage of iteration over >>Mahout >> >>maps >> >>is by doing this copy of internal values (which is not efficient, no), >> >>it's >> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p) >> >>passes in a callback which operates directly on the underlying >>primitive >> >>arrays. >> >> >> >> >> >>> >> >>> > >> >>> > >> >>> >> The Giraph framework is all about iterating over edges, so an >> >>> >> implementation that doesn't support that with reasonable >>efficiency >> >>> >> doesn't make a lot of sense to me. >> >>> >> >> >>> >> It also follows that hash map-based implementations only make >>sense >> >>>for >> >>> >> algorithms that make use of mutations. >> >>> >> >> >>> > >> >>> >Agreed, maps aren't absolutely necessary for immutable graphs. But >> >>>random >> >>> >access to collections of integers _is_ necessary for many >>algorithms. >> >>> >It's >> >>> >not all just iteration over lists. >> >>> >> >>> Can you give me some concrete examples of algorithms where random >> >>>access >> >>> to the edges is required? >> >>> I'm really interested in this, because I'm considering killing >> >>> getEdgeValue() if there are no use cases. >> >>> >> >> >> >>So I'm not a graph person, I think in terms of matrices, but since >>every >> >>graph >> >>has an adjecency matrix, and every matrix is the adjacency matrix of >>some >> >>(possibly >> >>bipartite) graph, lets pretend they're basically the same: >> >> >> >>If you load a graph into Giraph, and want to compute the matrix >>product >> >>of >> >>this >> >>graph with collection of input vectors, then you may want to take >>each of >> >>these >> >>input vectors and compute their dot products with the vertices of the >> >>graph >> >>(to >> >>mix my metaphors: each vertex is essentially a row or column of the >> >>matrix), >> >>which can be done by either iterating over the input vector and >>looking >> >>up >> >>randomly for nonzero entries in the vertex, or the reverse. >> >> >> >>I'm not sure what this would correspond to in graphland, but I can >> >>certainly see >> >>wanting to have a big in-memory matrix which you can compute dot >>products >> >>of vectors with each row of it, and Giraph would do this very >>efficiently >> >>(but some >> >>of the implementations of this would assume you had random access to >>the >> >>edges (meaning edgeId and edgeWeight of each vertex). >> >> >> >> >> >>> >> >>> > >> >>> > >> >>> >> That said, something like a Trove hash map would probably be more >> >>> >> appropriate (more efficient than the standard Java HashMap, at >>the >> >>> >>expense >> >>> >> of generality). That could be a good candidate for a >> >>> >> LongDoubleHashMapEdges implementation. >> >>> >> I can give that a shot if it sounds good. >> >>> >> >> >>> > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. >> >>> >> >>> Whoops, totally missed that part. >> >>> >> >>> > >> >>> >What does Giraph depend on of Mahout now? Just mahout-collections? >> >>> >That's >> >>> >not a very big dependency, and has all sorts of >>primitive-to-primitive >> >>> >collections, >> >>> >and from what I've seen benchmarked, just as good or better than >> >>>Trove. >> >>> > Carrot2's >> >>> >hppc may be better yet, but I'm not sure if that is stably released >> >>>yet. >> >>> >> >>> I'll take a look at HPPC and Mahout collections then. They all seem >>to >> >>> provide the same stuff, so I'll consider benchmarks and convenience >>of >> >>>the >> >>> API. Thanks for the pointers. >> >>> >> >>> > >> >>> > >> >>> >> >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote: >> >>> >> >> >>> >> >Is the mahout dependency causing problems? >> >>> >> > >> >>> >> >It would be nice if we could actually implement some of the >> >>>algorithms >> >>> >> >that >> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show >>off >> >>>how >> >>> >>it >> >>> >> >improves things. Using the Mahout primitives can show that it's >> >>>not >> >>> >>about >> >>> >> >the inner loop implementation, but the framework itself... >> >>> >> > >> >>> >> > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman >> >>> >> ><[email protected]>wrote: >> >>> >> > >> >>> >> >> I like the idea of refactoring it into something more >>appropriate >> >>> >>for us >> >>> >> >> and ditching the Mahout dep. Good looking out. >> >>> >> >> >> >>> >> >> >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < >> >>> >> >> [email protected]> wrote: >> >>> >> >> >> >>> >> >> > I agree, at this point we could have a RandomWalkVertex with >> >>>edge >> >>> >> >>values, >> >>> >> >> > and a "null-edged" vertex for the PR benchmarks. >> >>> >> >> > We make everybody happy and avoid code duplication. >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta >> >>> >><[email protected] >> >>> >> >> > >wrote: >> >>> >> >> > >> >>> >> >> > > Hi Gianmarco, >> >>> >> >> > > >> >>> >> >> > > Yes, there will be more efficient implementations. >> >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be >> >>>only >> >>> >>one >> >>> >> >> > Vertex >> >>> >> >> > > class and edge storage is delegated to a VertexEdges >>class. >> >>> >> >> > > So far I'm adding some generic implementations >> >>>(ByteArrayEdges, >> >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and >> >>>some >> >>> >> >> optimized >> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). >> >>> >> >> > > >> >>> >> >> > > Do you specifically need edge values to be float while the >> >>>other >> >>> >> >>types >> >>> >> >> > are >> >>> >> >> > > double? >> >>> >> >> > > It seems to me it would make sense to change >>RandomWalkVertex >> >>>to >> >>> >>use >> >>> >> >> > > double edge values instead, and avoid code duplication >>(i.e. >> >>> >>adding >> >>> >> >>a >> >>> >> >> > > LongFloatArrayEdges that's basically the same). We're not >> >>>Trove >> >>> >> >>after >> >>> >> >> > all. >> >>> >> >> > > Makes sense? >> >>> >> >> > > >> >>> >> >> > > Thanks for the feedback, >> >>> >> >> > > >> >>> >> >> > > Alessandro >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" >> >>> >> >><[email protected]> >> >>> >> >> > > wrote: >> >>> >> >> > > >> >>> >> >> > > >Hi, >> >>> >> >> > > > >> >>> >> >> > > >Maybe the specific implementation can be thrown away, but >> >>> >> >>personally I >> >>> >> >> > > >feel >> >>> >> >> > > >very strongly for the need of a good >>LongDoubleFloatDouble >> >>> >>vertex. >> >>> >> >> > > >It's the base for any serious random walk algorithm. >> >>> >> >> > > > >> >>> >> >> > > >I would call for a refactoring rather than a removal. >> >>> >> >> > > > >> >>> >> >> > > >Just my 2c. >> >>> >> >> > > > >> >>> >> >> > > >Cheers, >> >>> >> >> > > > >> >>> >> >> > > >-- >> >>> >> >> > > >Gianmarco >> >>> >> >> > > > >> >>> >> >> > > > >> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta >> >>> >> >> > > ><[email protected]>wrote: >> >>> >> >> > > > >> >>> >> >> > > >> Hi all, >> >>> >> >> > > >> >> >>> >> >> > > >> Does anyone feel strongly for >>LongDoubleFloatDoubleVertex? >> >>> >> >> > > >> Reasons why I think it should be removed: >> >>> >> >> > > >> >> >>> >> >> > > >> 1. Right now it's incorrect (returns target vertex >>id >> >>>as >> >>> >>edge >> >>> >> >> > value). >> >>> >> >> > > >> 2. Iteration will always be inefficient, since the >> >>> >>underlying >> >>> >> >> > Mahout >> >>> >> >> > > >> open-addressing hash map implementation doesn't provide >> >>> >> >>iterators. >> >>> >> >> It >> >>> >> >> > > >> provides a way to copy the keys and values to external >> >>> >> >>arrays/lists. >> >>> >> >> > > >> 3. It's the only reason why we have Mahout as a >> >>>dependency. >> >>> >> >> > > >> >> >>> >> >> > > >> I think we should strive to provide model >>implementations >> >>>that >> >>> >> >>are >> >>> >> >> > > >>generic >> >>> >> >> > > >> and/or extremely efficient. This one satisfies neither. >> >>> >> >> > > >> >> >>> >> >> > > >> Thanks, >> >>> >> >> > > >> >> >>> >> >> > > >> Alessandro >> >>> >> >> > > >> >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> > -- >> >>> >> >> > Claudio Martella >> >>> >> >> > [email protected] >> >>> >> >> > >> >>> >> >> >> >>> >> > >> >>> >> > >> >>> >> > >> >>> >> >-- >> >>> >> > >> >>> >> > -jake >> >>> >> >> >>> >> >> >>> > >> >>> > >> >>> >-- >> >>> > >> >>> > -jake >> >>> >> >>> >> >> >> >> >> >>-- >> >> >> >> -jake >> > >> >>
