We could *shudder* write templates for basic abstract primitive Vertex classes (Abstract$KeyType$ValueTypeVertex.java.t ...) a la Mahout collections...
On Mon, Mar 4, 2013 at 3:03 PM, Claudio Martella <[email protected] > wrote: > I guess we could do some crazy reflection code and get the types as > parameters... i also do feel that sometimes Long ids is overkill, for > example. > > > On Mon, Mar 4, 2013 at 11:52 PM, Alessandro Presta <[email protected] > >wrote: > > > > > > > On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <[email protected]> > > wrote: > > > > >The interface you propose sounds very reasonable. +1 for that. > > > > > >For floats vs double, they consume much less memory, O(|E|) reduction, > and > > >there is basically never any need to have a double precision on edges > > >because they are immutable. > > >Anyway it's not a strong constraint, double edges would work as well. > > > > Sure, makes sense. The question here is whether we should be providing > all > > possible combinations (basically duplicating a lot of code), or let the > > user copy our model implementations. > > If there is enough interest, we can always add all combinations of > > (Long/Int, Double/Float/Null, Array/HashMap), although it would be tiring > > and still not comprehensive (what about String ids?). > > > > > > > >Cheers, > > > > > >-- > > >Gianmarco > > > > > > > > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> > > >wrote: > > > > > >> Maja and I have thought of a way we could include this type of > > >> functionality without it being burdensome of the user. > > >> > > >> We could add the following interfaces: > > >> > > >> interface StrictRandomAccesVertexEdges extends VertexEdges { > > >> E getEdgeValue(I) > > >> } > > >> > > >> interface MultiRandomAccessVertexEdges extends VertexEdges { > > >> Iterable<E> getAllEdgeValues(I) > > >> } > > >> > > >> We could add the following to the Vertex interface: > > >> > > >> // Returns the first edge value for a given target vertex id (or null > if > > >> none) > > >> E getEdgeValue(I targetVertexId) { > > >> // if we have an instance of StrictRandomAccessVertexEdges, we > > >>delegate; > > >> // otherwise, we iterate over the edges to find the first one > > >> } > > >> > > >> // Returns an Iterable over all edge values for a given target vertex > id > > >> (only useful for multigraphs) > > >> Iterable<E> getAllEdgeValues(I targetVertexId) { > > >> // if we have an instance of MultiRandomAccessVertexEdges, we > > >>delegate; > > >> // otherwise, we wrap the edges into an Iterable that filters for > the > > >> matching ones > > >> } > > >> > > >> This way, only VertexEdges implementations where it makes sense would > > >> define the random-access methods, and a default (slow!) implementation > > >> would be available in all other cases. > > >> > > >> How does this sound? I could include it in GIRAPH-528. > > >> > > >> On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote: > > >> > > >> >I don't contest the graph-matrix duality, but I'm not yet convinced > > >>that > > >> >you need random access to the edges for the scenarios you mentioned. > > >> >PageRank and label propagation, two of the most common applications > of > > >>our > > >> >framework, have indeed a linear algebraic formulation, the bulk of > > >>which > > >> >is always matrix multiplication. The most natural and efficient (if > not > > >> >the only) implementations in our model involve simply iterating over > > >>the > > >> >edges. > > >> >As for your "dot products" example, maybe it would help if you could > > >>be a > > >> >bit more specific: what are your vertices and edges? How do we get > the > > >> >"input vectors"?. Everything should be expressed in terms of vertex > > >> >values, edges, and messages, otherwise it doesn't fit the current > > >>Giraph > > >> >model anyway. > > >> >In any case, to compute a dot product efficiently you don't need > random > > >> >access to both vectors: if you have the non-zero coordinates of > vector > > >>A > > >> >(equivalent of the edge list), and random access to vector B, you > > >>iterate > > >> >over A and lookup in B. This is efficient. > > >> > > > >> >> I'm not sure what this would correspond to in graphland, but I can > > >> >> certainly see > > >> >> wanting to have a big in-memory matrix which you can compute dot > > >> >>products > > >> >> of vectors with each row of it, and Giraph would do this very > > >> >>efficiently > > >> >> (but some > > >> >> of the implementations of this would assume you had random access > to > > >>the > > >> >> edges (meaning edgeId and edgeWeight of each vertex). > > >> > > > >> > > > >> >Here we are talking about a method with signature > > >> > > > >> > EdgeValueType getEdgeValue(VertexIdType targetVertexId) > > >> > > > >> >or the multigraph version > > >> > > > >> > Iterable<EdgeValueType> getEdgeValues(VertexIdType > targetVertexId) > > >> > > > >> >I'm looking for examples of algorithms that can't be implemented in > > >>Giraph > > >> >because of the lack of this method, and would otherwise be a good > fit. > > >> > > > >> >Having said that, I think there *is* still a case for hash-map-backed > > >> >edges, and that is when we need to guarantee a "strict graph" (i.e. > no > > >> >parallel edges) or when our algorithm uses remote edge removal > > >>requests. > > >> >So I'm not against having an optimized LongDoubleHashMapEdges (and > > >> >LongNullHashSetEdges for the unweighted case) included, provided that > > >>it's > > >> >correct and efficient. > > >> >I will make sure I include it. Now it's a matter of picking a > primitive > > >> >collections library among HPPC, fastutil, Mahout, Trove, etc... > > >> > > > >> >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote: > > >> > > > >> >>Sorry to have missed out on some of this conversation - had some > work > > >> >>stuff > > >> >>interrupt (how dare it!) > > >> >> > > >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta > > >> >><[email protected]>wrote: > > >> >> > > >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote: > > >> >>> > > >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta > > >> >>> ><[email protected]>wrote: > > >> >>> > > > >> >>> >> It's not like it causes problems, it's just that it's a pretty > > >>big > > >> >>> >> dependency to justify for a small example. > > >> >>> >> > > >> >>> >> As for the motivation, if your point is to prove the > framework's > > >> >>> >> superiority in some context, then you can use the simplest > > >>possible > > >> >>> >> implementation (ArrayList). > > >> >>> >> > > >> >>> > > > >> >>> >This takes up LOTS of memory. Primitives rule, objects drool > > >> >>>(memory). > > >> >>> > > >> >>> Ok, but having to copy the keys and values to external arrays > > >>(which is > > >> >>> what you have to do with Mahout's hash map) is even worse. A good > > >> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex) > > >>is > > >> >>> primitive arrays. > > >> >>> > > >> >> > > >> >>So it is certainly true that the *typical* usage of iteration over > > >>Mahout > > >> >>maps > > >> >>is by doing this copy of internal values (which is not efficient, > no), > > >> >>it's > > >> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure > p) > > >> >>passes in a callback which operates directly on the underlying > > >>primitive > > >> >>arrays. > > >> >> > > >> >> > > >> >>> > > >> >>> > > > >> >>> > > > >> >>> >> The Giraph framework is all about iterating over edges, so an > > >> >>> >> implementation that doesn't support that with reasonable > > >>efficiency > > >> >>> >> doesn't make a lot of sense to me. > > >> >>> >> > > >> >>> >> It also follows that hash map-based implementations only make > > >>sense > > >> >>>for > > >> >>> >> algorithms that make use of mutations. > > >> >>> >> > > >> >>> > > > >> >>> >Agreed, maps aren't absolutely necessary for immutable graphs. > But > > >> >>>random > > >> >>> >access to collections of integers _is_ necessary for many > > >>algorithms. > > >> >>> >It's > > >> >>> >not all just iteration over lists. > > >> >>> > > >> >>> Can you give me some concrete examples of algorithms where random > > >> >>>access > > >> >>> to the edges is required? > > >> >>> I'm really interested in this, because I'm considering killing > > >> >>> getEdgeValue() if there are no use cases. > > >> >>> > > >> >> > > >> >>So I'm not a graph person, I think in terms of matrices, but since > > >>every > > >> >>graph > > >> >>has an adjecency matrix, and every matrix is the adjacency matrix of > > >>some > > >> >>(possibly > > >> >>bipartite) graph, lets pretend they're basically the same: > > >> >> > > >> >>If you load a graph into Giraph, and want to compute the matrix > > >>product > > >> >>of > > >> >>this > > >> >>graph with collection of input vectors, then you may want to take > > >>each of > > >> >>these > > >> >>input vectors and compute their dot products with the vertices of > the > > >> >>graph > > >> >>(to > > >> >>mix my metaphors: each vertex is essentially a row or column of the > > >> >>matrix), > > >> >>which can be done by either iterating over the input vector and > > >>looking > > >> >>up > > >> >>randomly for nonzero entries in the vertex, or the reverse. > > >> >> > > >> >>I'm not sure what this would correspond to in graphland, but I can > > >> >>certainly see > > >> >>wanting to have a big in-memory matrix which you can compute dot > > >>products > > >> >>of vectors with each row of it, and Giraph would do this very > > >>efficiently > > >> >>(but some > > >> >>of the implementations of this would assume you had random access to > > >>the > > >> >>edges (meaning edgeId and edgeWeight of each vertex). > > >> >> > > >> >> > > >> >>> > > >> >>> > > > >> >>> > > > >> >>> >> That said, something like a Trove hash map would probably be > more > > >> >>> >> appropriate (more efficient than the standard Java HashMap, at > > >>the > > >> >>> >>expense > > >> >>> >> of generality). That could be a good candidate for a > > >> >>> >> LongDoubleHashMapEdges implementation. > > >> >>> >> I can give that a shot if it sounds good. > > >> >>> >> > > >> >>> > > > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. > > >> >>> > > >> >>> Whoops, totally missed that part. > > >> >>> > > >> >>> > > > >> >>> >What does Giraph depend on of Mahout now? Just > mahout-collections? > > >> >>> >That's > > >> >>> >not a very big dependency, and has all sorts of > > >>primitive-to-primitive > > >> >>> >collections, > > >> >>> >and from what I've seen benchmarked, just as good or better than > > >> >>>Trove. > > >> >>> > Carrot2's > > >> >>> >hppc may be better yet, but I'm not sure if that is stably > released > > >> >>>yet. > > >> >>> > > >> >>> I'll take a look at HPPC and Mahout collections then. They all > seem > > >>to > > >> >>> provide the same stuff, so I'll consider benchmarks and > convenience > > >>of > > >> >>>the > > >> >>> API. Thanks for the pointers. > > >> >>> > > >> >>> > > > >> >>> > > > >> >>> >> > > >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> > wrote: > > >> >>> >> > > >> >>> >> >Is the mahout dependency causing problems? > > >> >>> >> > > > >> >>> >> >It would be nice if we could actually implement some of the > > >> >>>algorithms > > >> >>> >> >that > > >> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show > > >>off > > >> >>>how > > >> >>> >>it > > >> >>> >> >improves things. Using the Mahout primitives can show that > it's > > >> >>>not > > >> >>> >>about > > >> >>> >> >the inner loop implementation, but the framework itself... > > >> >>> >> > > > >> >>> >> > > > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman > > >> >>> >> ><[email protected]>wrote: > > >> >>> >> > > > >> >>> >> >> I like the idea of refactoring it into something more > > >>appropriate > > >> >>> >>for us > > >> >>> >> >> and ditching the Mahout dep. Good looking out. > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < > > >> >>> >> >> [email protected]> wrote: > > >> >>> >> >> > > >> >>> >> >> > I agree, at this point we could have a RandomWalkVertex > with > > >> >>>edge > > >> >>> >> >>values, > > >> >>> >> >> > and a "null-edged" vertex for the PR benchmarks. > > >> >>> >> >> > We make everybody happy and avoid code duplication. > > >> >>> >> >> > > > >> >>> >> >> > > > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta > > >> >>> >><[email protected] > > >> >>> >> >> > >wrote: > > >> >>> >> >> > > > >> >>> >> >> > > Hi Gianmarco, > > >> >>> >> >> > > > > >> >>> >> >> > > Yes, there will be more efficient implementations. > > >> >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will > be > > >> >>>only > > >> >>> >>one > > >> >>> >> >> > Vertex > > >> >>> >> >> > > class and edge storage is delegated to a VertexEdges > > >>class. > > >> >>> >> >> > > So far I'm adding some generic implementations > > >> >>>(ByteArrayEdges, > > >> >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, > and > > >> >>>some > > >> >>> >> >> optimized > > >> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). > > >> >>> >> >> > > > > >> >>> >> >> > > Do you specifically need edge values to be float while > the > > >> >>>other > > >> >>> >> >>types > > >> >>> >> >> > are > > >> >>> >> >> > > double? > > >> >>> >> >> > > It seems to me it would make sense to change > > >>RandomWalkVertex > > >> >>>to > > >> >>> >>use > > >> >>> >> >> > > double edge values instead, and avoid code duplication > > >>(i.e. > > >> >>> >>adding > > >> >>> >> >>a > > >> >>> >> >> > > LongFloatArrayEdges that's basically the same). We're > not > > >> >>>Trove > > >> >>> >> >>after > > >> >>> >> >> > all. > > >> >>> >> >> > > Makes sense? > > >> >>> >> >> > > > > >> >>> >> >> > > Thanks for the feedback, > > >> >>> >> >> > > > > >> >>> >> >> > > Alessandro > > >> >>> >> >> > > > > >> >>> >> >> > > > > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" > > >> >>> >> >><[email protected]> > > >> >>> >> >> > > wrote: > > >> >>> >> >> > > > > >> >>> >> >> > > >Hi, > > >> >>> >> >> > > > > > >> >>> >> >> > > >Maybe the specific implementation can be thrown away, > but > > >> >>> >> >>personally I > > >> >>> >> >> > > >feel > > >> >>> >> >> > > >very strongly for the need of a good > > >>LongDoubleFloatDouble > > >> >>> >>vertex. > > >> >>> >> >> > > >It's the base for any serious random walk algorithm. > > >> >>> >> >> > > > > > >> >>> >> >> > > >I would call for a refactoring rather than a removal. > > >> >>> >> >> > > > > > >> >>> >> >> > > >Just my 2c. > > >> >>> >> >> > > > > > >> >>> >> >> > > >Cheers, > > >> >>> >> >> > > > > > >> >>> >> >> > > >-- > > >> >>> >> >> > > >Gianmarco > > >> >>> >> >> > > > > > >> >>> >> >> > > > > > >> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta > > >> >>> >> >> > > ><[email protected]>wrote: > > >> >>> >> >> > > > > > >> >>> >> >> > > >> Hi all, > > >> >>> >> >> > > >> > > >> >>> >> >> > > >> Does anyone feel strongly for > > >>LongDoubleFloatDoubleVertex? > > >> >>> >> >> > > >> Reasons why I think it should be removed: > > >> >>> >> >> > > >> > > >> >>> >> >> > > >> 1. Right now it's incorrect (returns target vertex > > >>id > > >> >>>as > > >> >>> >>edge > > >> >>> >> >> > value). > > >> >>> >> >> > > >> 2. Iteration will always be inefficient, since the > > >> >>> >>underlying > > >> >>> >> >> > Mahout > > >> >>> >> >> > > >> open-addressing hash map implementation doesn't > provide > > >> >>> >> >>iterators. > > >> >>> >> >> It > > >> >>> >> >> > > >> provides a way to copy the keys and values to > external > > >> >>> >> >>arrays/lists. > > >> >>> >> >> > > >> 3. It's the only reason why we have Mahout as a > > >> >>>dependency. > > >> >>> >> >> > > >> > > >> >>> >> >> > > >> I think we should strive to provide model > > >>implementations > > >> >>>that > > >> >>> >> >>are > > >> >>> >> >> > > >>generic > > >> >>> >> >> > > >> and/or extremely efficient. This one satisfies > neither. > > >> >>> >> >> > > >> > > >> >>> >> >> > > >> Thanks, > > >> >>> >> >> > > >> > > >> >>> >> >> > > >> Alessandro > > >> >>> >> >> > > >> > > >> >>> >> >> > > > > >> >>> >> >> > > > > >> >>> >> >> > > > >> >>> >> >> > > > >> >>> >> >> > -- > > >> >>> >> >> > Claudio Martella > > >> >>> >> >> > [email protected] > > >> >>> >> >> > > > >> >>> >> >> > > >> >>> >> > > > >> >>> >> > > > >> >>> >> > > > >> >>> >> >-- > > >> >>> >> > > > >> >>> >> > -jake > > >> >>> >> > > >> >>> >> > > >> >>> > > > >> >>> > > > >> >>> >-- > > >> >>> > > > >> >>> > -jake > > >> >>> > > >> >>> > > >> >> > > >> >> > > >> >>-- > > >> >> > > >> >> -jake > > >> > > > >> > > >> > > > > > > > -- > Claudio Martella > [email protected] > -- -jake
