I guess we could do some crazy reflection code and get the types as parameters... i also do feel that sometimes Long ids is overkill, for example.
On Mon, Mar 4, 2013 at 11:52 PM, Alessandro Presta <[email protected]>wrote: > > > On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <[email protected]> > wrote: > > >The interface you propose sounds very reasonable. +1 for that. > > > >For floats vs double, they consume much less memory, O(|E|) reduction, and > >there is basically never any need to have a double precision on edges > >because they are immutable. > >Anyway it's not a strong constraint, double edges would work as well. > > Sure, makes sense. The question here is whether we should be providing all > possible combinations (basically duplicating a lot of code), or let the > user copy our model implementations. > If there is enough interest, we can always add all combinations of > (Long/Int, Double/Float/Null, Array/HashMap), although it would be tiring > and still not comprehensive (what about String ids?). > > > > >Cheers, > > > >-- > >Gianmarco > > > > > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> > >wrote: > > > >> Maja and I have thought of a way we could include this type of > >> functionality without it being burdensome of the user. > >> > >> We could add the following interfaces: > >> > >> interface StrictRandomAccesVertexEdges extends VertexEdges { > >> E getEdgeValue(I) > >> } > >> > >> interface MultiRandomAccessVertexEdges extends VertexEdges { > >> Iterable<E> getAllEdgeValues(I) > >> } > >> > >> We could add the following to the Vertex interface: > >> > >> // Returns the first edge value for a given target vertex id (or null if > >> none) > >> E getEdgeValue(I targetVertexId) { > >> // if we have an instance of StrictRandomAccessVertexEdges, we > >>delegate; > >> // otherwise, we iterate over the edges to find the first one > >> } > >> > >> // Returns an Iterable over all edge values for a given target vertex id > >> (only useful for multigraphs) > >> Iterable<E> getAllEdgeValues(I targetVertexId) { > >> // if we have an instance of MultiRandomAccessVertexEdges, we > >>delegate; > >> // otherwise, we wrap the edges into an Iterable that filters for the > >> matching ones > >> } > >> > >> This way, only VertexEdges implementations where it makes sense would > >> define the random-access methods, and a default (slow!) implementation > >> would be available in all other cases. > >> > >> How does this sound? I could include it in GIRAPH-528. > >> > >> On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote: > >> > >> >I don't contest the graph-matrix duality, but I'm not yet convinced > >>that > >> >you need random access to the edges for the scenarios you mentioned. > >> >PageRank and label propagation, two of the most common applications of > >>our > >> >framework, have indeed a linear algebraic formulation, the bulk of > >>which > >> >is always matrix multiplication. The most natural and efficient (if not > >> >the only) implementations in our model involve simply iterating over > >>the > >> >edges. > >> >As for your "dot products" example, maybe it would help if you could > >>be a > >> >bit more specific: what are your vertices and edges? How do we get the > >> >"input vectors"?. Everything should be expressed in terms of vertex > >> >values, edges, and messages, otherwise it doesn't fit the current > >>Giraph > >> >model anyway. > >> >In any case, to compute a dot product efficiently you don't need random > >> >access to both vectors: if you have the non-zero coordinates of vector > >>A > >> >(equivalent of the edge list), and random access to vector B, you > >>iterate > >> >over A and lookup in B. This is efficient. > >> > > >> >> I'm not sure what this would correspond to in graphland, but I can > >> >> certainly see > >> >> wanting to have a big in-memory matrix which you can compute dot > >> >>products > >> >> of vectors with each row of it, and Giraph would do this very > >> >>efficiently > >> >> (but some > >> >> of the implementations of this would assume you had random access to > >>the > >> >> edges (meaning edgeId and edgeWeight of each vertex). > >> > > >> > > >> >Here we are talking about a method with signature > >> > > >> > EdgeValueType getEdgeValue(VertexIdType targetVertexId) > >> > > >> >or the multigraph version > >> > > >> > Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId) > >> > > >> >I'm looking for examples of algorithms that can't be implemented in > >>Giraph > >> >because of the lack of this method, and would otherwise be a good fit. > >> > > >> >Having said that, I think there *is* still a case for hash-map-backed > >> >edges, and that is when we need to guarantee a "strict graph" (i.e. no > >> >parallel edges) or when our algorithm uses remote edge removal > >>requests. > >> >So I'm not against having an optimized LongDoubleHashMapEdges (and > >> >LongNullHashSetEdges for the unweighted case) included, provided that > >>it's > >> >correct and efficient. > >> >I will make sure I include it. Now it's a matter of picking a primitive > >> >collections library among HPPC, fastutil, Mahout, Trove, etc... > >> > > >> >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote: > >> > > >> >>Sorry to have missed out on some of this conversation - had some work > >> >>stuff > >> >>interrupt (how dare it!) > >> >> > >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta > >> >><[email protected]>wrote: > >> >> > >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote: > >> >>> > >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta > >> >>> ><[email protected]>wrote: > >> >>> > > >> >>> >> It's not like it causes problems, it's just that it's a pretty > >>big > >> >>> >> dependency to justify for a small example. > >> >>> >> > >> >>> >> As for the motivation, if your point is to prove the framework's > >> >>> >> superiority in some context, then you can use the simplest > >>possible > >> >>> >> implementation (ArrayList). > >> >>> >> > >> >>> > > >> >>> >This takes up LOTS of memory. Primitives rule, objects drool > >> >>>(memory). > >> >>> > >> >>> Ok, but having to copy the keys and values to external arrays > >>(which is > >> >>> what you have to do with Mahout's hash map) is even worse. A good > >> >>> implementation of (long, double) edges (e.g. for RandomWalkVertex) > >>is > >> >>> primitive arrays. > >> >>> > >> >> > >> >>So it is certainly true that the *typical* usage of iteration over > >>Mahout > >> >>maps > >> >>is by doing this copy of internal values (which is not efficient, no), > >> >>it's > >> >>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p) > >> >>passes in a callback which operates directly on the underlying > >>primitive > >> >>arrays. > >> >> > >> >> > >> >>> > >> >>> > > >> >>> > > >> >>> >> The Giraph framework is all about iterating over edges, so an > >> >>> >> implementation that doesn't support that with reasonable > >>efficiency > >> >>> >> doesn't make a lot of sense to me. > >> >>> >> > >> >>> >> It also follows that hash map-based implementations only make > >>sense > >> >>>for > >> >>> >> algorithms that make use of mutations. > >> >>> >> > >> >>> > > >> >>> >Agreed, maps aren't absolutely necessary for immutable graphs. But > >> >>>random > >> >>> >access to collections of integers _is_ necessary for many > >>algorithms. > >> >>> >It's > >> >>> >not all just iteration over lists. > >> >>> > >> >>> Can you give me some concrete examples of algorithms where random > >> >>>access > >> >>> to the edges is required? > >> >>> I'm really interested in this, because I'm considering killing > >> >>> getEdgeValue() if there are no use cases. > >> >>> > >> >> > >> >>So I'm not a graph person, I think in terms of matrices, but since > >>every > >> >>graph > >> >>has an adjecency matrix, and every matrix is the adjacency matrix of > >>some > >> >>(possibly > >> >>bipartite) graph, lets pretend they're basically the same: > >> >> > >> >>If you load a graph into Giraph, and want to compute the matrix > >>product > >> >>of > >> >>this > >> >>graph with collection of input vectors, then you may want to take > >>each of > >> >>these > >> >>input vectors and compute their dot products with the vertices of the > >> >>graph > >> >>(to > >> >>mix my metaphors: each vertex is essentially a row or column of the > >> >>matrix), > >> >>which can be done by either iterating over the input vector and > >>looking > >> >>up > >> >>randomly for nonzero entries in the vertex, or the reverse. > >> >> > >> >>I'm not sure what this would correspond to in graphland, but I can > >> >>certainly see > >> >>wanting to have a big in-memory matrix which you can compute dot > >>products > >> >>of vectors with each row of it, and Giraph would do this very > >>efficiently > >> >>(but some > >> >>of the implementations of this would assume you had random access to > >>the > >> >>edges (meaning edgeId and edgeWeight of each vertex). > >> >> > >> >> > >> >>> > >> >>> > > >> >>> > > >> >>> >> That said, something like a Trove hash map would probably be more > >> >>> >> appropriate (more efficient than the standard Java HashMap, at > >>the > >> >>> >>expense > >> >>> >> of generality). That could be a good candidate for a > >> >>> >> LongDoubleHashMapEdges implementation. > >> >>> >> I can give that a shot if it sounds good. > >> >>> >> > >> >>> > > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. > >> >>> > >> >>> Whoops, totally missed that part. > >> >>> > >> >>> > > >> >>> >What does Giraph depend on of Mahout now? Just mahout-collections? > >> >>> >That's > >> >>> >not a very big dependency, and has all sorts of > >>primitive-to-primitive > >> >>> >collections, > >> >>> >and from what I've seen benchmarked, just as good or better than > >> >>>Trove. > >> >>> > Carrot2's > >> >>> >hppc may be better yet, but I'm not sure if that is stably released > >> >>>yet. > >> >>> > >> >>> I'll take a look at HPPC and Mahout collections then. They all seem > >>to > >> >>> provide the same stuff, so I'll consider benchmarks and convenience > >>of > >> >>>the > >> >>> API. Thanks for the pointers. > >> >>> > >> >>> > > >> >>> > > >> >>> >> > >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote: > >> >>> >> > >> >>> >> >Is the mahout dependency causing problems? > >> >>> >> > > >> >>> >> >It would be nice if we could actually implement some of the > >> >>>algorithms > >> >>> >> >that > >> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show > >>off > >> >>>how > >> >>> >>it > >> >>> >> >improves things. Using the Mahout primitives can show that it's > >> >>>not > >> >>> >>about > >> >>> >> >the inner loop implementation, but the framework itself... > >> >>> >> > > >> >>> >> > > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman > >> >>> >> ><[email protected]>wrote: > >> >>> >> > > >> >>> >> >> I like the idea of refactoring it into something more > >>appropriate > >> >>> >>for us > >> >>> >> >> and ditching the Mahout dep. Good looking out. > >> >>> >> >> > >> >>> >> >> > >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < > >> >>> >> >> [email protected]> wrote: > >> >>> >> >> > >> >>> >> >> > I agree, at this point we could have a RandomWalkVertex with > >> >>>edge > >> >>> >> >>values, > >> >>> >> >> > and a "null-edged" vertex for the PR benchmarks. > >> >>> >> >> > We make everybody happy and avoid code duplication. > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta > >> >>> >><[email protected] > >> >>> >> >> > >wrote: > >> >>> >> >> > > >> >>> >> >> > > Hi Gianmarco, > >> >>> >> >> > > > >> >>> >> >> > > Yes, there will be more efficient implementations. > >> >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be > >> >>>only > >> >>> >>one > >> >>> >> >> > Vertex > >> >>> >> >> > > class and edge storage is delegated to a VertexEdges > >>class. > >> >>> >> >> > > So far I'm adding some generic implementations > >> >>>(ByteArrayEdges, > >> >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and > >> >>>some > >> >>> >> >> optimized > >> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). > >> >>> >> >> > > > >> >>> >> >> > > Do you specifically need edge values to be float while the > >> >>>other > >> >>> >> >>types > >> >>> >> >> > are > >> >>> >> >> > > double? > >> >>> >> >> > > It seems to me it would make sense to change > >>RandomWalkVertex > >> >>>to > >> >>> >>use > >> >>> >> >> > > double edge values instead, and avoid code duplication > >>(i.e. > >> >>> >>adding > >> >>> >> >>a > >> >>> >> >> > > LongFloatArrayEdges that's basically the same). We're not > >> >>>Trove > >> >>> >> >>after > >> >>> >> >> > all. > >> >>> >> >> > > Makes sense? > >> >>> >> >> > > > >> >>> >> >> > > Thanks for the feedback, > >> >>> >> >> > > > >> >>> >> >> > > Alessandro > >> >>> >> >> > > > >> >>> >> >> > > > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" > >> >>> >> >><[email protected]> > >> >>> >> >> > > wrote: > >> >>> >> >> > > > >> >>> >> >> > > >Hi, > >> >>> >> >> > > > > >> >>> >> >> > > >Maybe the specific implementation can be thrown away, but > >> >>> >> >>personally I > >> >>> >> >> > > >feel > >> >>> >> >> > > >very strongly for the need of a good > >>LongDoubleFloatDouble > >> >>> >>vertex. > >> >>> >> >> > > >It's the base for any serious random walk algorithm. > >> >>> >> >> > > > > >> >>> >> >> > > >I would call for a refactoring rather than a removal. > >> >>> >> >> > > > > >> >>> >> >> > > >Just my 2c. > >> >>> >> >> > > > > >> >>> >> >> > > >Cheers, > >> >>> >> >> > > > > >> >>> >> >> > > >-- > >> >>> >> >> > > >Gianmarco > >> >>> >> >> > > > > >> >>> >> >> > > > > >> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta > >> >>> >> >> > > ><[email protected]>wrote: > >> >>> >> >> > > > > >> >>> >> >> > > >> Hi all, > >> >>> >> >> > > >> > >> >>> >> >> > > >> Does anyone feel strongly for > >>LongDoubleFloatDoubleVertex? > >> >>> >> >> > > >> Reasons why I think it should be removed: > >> >>> >> >> > > >> > >> >>> >> >> > > >> 1. Right now it's incorrect (returns target vertex > >>id > >> >>>as > >> >>> >>edge > >> >>> >> >> > value). > >> >>> >> >> > > >> 2. Iteration will always be inefficient, since the > >> >>> >>underlying > >> >>> >> >> > Mahout > >> >>> >> >> > > >> open-addressing hash map implementation doesn't provide > >> >>> >> >>iterators. > >> >>> >> >> It > >> >>> >> >> > > >> provides a way to copy the keys and values to external > >> >>> >> >>arrays/lists. > >> >>> >> >> > > >> 3. It's the only reason why we have Mahout as a > >> >>>dependency. > >> >>> >> >> > > >> > >> >>> >> >> > > >> I think we should strive to provide model > >>implementations > >> >>>that > >> >>> >> >>are > >> >>> >> >> > > >>generic > >> >>> >> >> > > >> and/or extremely efficient. This one satisfies neither. > >> >>> >> >> > > >> > >> >>> >> >> > > >> Thanks, > >> >>> >> >> > > >> > >> >>> >> >> > > >> Alessandro > >> >>> >> >> > > >> > >> >>> >> >> > > > >> >>> >> >> > > > >> >>> >> >> > > >> >>> >> >> > > >> >>> >> >> > -- > >> >>> >> >> > Claudio Martella > >> >>> >> >> > [email protected] > >> >>> >> >> > > >> >>> >> >> > >> >>> >> > > >> >>> >> > > >> >>> >> > > >> >>> >> >-- > >> >>> >> > > >> >>> >> > -jake > >> >>> >> > >> >>> >> > >> >>> > > >> >>> > > >> >>> >-- > >> >>> > > >> >>> > -jake > >> >>> > >> >>> > >> >> > >> >> > >> >>-- > >> >> > >> >> -jake > >> > > >> > >> > > -- Claudio Martella [email protected]
