This addresses all the points we discussed: https://reviews.apache.org/r/9732/ I went with Fastutil because it fares well in benchmarks, has a reasonable API, and still allows for access to internals (for example, in the array-backed version, I was able to make removing a single edge O(1) by simply replacing it with the rightmost one).
On 3/4/13 4:44 PM, "Jake Mannix" <[email protected]> wrote: >We could *shudder* write templates for basic abstract primitive Vertex >classes (Abstract$KeyType$ValueTypeVertex.java.t ...) a la Mahout >collections... > > >On Mon, Mar 4, 2013 at 3:03 PM, Claudio Martella ><[email protected] >> wrote: > >> I guess we could do some crazy reflection code and get the types as >> parameters... i also do feel that sometimes Long ids is overkill, for >> example. >> >> >> On Mon, Mar 4, 2013 at 11:52 PM, Alessandro Presta <[email protected] >> >wrote: >> >> > >> > >> > On 3/4/13 2:47 PM, "Gianmarco De Francisci Morales" <[email protected]> >> > wrote: >> > >> > >The interface you propose sounds very reasonable. +1 for that. >> > > >> > >For floats vs double, they consume much less memory, O(|E|) >>reduction, >> and >> > >there is basically never any need to have a double precision on edges >> > >because they are immutable. >> > >Anyway it's not a strong constraint, double edges would work as well. >> > >> > Sure, makes sense. The question here is whether we should be providing >> all >> > possible combinations (basically duplicating a lot of code), or let >>the >> > user copy our model implementations. >> > If there is enough interest, we can always add all combinations of >> > (Long/Int, Double/Float/Null, Array/HashMap), although it would be >>tiring >> > and still not comprehensive (what about String ids?). >> > >> > > >> > >Cheers, >> > > >> > >-- >> > >Gianmarco >> > > >> > > >> > >On Mon, Mar 4, 2013 at 9:58 PM, Alessandro Presta <[email protected]> >> > >wrote: >> > > >> > >> Maja and I have thought of a way we could include this type of >> > >> functionality without it being burdensome of the user. >> > >> >> > >> We could add the following interfaces: >> > >> >> > >> interface StrictRandomAccesVertexEdges extends VertexEdges { >> > >> E getEdgeValue(I) >> > >> } >> > >> >> > >> interface MultiRandomAccessVertexEdges extends VertexEdges { >> > >> Iterable<E> getAllEdgeValues(I) >> > >> } >> > >> >> > >> We could add the following to the Vertex interface: >> > >> >> > >> // Returns the first edge value for a given target vertex id (or >>null >> if >> > >> none) >> > >> E getEdgeValue(I targetVertexId) { >> > >> // if we have an instance of StrictRandomAccessVertexEdges, we >> > >>delegate; >> > >> // otherwise, we iterate over the edges to find the first one >> > >> } >> > >> >> > >> // Returns an Iterable over all edge values for a given target >>vertex >> id >> > >> (only useful for multigraphs) >> > >> Iterable<E> getAllEdgeValues(I targetVertexId) { >> > >> // if we have an instance of MultiRandomAccessVertexEdges, we >> > >>delegate; >> > >> // otherwise, we wrap the edges into an Iterable that filters for >> the >> > >> matching ones >> > >> } >> > >> >> > >> This way, only VertexEdges implementations where it makes sense >>would >> > >> define the random-access methods, and a default (slow!) >>implementation >> > >> would be available in all other cases. >> > >> >> > >> How does this sound? I could include it in GIRAPH-528. >> > >> >> > >> On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote: >> > >> >> > >> >I don't contest the graph-matrix duality, but I'm not yet >>convinced >> > >>that >> > >> >you need random access to the edges for the scenarios you >>mentioned. >> > >> >PageRank and label propagation, two of the most common >>applications >> of >> > >>our >> > >> >framework, have indeed a linear algebraic formulation, the bulk of >> > >>which >> > >> >is always matrix multiplication. The most natural and efficient >>(if >> not >> > >> >the only) implementations in our model involve simply iterating >>over >> > >>the >> > >> >edges. >> > >> >As for your "dot products" example, maybe it would help if you >>could >> > >>be a >> > >> >bit more specific: what are your vertices and edges? How do we get >> the >> > >> >"input vectors"?. Everything should be expressed in terms of >>vertex >> > >> >values, edges, and messages, otherwise it doesn't fit the current >> > >>Giraph >> > >> >model anyway. >> > >> >In any case, to compute a dot product efficiently you don't need >> random >> > >> >access to both vectors: if you have the non-zero coordinates of >> vector >> > >>A >> > >> >(equivalent of the edge list), and random access to vector B, you >> > >>iterate >> > >> >over A and lookup in B. This is efficient. >> > >> > >> > >> >> I'm not sure what this would correspond to in graphland, but I >>can >> > >> >> certainly see >> > >> >> wanting to have a big in-memory matrix which you can compute dot >> > >> >>products >> > >> >> of vectors with each row of it, and Giraph would do this very >> > >> >>efficiently >> > >> >> (but some >> > >> >> of the implementations of this would assume you had random >>access >> to >> > >>the >> > >> >> edges (meaning edgeId and edgeWeight of each vertex). >> > >> > >> > >> > >> > >> >Here we are talking about a method with signature >> > >> > >> > >> > EdgeValueType getEdgeValue(VertexIdType targetVertexId) >> > >> > >> > >> >or the multigraph version >> > >> > >> > >> > Iterable<EdgeValueType> getEdgeValues(VertexIdType >> targetVertexId) >> > >> > >> > >> >I'm looking for examples of algorithms that can't be implemented >>in >> > >>Giraph >> > >> >because of the lack of this method, and would otherwise be a good >> fit. >> > >> > >> > >> >Having said that, I think there *is* still a case for >>hash-map-backed >> > >> >edges, and that is when we need to guarantee a "strict graph" >>(i.e. >> no >> > >> >parallel edges) or when our algorithm uses remote edge removal >> > >>requests. >> > >> >So I'm not against having an optimized LongDoubleHashMapEdges (and >> > >> >LongNullHashSetEdges for the unweighted case) included, provided >>that >> > >>it's >> > >> >correct and efficient. >> > >> >I will make sure I include it. Now it's a matter of picking a >> primitive >> > >> >collections library among HPPC, fastutil, Mahout, Trove, etc... >> > >> > >> > >> >On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote: >> > >> > >> > >> >>Sorry to have missed out on some of this conversation - had some >> work >> > >> >>stuff >> > >> >>interrupt (how dare it!) >> > >> >> >> > >> >>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta >> > >> >><[email protected]>wrote: >> > >> >> >> > >> >>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> >>wrote: >> > >> >>> >> > >> >>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta >> > >> >>> ><[email protected]>wrote: >> > >> >>> > >> > >> >>> >> It's not like it causes problems, it's just that it's a >>pretty >> > >>big >> > >> >>> >> dependency to justify for a small example. >> > >> >>> >> >> > >> >>> >> As for the motivation, if your point is to prove the >> framework's >> > >> >>> >> superiority in some context, then you can use the simplest >> > >>possible >> > >> >>> >> implementation (ArrayList). >> > >> >>> >> >> > >> >>> > >> > >> >>> >This takes up LOTS of memory. Primitives rule, objects drool >> > >> >>>(memory). >> > >> >>> >> > >> >>> Ok, but having to copy the keys and values to external arrays >> > >>(which is >> > >> >>> what you have to do with Mahout's hash map) is even worse. A >>good >> > >> >>> implementation of (long, double) edges (e.g. for >>RandomWalkVertex) >> > >>is >> > >> >>> primitive arrays. >> > >> >>> >> > >> >> >> > >> >>So it is certainly true that the *typical* usage of iteration >>over >> > >>Mahout >> > >> >>maps >> > >> >>is by doing this copy of internal values (which is not efficient, >> no), >> > >> >>it's >> > >> >>not necessary: >>OpenLongFloatHashMap.forEachPair(LongFloatProcedure >> p) >> > >> >>passes in a callback which operates directly on the underlying >> > >>primitive >> > >> >>arrays. >> > >> >> >> > >> >> >> > >> >>> >> > >> >>> > >> > >> >>> > >> > >> >>> >> The Giraph framework is all about iterating over edges, so >>an >> > >> >>> >> implementation that doesn't support that with reasonable >> > >>efficiency >> > >> >>> >> doesn't make a lot of sense to me. >> > >> >>> >> >> > >> >>> >> It also follows that hash map-based implementations only >>make >> > >>sense >> > >> >>>for >> > >> >>> >> algorithms that make use of mutations. >> > >> >>> >> >> > >> >>> > >> > >> >>> >Agreed, maps aren't absolutely necessary for immutable graphs. >> But >> > >> >>>random >> > >> >>> >access to collections of integers _is_ necessary for many >> > >>algorithms. >> > >> >>> >It's >> > >> >>> >not all just iteration over lists. >> > >> >>> >> > >> >>> Can you give me some concrete examples of algorithms where >>random >> > >> >>>access >> > >> >>> to the edges is required? >> > >> >>> I'm really interested in this, because I'm considering killing >> > >> >>> getEdgeValue() if there are no use cases. >> > >> >>> >> > >> >> >> > >> >>So I'm not a graph person, I think in terms of matrices, but >>since >> > >>every >> > >> >>graph >> > >> >>has an adjecency matrix, and every matrix is the adjacency >>matrix of >> > >>some >> > >> >>(possibly >> > >> >>bipartite) graph, lets pretend they're basically the same: >> > >> >> >> > >> >>If you load a graph into Giraph, and want to compute the matrix >> > >>product >> > >> >>of >> > >> >>this >> > >> >>graph with collection of input vectors, then you may want to take >> > >>each of >> > >> >>these >> > >> >>input vectors and compute their dot products with the vertices of >> the >> > >> >>graph >> > >> >>(to >> > >> >>mix my metaphors: each vertex is essentially a row or column of >>the >> > >> >>matrix), >> > >> >>which can be done by either iterating over the input vector and >> > >>looking >> > >> >>up >> > >> >>randomly for nonzero entries in the vertex, or the reverse. >> > >> >> >> > >> >>I'm not sure what this would correspond to in graphland, but I >>can >> > >> >>certainly see >> > >> >>wanting to have a big in-memory matrix which you can compute dot >> > >>products >> > >> >>of vectors with each row of it, and Giraph would do this very >> > >>efficiently >> > >> >>(but some >> > >> >>of the implementations of this would assume you had random >>access to >> > >>the >> > >> >>edges (meaning edgeId and edgeWeight of each vertex). >> > >> >> >> > >> >> >> > >> >>> >> > >> >>> > >> > >> >>> > >> > >> >>> >> That said, something like a Trove hash map would probably be >> more >> > >> >>> >> appropriate (more efficient than the standard Java HashMap, >>at >> > >>the >> > >> >>> >>expense >> > >> >>> >> of generality). That could be a good candidate for a >> > >> >>> >> LongDoubleHashMapEdges implementation. >> > >> >>> >> I can give that a shot if it sounds good. >> > >> >>> >> >> > >> >>> > >> > >> >>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. >> > >> >>> >> > >> >>> Whoops, totally missed that part. >> > >> >>> >> > >> >>> > >> > >> >>> >What does Giraph depend on of Mahout now? Just >> mahout-collections? >> > >> >>> >That's >> > >> >>> >not a very big dependency, and has all sorts of >> > >>primitive-to-primitive >> > >> >>> >collections, >> > >> >>> >and from what I've seen benchmarked, just as good or better >>than >> > >> >>>Trove. >> > >> >>> > Carrot2's >> > >> >>> >hppc may be better yet, but I'm not sure if that is stably >> released >> > >> >>>yet. >> > >> >>> >> > >> >>> I'll take a look at HPPC and Mahout collections then. They all >> seem >> > >>to >> > >> >>> provide the same stuff, so I'll consider benchmarks and >> convenience >> > >>of >> > >> >>>the >> > >> >>> API. Thanks for the pointers. >> > >> >>> >> > >> >>> > >> > >> >>> > >> > >> >>> >> >> > >> >>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> >> wrote: >> > >> >>> >> >> > >> >>> >> >Is the mahout dependency causing problems? >> > >> >>> >> > >> > >> >>> >> >It would be nice if we could actually implement some of the >> > >> >>>algorithms >> > >> >>> >> >that >> > >> >>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to >>show >> > >>off >> > >> >>>how >> > >> >>> >>it >> > >> >>> >> >improves things. Using the Mahout primitives can show that >> it's >> > >> >>>not >> > >> >>> >>about >> > >> >>> >> >the inner loop implementation, but the framework itself... >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman >> > >> >>> >> ><[email protected]>wrote: >> > >> >>> >> > >> > >> >>> >> >> I like the idea of refactoring it into something more >> > >>appropriate >> > >> >>> >>for us >> > >> >>> >> >> and ditching the Mahout dep. Good looking out. >> > >> >>> >> >> >> > >> >>> >> >> >> > >> >>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < >> > >> >>> >> >> [email protected]> wrote: >> > >> >>> >> >> >> > >> >>> >> >> > I agree, at this point we could have a RandomWalkVertex >> with >> > >> >>>edge >> > >> >>> >> >>values, >> > >> >>> >> >> > and a "null-edged" vertex for the PR benchmarks. >> > >> >>> >> >> > We make everybody happy and avoid code duplication. >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta >> > >> >>> >><[email protected] >> > >> >>> >> >> > >wrote: >> > >> >>> >> >> > >> > >> >>> >> >> > > Hi Gianmarco, >> > >> >>> >> >> > > >> > >> >>> >> >> > > Yes, there will be more efficient implementations. >> > >> >>> >> >> > > In the redesign I'm working on (GIRAPH-528), there >>will >> be >> > >> >>>only >> > >> >>> >>one >> > >> >>> >> >> > Vertex >> > >> >>> >> >> > > class and edge storage is delegated to a VertexEdges >> > >>class. >> > >> >>> >> >> > > So far I'm adding some generic implementations >> > >> >>>(ByteArrayEdges, >> > >> >>> >> >> > > ArrayListEdges, HashMapEdges) that work for all >>types, >> and >> > >> >>>some >> > >> >>> >> >> optimized >> > >> >>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). >> > >> >>> >> >> > > >> > >> >>> >> >> > > Do you specifically need edge values to be float >>while >> the >> > >> >>>other >> > >> >>> >> >>types >> > >> >>> >> >> > are >> > >> >>> >> >> > > double? >> > >> >>> >> >> > > It seems to me it would make sense to change >> > >>RandomWalkVertex >> > >> >>>to >> > >> >>> >>use >> > >> >>> >> >> > > double edge values instead, and avoid code >>duplication >> > >>(i.e. >> > >> >>> >>adding >> > >> >>> >> >>a >> > >> >>> >> >> > > LongFloatArrayEdges that's basically the same). We're >> not >> > >> >>>Trove >> > >> >>> >> >>after >> > >> >>> >> >> > all. >> > >> >>> >> >> > > Makes sense? >> > >> >>> >> >> > > >> > >> >>> >> >> > > Thanks for the feedback, >> > >> >>> >> >> > > >> > >> >>> >> >> > > Alessandro >> > >> >>> >> >> > > >> > >> >>> >> >> > > >> > >> >>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" >> > >> >>> >> >><[email protected]> >> > >> >>> >> >> > > wrote: >> > >> >>> >> >> > > >> > >> >>> >> >> > > >Hi, >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >Maybe the specific implementation can be thrown >>away, >> but >> > >> >>> >> >>personally I >> > >> >>> >> >> > > >feel >> > >> >>> >> >> > > >very strongly for the need of a good >> > >>LongDoubleFloatDouble >> > >> >>> >>vertex. >> > >> >>> >> >> > > >It's the base for any serious random walk algorithm. >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >I would call for a refactoring rather than a >>removal. >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >Just my 2c. >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >Cheers, >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >-- >> > >> >>> >> >> > > >Gianmarco >> > >> >>> >> >> > > > >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta >> > >> >>> >> >> > > ><[email protected]>wrote: >> > >> >>> >> >> > > > >> > >> >>> >> >> > > >> Hi all, >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> Does anyone feel strongly for >> > >>LongDoubleFloatDoubleVertex? >> > >> >>> >> >> > > >> Reasons why I think it should be removed: >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> 1. Right now it's incorrect (returns target >>vertex >> > >>id >> > >> >>>as >> > >> >>> >>edge >> > >> >>> >> >> > value). >> > >> >>> >> >> > > >> 2. Iteration will always be inefficient, since >>the >> > >> >>> >>underlying >> > >> >>> >> >> > Mahout >> > >> >>> >> >> > > >> open-addressing hash map implementation doesn't >> provide >> > >> >>> >> >>iterators. >> > >> >>> >> >> It >> > >> >>> >> >> > > >> provides a way to copy the keys and values to >> external >> > >> >>> >> >>arrays/lists. >> > >> >>> >> >> > > >> 3. It's the only reason why we have Mahout as a >> > >> >>>dependency. >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> I think we should strive to provide model >> > >>implementations >> > >> >>>that >> > >> >>> >> >>are >> > >> >>> >> >> > > >>generic >> > >> >>> >> >> > > >> and/or extremely efficient. This one satisfies >> neither. >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> Thanks, >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> Alessandro >> > >> >>> >> >> > > >> >> > >> >>> >> >> > > >> > >> >>> >> >> > > >> > >> >>> >> >> > >> > >> >>> >> >> > >> > >> >>> >> >> > -- >> > >> >>> >> >> > Claudio Martella >> > >> >>> >> >> > [email protected] >> > >> >>> >> >> > >> > >> >>> >> >> >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> > >> > >> >>> >> >-- >> > >> >>> >> > >> > >> >>> >> > -jake >> > >> >>> >> >> > >> >>> >> >> > >> >>> > >> > >> >>> > >> > >> >>> >-- >> > >> >>> > >> > >> >>> > -jake >> > >> >>> >> > >> >>> >> > >> >> >> > >> >> >> > >> >>-- >> > >> >> >> > >> >> -jake >> > >> > >> > >> >> > >> >> > >> > >> >> >> -- >> Claudio Martella >> [email protected] >> > > > >-- > > -jake
