That is true, its not the end of the world to make the optimized edge store impl's throw exceptions when you try to mutate. As long as object-oriented edge stores are possible for the user, they have the option to trade efficiency or resources for mutation if they need it (or choose.)
One thing occured to me: if we ever get GIRAPH-26 up and running, Jake had mentioned we would need the Mahout math libs to do it (Someone mentioned on that thread that the Colt libs will not work for us I think.) so thats one reason to keep depending on Mahout in some form. On Fri, Mar 1, 2013 at 10:53 AM, Alessandro Presta <[email protected]>wrote: > On 3/1/13 1:01 AM, "Claudio Martella" <[email protected]> wrote: > > >Imagine you're infering an "is_a" property. e.g. you have lion - is_a -> > >mammal - is_a -> animal. by propagation you'd infer lion - is_a -> animal. > >Clearly, i'm talking about a graph that does not only have is_a typed > >edges. > > > >Two things can be noted: (1) one could filter out non is_a edges directly > >at loading in this case but (2) you'd want more complex of these, e.g. a > >path (a concatenation of is_a with lives_in or something?) to find out at > >query time or something. > > What would be the vertex id in this case? I don't think it would be "is_a" > or "lives_in", right? That would probably be in the edge value. > Also, wouldn't you still need to read all the edges for this type of > application? > > > > >This is one example. I think most of our algorithms out there touch all > >the > >edges and iterate over (why else using such a big framework and loading > >the > >whole graph at first place?), so I'd optimise for them or avoid penalising > >them, but I would not remove from the API random access to edges either. > >After all, the idea is to have them implement their (to come) Edges class. > > > > The problem is that with every method we add to the API we increase the > burden on the user. > Also, more requirements = more constraints = less opportunities for > optimization. > > getEdgeValue() poses a few different issues: > - it will be slow in most implementations > - if we want it to at least be as space-efficient as getEdges(), it would > have to reuse the E object > - the signature for general graphs would have to be Iterable<E> > getEdgeValues(I); in order to have an easier interface for strict graphs, > we would have to make the inheritance structure more complex (adding > something like a StrictVertex) > > That said, nothing prevents the user from adding that functionality > himself to VertexEdges, and then in Vertex casting getEdges() to the known > VertexEdges implementation, and call whatever method (getEdgeValue(), > whatnot) on it. > > > > >On Fri, Mar 1, 2013 at 9:37 AM, Alessandro Presta <[email protected]> > >wrote: > > > >> On 3/1/13 12:33 AM, "Claudio Martella" <[email protected]> > >>wrote: > >> > >> >On Fri, Mar 1, 2013 at 5:13 AM, Alessandro Presta <[email protected]> > >> >wrote: > >> > > >> >> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote: > >> >> > >> >> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta > >> >> ><[email protected]>wrote: > >> >> > > >> >> >> It's not like it causes problems, it's just that it's a pretty big > >> >> >> dependency to justify for a small example. > >> >> >> > >> >> >> As for the motivation, if your point is to prove the framework's > >> >> >> superiority in some context, then you can use the simplest > >>possible > >> >> >> implementation (ArrayList). > >> >> >> > >> >> > > >> >> >This takes up LOTS of memory. Primitives rule, objects drool > >>(memory). > >> >> > >> >> Ok, but having to copy the keys and values to external arrays (which > >>is > >> >> what you have to do with Mahout's hash map) is even worse. A good > >> >> implementation of (long, double) edges (e.g. for RandomWalkVertex) is > >> >> primitive arrays. > >> >> > >> >> > > >> >> > > >> >> >> The Giraph framework is all about iterating over edges, so an > >> >> >> implementation that doesn't support that with reasonable > >>efficiency > >> >> >> doesn't make a lot of sense to me. > >> >> >> > >> >> >> It also follows that hash map-based implementations only make > >>sense > >> >>for > >> >> >> algorithms that make use of mutations. > >> >> >> > >> >> > > >> >> >Agreed, maps aren't absolutely necessary for immutable graphs. But > >> >>random > >> >> >access to collections of integers _is_ necessary for many > >>algorithms. > >> >> >It's > >> >> >not all just iteration over lists. > >> >> > >> >> Can you give me some concrete examples of algorithms where random > >>access > >> >> to the edges is required? > >> >> I'm really interested in this, because I'm considering killing > >> >> getEdgeValue() if there are no use cases. > >> >> > >> > > >> >My example would be typed/labeled graphs (e.g. RDF) for reasoning and > >> >traversals based on particular patterns/path. > >> > > >> > >> Can you be a bit more specific? I'm interested in knowing if this > >> algorithm would call getEdgeValue(id) for some target vertex id. > >> > >> > > >> >> > >> >> > > >> >> > > >> >> >> That said, something like a Trove hash map would probably be more > >> >> >> appropriate (more efficient than the standard Java HashMap, at the > >> >> >>expense > >> >> >> of generality). That could be a good candidate for a > >> >> >> LongDoubleHashMapEdges implementation. > >> >> >> I can give that a shot if it sounds good. > >> >> >> > >> >> > > >> >> >Trove is LGPL, IIRC, so that doesn't work in Apache projects. > >> >> > >> >> Whoops, totally missed that part. > >> >> > >> >> > > >> >> >What does Giraph depend on of Mahout now? Just mahout-collections? > >> >> >That's > >> >> >not a very big dependency, and has all sorts of > >>primitive-to-primitive > >> >> >collections, > >> >> >and from what I've seen benchmarked, just as good or better than > >>Trove. > >> >> > Carrot2's > >> >> >hppc may be better yet, but I'm not sure if that is stably released > >> >>yet. > >> >> > >> >> I'll take a look at HPPC and Mahout collections then. They all seem > >>to > >> >> provide the same stuff, so I'll consider benchmarks and convenience > >>of > >> >>the > >> >> API. Thanks for the pointers. > >> >> > >> >> > > >> >> > > >> >> >> > >> >> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote: > >> >> >> > >> >> >> >Is the mahout dependency causing problems? > >> >> >> > > >> >> >> >It would be nice if we could actually implement some of the > >> >>algorithms > >> >> >> >that > >> >> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off > >> >>how > >> >> >>it > >> >> >> >improves things. Using the Mahout primitives can show that it's > >>not > >> >> >>about > >> >> >> >the inner loop implementation, but the framework itself... > >> >> >> > > >> >> >> > > >> >> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman > >> >> >> ><[email protected]>wrote: > >> >> >> > > >> >> >> >> I like the idea of refactoring it into something more > >>appropriate > >> >> >>for us > >> >> >> >> and ditching the Mahout dep. Good looking out. > >> >> >> >> > >> >> >> >> > >> >> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella < > >> >> >> >> [email protected]> wrote: > >> >> >> >> > >> >> >> >> > I agree, at this point we could have a RandomWalkVertex with > >> >>edge > >> >> >> >>values, > >> >> >> >> > and a "null-edged" vertex for the PR benchmarks. > >> >> >> >> > We make everybody happy and avoid code duplication. > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta > >> >> >><[email protected] > >> >> >> >> > >wrote: > >> >> >> >> > > >> >> >> >> > > Hi Gianmarco, > >> >> >> >> > > > >> >> >> >> > > Yes, there will be more efficient implementations. > >> >> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be > >> >>only > >> >> >>one > >> >> >> >> > Vertex > >> >> >> >> > > class and edge storage is delegated to a VertexEdges class. > >> >> >> >> > > So far I'm adding some generic implementations > >> >>(ByteArrayEdges, > >> >> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and > >> >>some > >> >> >> >> optimized > >> >> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges). > >> >> >> >> > > > >> >> >> >> > > Do you specifically need edge values to be float while the > >> >>other > >> >> >> >>types > >> >> >> >> > are > >> >> >> >> > > double? > >> >> >> >> > > It seems to me it would make sense to change > >>RandomWalkVertex > >> >>to > >> >> >>use > >> >> >> >> > > double edge values instead, and avoid code duplication > >>(i.e. > >> >> >>adding > >> >> >> >>a > >> >> >> >> > > LongFloatArrayEdges that's basically the same). We're not > >> >>Trove > >> >> >> >>after > >> >> >> >> > all. > >> >> >> >> > > Makes sense? > >> >> >> >> > > > >> >> >> >> > > Thanks for the feedback, > >> >> >> >> > > > >> >> >> >> > > Alessandro > >> >> >> >> > > > >> >> >> >> > > > >> >> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales" > >> >> >> >><[email protected]> > >> >> >> >> > > wrote: > >> >> >> >> > > > >> >> >> >> > > >Hi, > >> >> >> >> > > > > >> >> >> >> > > >Maybe the specific implementation can be thrown away, but > >> >> >> >>personally I > >> >> >> >> > > >feel > >> >> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble > >> >> >>vertex. > >> >> >> >> > > >It's the base for any serious random walk algorithm. > >> >> >> >> > > > > >> >> >> >> > > >I would call for a refactoring rather than a removal. > >> >> >> >> > > > > >> >> >> >> > > >Just my 2c. > >> >> >> >> > > > > >> >> >> >> > > >Cheers, > >> >> >> >> > > > > >> >> >> >> > > >-- > >> >> >> >> > > >Gianmarco > >> >> >> >> > > > > >> >> >> >> > > > > >> >> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta > >> >> >> >> > > ><[email protected]>wrote: > >> >> >> >> > > > > >> >> >> >> > > >> Hi all, > >> >> >> >> > > >> > >> >> >> >> > > >> Does anyone feel strongly for > >>LongDoubleFloatDoubleVertex? > >> >> >> >> > > >> Reasons why I think it should be removed: > >> >> >> >> > > >> > >> >> >> >> > > >> 1. Right now it's incorrect (returns target vertex > >>id as > >> >> >>edge > >> >> >> >> > value). > >> >> >> >> > > >> 2. Iteration will always be inefficient, since the > >> >> >>underlying > >> >> >> >> > Mahout > >> >> >> >> > > >> open-addressing hash map implementation doesn't provide > >> >> >> >>iterators. > >> >> >> >> It > >> >> >> >> > > >> provides a way to copy the keys and values to external > >> >> >> >>arrays/lists. > >> >> >> >> > > >> 3. It's the only reason why we have Mahout as a > >> >>dependency. > >> >> >> >> > > >> > >> >> >> >> > > >> I think we should strive to provide model > >>implementations > >> >>that > >> >> >> >>are > >> >> >> >> > > >>generic > >> >> >> >> > > >> and/or extremely efficient. This one satisfies neither. > >> >> >> >> > > >> > >> >> >> >> > > >> Thanks, > >> >> >> >> > > >> > >> >> >> >> > > >> Alessandro > >> >> >> >> > > >> > >> >> >> >> > > > >> >> >> >> > > > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > -- > >> >> >> >> > Claudio Martella > >> >> >> >> > [email protected] > >> >> >> >> > > >> >> >> >> > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> >-- > >> >> >> > > >> >> >> > -jake > >> >> >> > >> >> >> > >> >> > > >> >> > > >> >> >-- > >> >> > > >> >> > -jake > >> >> > >> >> > >> > > >> > > >> >-- > >> > Claudio Martella > >> > [email protected] > >> > >> > > > > > >-- > > Claudio Martella > > [email protected] > >
