Maja and I have thought of a way we could include this type of
functionality without it being burdensome of the user.

We could add the following interfaces:

interface StrictRandomAccesVertexEdges extends VertexEdges {
  E getEdgeValue(I)
}

interface MultiRandomAccessVertexEdges extends VertexEdges {
  Iterable<E> getAllEdgeValues(I)
}

We could add the following to the Vertex interface:

// Returns the first edge value for a given target vertex id (or null if
none)
E getEdgeValue(I targetVertexId) {
  // if we have an instance of StrictRandomAccessVertexEdges, we delegate;
  // otherwise, we iterate over the edges to find the first one
}

// Returns an Iterable over all edge values for a given target vertex id
(only useful for multigraphs)
Iterable<E> getAllEdgeValues(I targetVertexId) {
  // if we have an instance of MultiRandomAccessVertexEdges, we delegate;
  // otherwise, we wrap the edges into an Iterable that filters for the
matching ones
}

This way, only VertexEdges implementations where it makes sense would
define the random-access methods, and a default (slow!) implementation
would be available in all other cases.

How does this sound? I could include it in GIRAPH-528.

On 3/3/13 4:57 PM, "Alessandro Presta" <[email protected]> wrote:

>I don't contest the graph-matrix duality, but I'm not yet convinced that
>you need random access to the edges for the scenarios you mentioned.
>PageRank and label propagation, two of the most common applications of our
>framework, have indeed a linear algebraic formulation, the bulk of which
>is always matrix multiplication. The most natural and efficient (if not
>the only) implementations in our model involve simply iterating over the
>edges.
>As for your "dot products" example, maybe it would help if you could be a
>bit more specific: what are your vertices and edges? How do we get the
>"input vectors"?. Everything should be expressed in terms of vertex
>values, edges, and messages, otherwise it doesn't fit the current Giraph
>model anyway.
>In any case, to compute a dot product efficiently you don't need random
>access to both vectors: if you have the non-zero coordinates of vector A
>(equivalent of the edge list), and random access to vector B, you iterate
>over A and lookup in B. This is efficient.
>
>> I'm not sure what this would correspond to in graphland, but I can
>> certainly see
>> wanting to have a big in-memory matrix which you can compute dot
>>products
>> of vectors with each row of it, and Giraph would do this very
>>efficiently
>> (but some
>> of the implementations of this would assume you had random access to the
>> edges (meaning edgeId and edgeWeight of each vertex).
>
>
>Here we are talking about a method with signature
>
>    EdgeValueType getEdgeValue(VertexIdType targetVertexId)
>
>or the multigraph version
>
>    Iterable<EdgeValueType> getEdgeValues(VertexIdType targetVertexId)
>
>I'm looking for examples of algorithms that can't be implemented in Giraph
>because of the lack of this method, and would otherwise be a good fit.
>
>Having said that, I think there *is* still a case for hash-map-backed
>edges, and that is when we need to guarantee a "strict graph" (i.e. no
>parallel edges) or when our algorithm uses remote edge removal requests.
>So I'm not against having an optimized LongDoubleHashMapEdges (and
>LongNullHashSetEdges for the unweighted case) included, provided that it's
>correct and efficient.
>I will make sure I include it. Now it's a matter of picking a primitive
>collections library among HPPC, fastutil, Mahout, Trove, etc...
>
>On 3/1/13 4:12 PM, "Jake Mannix" <[email protected]> wrote:
>
>>Sorry to have missed out on some of this conversation - had some work
>>stuff
>>interrupt (how dare it!)
>>
>>On Thu, Feb 28, 2013 at 8:13 PM, Alessandro Presta
>><[email protected]>wrote:
>>
>>> On 2/28/13 5:08 PM, "Jake Mannix" <[email protected]> wrote:
>>>
>>> >On Thu, Feb 28, 2013 at 4:18 PM, Alessandro Presta
>>> ><[email protected]>wrote:
>>> >
>>> >> It's not like it causes problems, it's just that it's a pretty big
>>> >> dependency to justify for a small example.
>>> >>
>>> >> As for the motivation, if your point is to prove the framework's
>>> >> superiority in some context, then you can use the simplest possible
>>> >> implementation (ArrayList).
>>> >>
>>> >
>>> >This takes up LOTS of memory.  Primitives rule, objects drool
>>>(memory).
>>>
>>> Ok, but having to copy the keys and values to external arrays (which is
>>> what you have to do with Mahout's hash map) is even worse. A good
>>> implementation of (long, double) edges (e.g. for RandomWalkVertex) is
>>> primitive arrays.
>>>
>>
>>So it is certainly true that the *typical* usage of iteration over Mahout
>>maps
>>is by doing this copy of internal values (which is not efficient, no),
>>it's
>>not necessary: OpenLongFloatHashMap.forEachPair(LongFloatProcedure p)
>>passes in a callback which operates directly on the underlying primitive
>>arrays.
>>
>>
>>>
>>> >
>>> >
>>> >> The Giraph framework is all about iterating over edges, so an
>>> >> implementation that doesn't support that with reasonable efficiency
>>> >> doesn't make a lot of sense to me.
>>> >>
>>> >> It also follows that hash map-based implementations only make sense
>>>for
>>> >> algorithms that make use of mutations.
>>> >>
>>> >
>>> >Agreed, maps aren't absolutely necessary for immutable graphs.  But
>>>random
>>> >access to collections of integers _is_ necessary for many algorithms.
>>> >It's
>>> >not all just iteration over lists.
>>>
>>> Can you give me some concrete examples of algorithms where random
>>>access
>>> to the edges is required?
>>> I'm really interested in this, because I'm considering killing
>>> getEdgeValue() if there are no use cases.
>>>
>>
>>So I'm not a graph person, I think in terms of matrices, but since every
>>graph
>>has an adjecency matrix, and every matrix is the adjacency matrix of some
>>(possibly
>>bipartite) graph, lets pretend they're basically the same:
>>
>>If you load a graph into Giraph, and want to compute the matrix product
>>of
>>this
>>graph with collection of input vectors, then you may want to take each of
>>these
>>input vectors and compute their dot products with the vertices of the
>>graph
>>(to
>>mix my metaphors: each vertex is essentially a row or column of the
>>matrix),
>>which can be done by either iterating over the input vector and looking
>>up
>>randomly for nonzero entries in the vertex, or the reverse.
>>
>>I'm not sure what this would correspond to in graphland, but I can
>>certainly see
>>wanting to have a big in-memory matrix which you can compute dot products
>>of vectors with each row of it, and Giraph would do this very efficiently
>>(but some
>>of the implementations of this would assume you had random access to the
>>edges (meaning edgeId and edgeWeight of each vertex).
>>
>>
>>>
>>> >
>>> >
>>> >> That said, something like a Trove hash map would probably be more
>>> >> appropriate (more efficient than the standard Java HashMap, at the
>>> >>expense
>>> >> of generality). That could be a good candidate for a
>>> >> LongDoubleHashMapEdges implementation.
>>> >> I can give that a shot if it sounds good.
>>> >>
>>> >
>>> >Trove is LGPL, IIRC, so that doesn't work in Apache projects.
>>>
>>> Whoops, totally missed that part.
>>>
>>> >
>>> >What does Giraph depend on of Mahout now?  Just mahout-collections?
>>> >That's
>>> >not a very big dependency, and has all sorts of primitive-to-primitive
>>> >collections,
>>> >and from what I've seen benchmarked, just as good or better than
>>>Trove.
>>> > Carrot2's
>>> >hppc may be better yet, but I'm not sure if that is stably released
>>>yet.
>>>
>>> I'll take a look at HPPC and Mahout collections then. They all seem to
>>> provide the same stuff, so I'll consider benchmarks and convenience of
>>>the
>>> API. Thanks for the pointers.
>>>
>>> >
>>> >
>>> >>
>>> >> On 2/28/13 4:03 PM, "Jake Mannix" <[email protected]> wrote:
>>> >>
>>> >> >Is the mahout dependency causing problems?
>>> >> >
>>> >> >It would be nice if we could actually implement some of the
>>>algorithms
>>> >> >that
>>> >> >Mahout does via map-reduce in Giraph's BSP formalism, to show off
>>>how
>>> >>it
>>> >> >improves things.  Using the Mahout primitives can show that it's
>>>not
>>> >>about
>>> >> >the inner loop implementation, but the framework itself...
>>> >> >
>>> >> >
>>> >> >On Thu, Feb 28, 2013 at 1:55 PM, Eli Reisman
>>> >> ><[email protected]>wrote:
>>> >> >
>>> >> >> I like the idea of refactoring it into something more appropriate
>>> >>for us
>>> >> >> and ditching the Mahout dep. Good looking out.
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Feb 28, 2013 at 10:15 AM, Claudio Martella <
>>> >> >> [email protected]> wrote:
>>> >> >>
>>> >> >> > I agree, at this point we could have a RandomWalkVertex with
>>>edge
>>> >> >>values,
>>> >> >> > and a "null-edged" vertex for the PR benchmarks.
>>> >> >> > We make everybody happy and avoid code duplication.
>>> >> >> >
>>> >> >> >
>>> >> >> > On Thu, Feb 28, 2013 at 7:12 PM, Alessandro Presta
>>> >><[email protected]
>>> >> >> > >wrote:
>>> >> >> >
>>> >> >> > > Hi Gianmarco,
>>> >> >> > >
>>> >> >> > > Yes, there will be more efficient implementations.
>>> >> >> > > In the redesign I'm working on (GIRAPH-528), there will be
>>>only
>>> >>one
>>> >> >> > Vertex
>>> >> >> > > class and edge storage is delegated to a VertexEdges class.
>>> >> >> > > So far I'm adding some generic implementations
>>>(ByteArrayEdges,
>>> >> >> > > ArrayListEdges, HashMapEdges) that work for all types, and
>>>some
>>> >> >> optimized
>>> >> >> > > ones (LongDoubleArrayEdges, LongNullArrayEdges).
>>> >> >> > >
>>> >> >> > > Do you specifically need edge values to be float while the
>>>other
>>> >> >>types
>>> >> >> > are
>>> >> >> > > double?
>>> >> >> > > It seems to me it would make sense to change RandomWalkVertex
>>>to
>>> >>use
>>> >> >> > > double edge values instead, and avoid code duplication (i.e.
>>> >>adding
>>> >> >>a
>>> >> >> > > LongFloatArrayEdges that's basically the same). We're not
>>>Trove
>>> >> >>after
>>> >> >> > all.
>>> >> >> > > Makes sense?
>>> >> >> > >
>>> >> >> > > Thanks for the feedback,
>>> >> >> > >
>>> >> >> > > Alessandro
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > On 2/28/13 1:54 AM, "Gianmarco De Francisci Morales"
>>> >> >><[email protected]>
>>> >> >> > > wrote:
>>> >> >> > >
>>> >> >> > > >Hi,
>>> >> >> > > >
>>> >> >> > > >Maybe the specific implementation can be thrown away, but
>>> >> >>personally I
>>> >> >> > > >feel
>>> >> >> > > >very strongly for the need of a good LongDoubleFloatDouble
>>> >>vertex.
>>> >> >> > > >It's the base for any serious random walk algorithm.
>>> >> >> > > >
>>> >> >> > > >I would call for a refactoring rather than a removal.
>>> >> >> > > >
>>> >> >> > > >Just my 2c.
>>> >> >> > > >
>>> >> >> > > >Cheers,
>>> >> >> > > >
>>> >> >> > > >--
>>> >> >> > > >Gianmarco
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > >On Thu, Feb 28, 2013 at 7:54 AM, Alessandro Presta
>>> >> >> > > ><[email protected]>wrote:
>>> >> >> > > >
>>> >> >> > > >> Hi all,
>>> >> >> > > >>
>>> >> >> > > >> Does anyone feel strongly for LongDoubleFloatDoubleVertex?
>>> >> >> > > >> Reasons why I think it should be removed:
>>> >> >> > > >>
>>> >> >> > > >>   1.  Right now it's incorrect (returns target vertex id
>>>as
>>> >>edge
>>> >> >> > value).
>>> >> >> > > >>   2.  Iteration will always be inefficient, since the
>>> >>underlying
>>> >> >> > Mahout
>>> >> >> > > >> open-addressing hash map implementation doesn't provide
>>> >> >>iterators.
>>> >> >> It
>>> >> >> > > >> provides a way to copy the keys and values to external
>>> >> >>arrays/lists.
>>> >> >> > > >>   3.  It's the only reason why we have Mahout as a
>>>dependency.
>>> >> >> > > >>
>>> >> >> > > >> I think we should strive to provide model implementations
>>>that
>>> >> >>are
>>> >> >> > > >>generic
>>> >> >> > > >> and/or extremely efficient. This one satisfies neither.
>>> >> >> > > >>
>>> >> >> > > >> Thanks,
>>> >> >> > > >>
>>> >> >> > > >> Alessandro
>>> >> >> > > >>
>>> >> >> > >
>>> >> >> > >
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> >    Claudio Martella
>>> >> >> >    [email protected]
>>> >> >> >
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> >--
>>> >> >
>>> >> >  -jake
>>> >>
>>> >>
>>> >
>>> >
>>> >--
>>> >
>>> >  -jake
>>>
>>>
>>
>>
>>-- 
>>
>>  -jake
>

Reply via email to