I am going to buck the trend and not inline my thoughts, this is
getting a little too thready :)
Methinks you will want an updateEdgeValue(), too.
On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching <ach...@apache.org> wrote:
> On 9/7/11 3:00 PM, Jake Mannix wrote:
> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching <ach...@apache.org> wrote:
>> Haha, this really is turning into a movie =). I'll start warming up the
> Yeah, I've got my co-workers wondering if I'm going to ship any actual
> production code *inside* the company this week, at this rate (*shhhhh*
> Dmitriy don't tell!)
> It's only Wednesday...=)
>> On 9/7/11 12:51 PM, Jake Mannix wrote:
>> Maybe a few more examples would help? Cases where you want to do a BSP
>> computation where the total sort (both the vertexes, and the edges for each
>> vertex) is required, as is the random access nature of the Map?
>> I think the range based examples are the ones that immediately come to
>> mind. BSP for graph processing is still pretty new, and I have no idea what
>> kind of interesting algorithms will be tried out on this platform. We are
>> still exploring many possible algorithms to run.
> Ok, cool. I can see how wanting flexibility is important.
>> I think the idea was that after returning the map, users could directly
>> manipulate the map of edges or use the interfaces, there should probably be
>> a removeEdge() too. I'm starting to feel that we should remove Edge from
>> the user perspective, just keep it internally only for the add edge
>> requests. It just makes things a little more complex to the user (too many
>> ways to do the same thing). Perhaps the interfaces you specified could hit
>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), and
>> removeEdge()). If there turns out to be a big need, we can always change it
>> back to a SortedMap or something else more appropriate.
> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
> sound like the right level of flexibility while keeping the data
> encapsulated (so you can try your block compression idea, I can try out
> primitives, etc, but the interface remains the same).
>>> Memory consumption (see above), these are aggregate members for all the
>> Ok, I'll see what it looks like if this data is moved to something like a
>> VertexState object attached to the GraphMapper, which all the vertexes can
>> have a reference to.
>> As I've thought more about primitives vs objects, I think the object
>> flexibility is quite important. The page rank example could probably get
>> away with primitives, but other algorithms will likely require objects for
>> edge values, message values, and vertex values (i.e. maybe storing the
>> inlinks, or a bunch of different values i.e. multiple personalized page
>> ranks run simultaneously). I guess you're thinking that Giraph will have
>> two separate implementations? One that is primitives based and the other
>> that is object based?
> I'm thinking that with the right interface (like discussed above), you can
> have the same base interface, but yeah, for the particular case of
> implementers of BaseVertex<I,V,E,M> where all of I,V, and E are wrappers of
> primitives, that there is some nice memory savings that can be done by
> keeping them primitive (and only instantiating objects / autoboxing when
> accessing via the generic methods, when this is transiently done).
> PageRank isn't the only example where I'd want to get the (suspected) perf
> boost of using primitives, as most cases where I've dealt with graphs
> everything gets normalized at some point - the input features all get
> eventually turned into an edge "weight" of some kind, the vertexes
> themselves maybe keep some small data with them, but the edges just look
> like a target vertex id and an edge weight. For example, for social graphs,
> you can imagine lots and lots of fancy data you associate with users (geo,
> language, account freshness, recent text, topical interests, etc...), and
> lots of things to associate with the edges (there are many ways users can
> interact, beyond the explicit "x is connected to y" way), but when you want
> to run some big monstrous computation, this data is condensed into some
> fixed final "connection strength" combination of weights. On the other
> hand, to actually *compute* that connection weight, maybe non-local
> information gleaned from a graph algorithm would be nice. Similarly,
> computing a nice big topic-sensitive pagerank might require a bunch of
> topic-weight metadata at the vertexes.
> I don't know why I'm arguing this - I agree with you, keeping the *ability*
> to do object stuff is important, yes. I'm not advocating completely
> primitivizing all of the base implementations. I'm just suggesting that it
> be added, as that's a pretty common use case which could benefit from some
> low-hanging fruit memory savings.
> Sounds right to me, just wanted to make sure that I was understanding
> correctly what you wanted to do.
Dmitriy V Ryaboy