I am going to buck the trend and not inline my thoughts, this is getting a little too thready :)
Methinks you will want an updateEdgeValue(), too. D On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching <[email protected]> wrote: > On 9/7/11 3:00 PM, Jake Mannix wrote: > > On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching <[email protected]> wrote: >> >> Haha, this really is turning into a movie =). I'll start warming up the >> popcorn. > > Yeah, I've got my co-workers wondering if I'm going to ship any actual > production code *inside* the company this week, at this rate (*shhhhh* > Dmitriy don't tell!) > > > It's only Wednesday...=) > >> On 9/7/11 12:51 PM, Jake Mannix wrote: >> >> Maybe a few more examples would help? Cases where you want to do a BSP >> computation where the total sort (both the vertexes, and the edges for each >> vertex) is required, as is the random access nature of the Map? >> >> I think the range based examples are the ones that immediately come to >> mind. BSP for graph processing is still pretty new, and I have no idea what >> kind of interesting algorithms will be tried out on this platform. We are >> still exploring many possible algorithms to run. > > Ok, cool. I can see how wanting flexibility is important. > >> >> I think the idea was that after returning the map, users could directly >> manipulate the map of edges or use the interfaces, there should probably be >> a removeEdge() too. I'm starting to feel that we should remove Edge from >> the user perspective, just keep it internally only for the add edge >> requests. It just makes things a little more complex to the user (too many >> ways to do the same thing). Perhaps the interfaces you specified could hit >> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), and >> removeEdge()). If there turns out to be a big need, we can always change it >> back to a SortedMap or something else more appropriate. > > > getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge() > sound like the right level of flexibility while keeping the data > encapsulated (so you can try your block compression idea, I can try out > primitives, etc, but the interface remains the same). >>> >>> Memory consumption (see above), these are aggregate members for all the >>> vertices. >> >> Ok, I'll see what it looks like if this data is moved to something like a >> VertexState object attached to the GraphMapper, which all the vertexes can >> have a reference to. >> >> As I've thought more about primitives vs objects, I think the object >> flexibility is quite important. The page rank example could probably get >> away with primitives, but other algorithms will likely require objects for >> edge values, message values, and vertex values (i.e. maybe storing the >> inlinks, or a bunch of different values i.e. multiple personalized page >> ranks run simultaneously). I guess you're thinking that Giraph will have >> two separate implementations? One that is primitives based and the other >> that is object based? > > I'm thinking that with the right interface (like discussed above), you can > have the same base interface, but yeah, for the particular case of > implementers of BaseVertex<I,V,E,M> where all of I,V, and E are wrappers of > primitives, that there is some nice memory savings that can be done by > keeping them primitive (and only instantiating objects / autoboxing when > accessing via the generic methods, when this is transiently done). > PageRank isn't the only example where I'd want to get the (suspected) perf > boost of using primitives, as most cases where I've dealt with graphs > everything gets normalized at some point - the input features all get > eventually turned into an edge "weight" of some kind, the vertexes > themselves maybe keep some small data with them, but the edges just look > like a target vertex id and an edge weight. For example, for social graphs, > you can imagine lots and lots of fancy data you associate with users (geo, > language, account freshness, recent text, topical interests, etc...), and > lots of things to associate with the edges (there are many ways users can > interact, beyond the explicit "x is connected to y" way), but when you want > to run some big monstrous computation, this data is condensed into some > fixed final "connection strength" combination of weights. On the other > hand, to actually *compute* that connection weight, maybe non-local > information gleaned from a graph algorithm would be nice. Similarly, > computing a nice big topic-sensitive pagerank might require a bunch of > topic-weight metadata at the vertexes. > I don't know why I'm arguing this - I agree with you, keeping the *ability* > to do object stuff is important, yes. I'm not advocating completely > primitivizing all of the base implementations. I'm just suggesting that it > be added, as that's a pretty common use case which could benefit from some > low-hanging fruit memory savings. > > Sounds right to me, just wanted to make sure that I was understanding > correctly what you wanted to do. > -- Dmitriy V Ryaboy Twitter Analytics http://twitter.com/squarecog
