On Wed, Sep 7, 2011 at 10:33 PM, Avery Ching <ach...@apache.org> wrote:
> This probably should have been a JIRA =). Yeah, probably! > I agree that update edge is probably useful as well. Maybe a Map is the > right thing then...rather than creating lots of methods to do edge > manipulation... > Or just Edge<I,E> getEdge(I targetVertex) instead of E getEdgeValue(I targetVertex) > > Avery > > > On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote: > >> I am going to buck the trend and not inline my thoughts, this is >> getting a little too thready :) >> >> Methinks you will want an updateEdgeValue(), too. >> >> D >> >> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<ach...@apache.org> wrote: >> >>> On 9/7/11 3:00 PM, Jake Mannix wrote: >>> >>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<ach...@apache.org> wrote: >>> >>>> Haha, this really is turning into a movie =). I'll start warming up the >>>> popcorn. >>>> >>> Yeah, I've got my co-workers wondering if I'm going to ship any actual >>> production code *inside* the company this week, at this rate (*shhhhh* >>> Dmitriy don't tell!) >>> >>> >>> It's only Wednesday...=) >>> >>> On 9/7/11 12:51 PM, Jake Mannix wrote: >>>> >>>> Maybe a few more examples would help? Cases where you want to do a BSP >>>> computation where the total sort (both the vertexes, and the edges for >>>> each >>>> vertex) is required, as is the random access nature of the Map? >>>> >>>> I think the range based examples are the ones that immediately come to >>>> mind. BSP for graph processing is still pretty new, and I have no idea >>>> what >>>> kind of interesting algorithms will be tried out on this platform. We >>>> are >>>> still exploring many possible algorithms to run. >>>> >>> Ok, cool. I can see how wanting flexibility is important. >>> >>> I think the idea was that after returning the map, users could directly >>>> manipulate the map of edges or use the interfaces, there should probably >>>> be >>>> a removeEdge() too. I'm starting to feel that we should remove Edge >>>> from >>>> the user perspective, just keep it internally only for the add edge >>>> requests. It just makes things a little more complex to the user (too >>>> many >>>> ways to do the same thing). Perhaps the interfaces you specified could >>>> hit >>>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), >>>> and >>>> removeEdge()). If there turns out to be a big need, we can always >>>> change it >>>> back to a SortedMap or something else more appropriate. >>>> >>> >>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge() >>> sound like the right level of flexibility while keeping the data >>> encapsulated (so you can try your block compression idea, I can try out >>> primitives, etc, but the interface remains the same). >>> >>>> Memory consumption (see above), these are aggregate members for all the >>>>> vertices. >>>>> >>>> Ok, I'll see what it looks like if this data is moved to something like >>>> a >>>> VertexState object attached to the GraphMapper, which all the vertexes >>>> can >>>> have a reference to. >>>> >>>> As I've thought more about primitives vs objects, I think the object >>>> flexibility is quite important. The page rank example could probably >>>> get >>>> away with primitives, but other algorithms will likely require objects >>>> for >>>> edge values, message values, and vertex values (i.e. maybe storing the >>>> inlinks, or a bunch of different values i.e. multiple personalized page >>>> ranks run simultaneously). I guess you're thinking that Giraph will >>>> have >>>> two separate implementations? One that is primitives based and the >>>> other >>>> that is object based? >>>> >>> I'm thinking that with the right interface (like discussed above), you >>> can >>> have the same base interface, but yeah, for the particular case of >>> implementers of BaseVertex<I,V,E,M> where all of I,V, and E are wrappers >>> of >>> primitives, that there is some nice memory savings that can be done by >>> keeping them primitive (and only instantiating objects / autoboxing when >>> accessing via the generic methods, when this is transiently done). >>> PageRank isn't the only example where I'd want to get the (suspected) >>> perf >>> boost of using primitives, as most cases where I've dealt with graphs >>> everything gets normalized at some point - the input features all get >>> eventually turned into an edge "weight" of some kind, the vertexes >>> themselves maybe keep some small data with them, but the edges just look >>> like a target vertex id and an edge weight. For example, for social >>> graphs, >>> you can imagine lots and lots of fancy data you associate with users >>> (geo, >>> language, account freshness, recent text, topical interests, etc...), and >>> lots of things to associate with the edges (there are many ways users can >>> interact, beyond the explicit "x is connected to y" way), but when you >>> want >>> to run some big monstrous computation, this data is condensed into some >>> fixed final "connection strength" combination of weights. On the other >>> hand, to actually *compute* that connection weight, maybe non-local >>> information gleaned from a graph algorithm would be nice. Similarly, >>> computing a nice big topic-sensitive pagerank might require a bunch of >>> topic-weight metadata at the vertexes. >>> I don't know why I'm arguing this - I agree with you, keeping the >>> *ability* >>> to do object stuff is important, yes. I'm not advocating completely >>> primitivizing all of the base implementations. I'm just suggesting that >>> it >>> be added, as that's a pretty common use case which could benefit from >>> some >>> low-hanging fruit memory savings. >>> >>> Sounds right to me, just wanted to make sure that I was understanding >>> correctly what you wanted to do. >>> >>> >> >> >