Something else to think about -- in some cases you may want to
implement extreme compression of the edge list. For example, we might
want to calculate n-th degree reach for all nodes in a graph. Given a
graph with a few nodes with lots of incoming edges ("biebernodes", we
call them), anything that jumps across them winds up essentially
replicating large chunks of the graph to each node. Things like RLE
might make sense when you are doing that sort of thing in memory and
only have so many terabytes of that lying around.
On Wed, Sep 7, 2011 at 3:33 PM, Avery Ching <ach...@apache.org> wrote:
> This probably should have been a JIRA =). I agree that update edge is
> probably useful as well. Maybe a Map is the right thing then...rather than
> creating lots of methods to do edge manipulation...
> On 9/7/11 3:22 PM, Dmitriy Ryaboy wrote:
>> I am going to buck the trend and not inline my thoughts, this is
>> getting a little too thready :)
>> Methinks you will want an updateEdgeValue(), too.
>> On Wed, Sep 7, 2011 at 3:14 PM, Avery Ching<ach...@apache.org> wrote:
>>> On 9/7/11 3:00 PM, Jake Mannix wrote:
>>> On Wed, Sep 7, 2011 at 9:26 PM, Avery Ching<ach...@apache.org> wrote:
>>>> Haha, this really is turning into a movie =). I'll start warming up the
>>> Yeah, I've got my co-workers wondering if I'm going to ship any actual
>>> production code *inside* the company this week, at this rate (*shhhhh*
>>> Dmitriy don't tell!)
>>> It's only Wednesday...=)
>>>> On 9/7/11 12:51 PM, Jake Mannix wrote:
>>>> Maybe a few more examples would help? Cases where you want to do a BSP
>>>> computation where the total sort (both the vertexes, and the edges for
>>>> vertex) is required, as is the random access nature of the Map?
>>>> I think the range based examples are the ones that immediately come to
>>>> mind. BSP for graph processing is still pretty new, and I have no idea
>>>> kind of interesting algorithms will be tried out on this platform. We
>>>> still exploring many possible algorithms to run.
>>> Ok, cool. I can see how wanting flexibility is important.
>>>> I think the idea was that after returning the map, users could directly
>>>> manipulate the map of edges or use the interfaces, there should probably
>>>> a removeEdge() too. I'm starting to feel that we should remove Edge
>>>> the user perspective, just keep it internally only for the add edge
>>>> requests. It just makes things a little more complex to the user (too
>>>> ways to do the same thing). Perhaps the interfaces you specified could
>>>> most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(),
>>>> removeEdge()). If there turns out to be a big need, we can always
>>>> change it
>>>> back to a SortedMap or something else more appropriate.
>>> getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()
>>> sound like the right level of flexibility while keeping the data
>>> encapsulated (so you can try your block compression idea, I can try out
>>> primitives, etc, but the interface remains the same).
>>>>> Memory consumption (see above), these are aggregate members for all the
>>>> Ok, I'll see what it looks like if this data is moved to something like
>>>> VertexState object attached to the GraphMapper, which all the vertexes
>>>> have a reference to.
>>>> As I've thought more about primitives vs objects, I think the object
>>>> flexibility is quite important. The page rank example could probably
>>>> away with primitives, but other algorithms will likely require objects
>>>> edge values, message values, and vertex values (i.e. maybe storing the
>>>> inlinks, or a bunch of different values i.e. multiple personalized page
>>>> ranks run simultaneously). I guess you're thinking that Giraph will
>>>> two separate implementations? One that is primitives based and the
>>>> that is object based?
>>> I'm thinking that with the right interface (like discussed above), you
>>> have the same base interface, but yeah, for the particular case of
>>> implementers of BaseVertex<I,V,E,M> where all of I,V, and E are wrappers
>>> primitives, that there is some nice memory savings that can be done by
>>> keeping them primitive (and only instantiating objects / autoboxing when
>>> accessing via the generic methods, when this is transiently done).
>>> PageRank isn't the only example where I'd want to get the (suspected)
>>> boost of using primitives, as most cases where I've dealt with graphs
>>> everything gets normalized at some point - the input features all get
>>> eventually turned into an edge "weight" of some kind, the vertexes
>>> themselves maybe keep some small data with them, but the edges just look
>>> like a target vertex id and an edge weight. For example, for social
>>> you can imagine lots and lots of fancy data you associate with users
>>> language, account freshness, recent text, topical interests, etc...), and
>>> lots of things to associate with the edges (there are many ways users can
>>> interact, beyond the explicit "x is connected to y" way), but when you
>>> to run some big monstrous computation, this data is condensed into some
>>> fixed final "connection strength" combination of weights. On the other
>>> hand, to actually *compute* that connection weight, maybe non-local
>>> information gleaned from a graph algorithm would be nice. Similarly,
>>> computing a nice big topic-sensitive pagerank might require a bunch of
>>> topic-weight metadata at the vertexes.
>>> I don't know why I'm arguing this - I agree with you, keeping the
>>> to do object stuff is important, yes. I'm not advocating completely
>>> primitivizing all of the base implementations. I'm just suggesting that
>>> be added, as that's a pretty common use case which could benefit from
>>> low-hanging fruit memory savings.
>>> Sounds right to me, just wanted to make sure that I was understanding
>>> correctly what you wanted to do.
Dmitriy V Ryaboy