Haha, this really is turning into a movie =). I'll start warming up the popcorn.

On 9/7/11 12:51 PM, Jake Mannix wrote:


On Wed, Sep 7, 2011 at 6:33 AM, Avery Ching <ach...@apache.org <mailto:ach...@apache.org>> wrote:

      I'm not sure that this is precisely the right API, but exposing
    the inner SortedMap definitely has a "leaky abstraction" smell to
    it to me, especially where there are no examples or algorithm
    implementations in the codebase which currently *need* this to be
    exposed in such a way.
    We could expose just the Map and provide helper methods as you
    suggest.  It's a thought.  You're right that it's not used in this
    way in any examples.


Maybe a few more examples would help? Cases where you want to do a BSP computation where the total sort (both the vertexes, and the edges for each vertex) is required, as is the random access nature of the Map?
I think the range based examples are the ones that immediately come to mind. BSP for graph processing is still pretty new, and I have no idea what kind of interesting algorithms will be tried out on this platform. We are still exploring many possible algorithms to run.

    I can see how the Edge object is nice, it does seem like a better
    api then basically just a Map<I,E>.  Why not a sorted
    List<Edge<I,E>>, then, where the Comparator ignores the edge
    value, but only sorts by the dest vertexId?  Because the contract
    of List doesn't *require* the sort order to remain fixed, I
    guess?  This yet again underscores for me the "spidey-sense"
    wrongness of exposing the SortedMap exactly as it is.  Maybe just
    expose the methods on a map that are expected to be needed, and
    methods can be added as time goes on?
    I think Map over List makes sense for the edges to find them
    quickly (my personal opinion).


Yeah, if there are cases where you're doing searches for paths, this is probably true, some form of random access is probably best.

What about something as simple as:

  /*
   * @return an immutable, sorted view of the target vertices
   */
  public ImmutableList<I> getTargetVertices();

  /*
* @return E edge for the supplied vertex (or null if there isn't an edge to this vertex from the current one)
   */
  public <E> Edge<I,E> getEdge(I vertex);

in conjunction with the already existing

  public boolean addEdge(Edge<I,E> edge);

you have all the actions of a sorted, random-access map, but without exposing the way it's stored internally. Why is there addEdge() but not removeEdge()?
I think the idea was that after returning the map, users could directly manipulate the map of edges or use the interfaces, there should probably be a removeEdge() too. I'm starting to feel that we should remove Edge from the user perspective, just keep it internally only for the add edge requests. It just makes things a little more complex to the user (too many ways to do the same thing). Perhaps the interfaces you specified could hit most of the use cases (getTargetVertices(), getEdgeValue(), addEdge(), and removeEdge()). If there turns out to be a big need, we can always change it back to a SortedMap or something else more appropriate.

      Problem 1)  static mutable state (ew!) - lots of references to
    Vertex.setSuperStep(), Vertex.setNumVertices(),
    Vertex.setNumEdges(), Vertex.setGraphMapper(), etc... which means
    that implemeting BasicVertex / MutableVertex isn't good enough,
    you have to actually be a subclass of Vertex! :\
    Oh, you found our dirty little secret. =)  Yes, we do have static
    mutable state.  The concern is that if every vertex actually keeps
    track of all those things as members, I think a decent amount of
    memory would be wasted.  I'm sure there is a cleaner way to
    implement this though (i.e. Flyweight pattern perhaps), just never
    got around to fixing it.


Yeah, keeping one extra 4 or 8 bytes of a reference to say, the GraphMapper inside of each Vertex, and having the GraphMapper itself centralize all of these generic Vertex attributes might be the way to go. I'll see if I can whip that up as part of my primitive patch.
Cool.

      Problem 2) Vertex.class is directly referenced as the base
    class in places like BspUtils, etc.
    We never expected to have another base Vertex class. =)


True, but in many of these places, it's just the interface type which is required, right? BaseVertex is what they all implement, is there a reason why this can't be what's referred to in these cases?

    Most likely the easiest solution would be to pull out a
    AbstractVertex class, which holds this ugly static state, and
    make said state protected, instead of private, and have Vertex
    and LongDoubleFloatDoubleVertex both subclass it, and change
    reflection references from setClass(VERTEX_CLASS, vertexClass,
    Vertex.class); to setClass(VERTEX_CLASS, vertexClass,
    AbstractVertex.class);
    That sounds reasonable.  One thing we have thought about is
    compressing the vertices in memory at the cost of some CPU.  While
    this is orthogonal, they are both efforts to conserve memory.


Storing the edges as primitive arrays could be viewed as a fairly maximally efficient compression, in cases where you have primitive-able types. I'm not sure how easy it would be to have a generically compressible SortedMap<I, Edge<E,I>>, in a way which allowed mutations of the map...
Still only in the thought phase, but I think the idea would be some kind of block compression of vertices (compress say 10k vertices at a time and keep an index).

    Memory consumption (see above), these are aggregate members for
    all the vertices.


Ok, I'll see what it looks like if this data is moved to something like a VertexState object attached to the GraphMapper, which all the vertexes can have a reference to.
As I've thought more about primitives vs objects, I think the object flexibility is quite important. The page rank example could probably get away with primitives, but other algorithms will likely require objects for edge values, message values, and vertex values (i.e. maybe storing the inlinks, or a bunch of different values i.e. multiple personalized page ranks run simultaneously). I guess you're thinking that Giraph will have two separate implementations? One that is primitives based and the other that is object based?

Reply via email to