On 9/6/11 10:05 PM, Jake Mannix wrote:


On Wed, Sep 7, 2011 at 1:02 AM, Avery Ching <ach...@apache.org <mailto:ach...@apache.org>> wrote:


    Yeah, one edge is pretty silly.  To get some real numbers, I
    should try it out with a more realistic (power-law distributed)
    bit of synthetic data.
    Agreed.


I'll see if I can write up some simple extensions of the PageRank benchmark to test this out. Added https://issues.apache.org/jira/browse/GIRAPH-26 to track it.

    Lots of questions here, I'll try my best. =)  Basically the idea
    is that as a sorted map, the user would be able to know that the
    edges are sorted by their destination ids.  Range based edge
    queries are possible (i.e. the edges with destinations from
    com.yahoo.www to com.zillow.www).  Basically this would give the
    users a bit more functionality than a basic map.  A list would
    require a full scan to find/remove an edge.


I see, that makes sense. I wonder if inverting the API a bit would make sense, instead of exposing such a concrete inner domain object like the actual SortedMap? By this, I mean: if you want to search / subselect edges by some criterion, then instead of having the caller do

  for(Edge<E,V> edge : getOutEdgeMap().values()) {
    // select edges you want to do something with
  }

  you expose to the caller:

public Collection<Edge<E,V>> filterEdges(Predicate<Edge<E,V>> filter), which the caller can use like

  for(Edge<E,V> edge : filterEdges(new Predicate<Edge<E,V>>() {
    boolean apply(Edge<E,V> e) {
return e.getVertexId().compareTo("com.zillow.www") >= 0 && e.getVertexId().compareTo("com.yahoo.www") < 0;
    }
  }) {
    // do stuff on the edges you wanted to get
  }

I'm not sure that this is precisely the right API, but exposing the inner SortedMap definitely has a "leaky abstraction" smell to it to me, especially where there are no examples or algorithm implementations in the codebase which currently *need* this to be exposed in such a way.
We could expose just the Map and provide helper methods as you suggest. It's a thought. You're right that it's not used in this way in any examples.

    As far as why not (Sorted)Map<I, E>, I believe we actually used to
    have that interface and I changed it use the Edge class.  This was
    primarily done since Edge objects would be used for edge requests
    (i.e. void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
    stores the add edge request with the Edge object in a list.  I
    think I also changed the user interfaces to use the Edge object
    since I initially thought it might make usage it a little clearer
    for users (i.e. edge.getDestVertexid() and some of the
    serialization a little simpler, but looking at it now, that might
    not be the case.  We can probably go back to Map<I, E> or
    SortedMap<I, E> to save some memory and internally use the Edge
    object to store the add edge requests.


I can see how the Edge object is nice, it does seem like a better api then basically just a Map<I,E>. Why not a sorted List<Edge<I,E>>, then, where the Comparator ignores the edge value, but only sorts by the dest vertexId? Because the contract of List doesn't *require* the sort order to remain fixed, I guess? This yet again underscores for me the "spidey-sense" wrongness of exposing the SortedMap exactly as it is. Maybe just expose the methods on a map that are expected to be needed, and methods can be added as time goes on?
I think Map over List makes sense for the edges to find them quickly (my personal opinion).

    For my present case, I can probably hack around it by making a
    virtual
    SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which
    implements that interface, yet is backed by a
    primitive map, but I'm trying to understand what the API is
    trying to support...
    Hope my explanation helped somewhat.


Yes, definitely. But I've run into some ugly ugliness when I tried to run my code:

Problem 1) static mutable state (ew!) - lots of references to Vertex.setSuperStep(), Vertex.setNumVertices(), Vertex.setNumEdges(), Vertex.setGraphMapper(), etc... which means that implemeting BasicVertex / MutableVertex isn't good enough, you have to actually be a subclass of Vertex! :\

Oh, you found our dirty little secret. =) Yes, we do have static mutable state. The concern is that if every vertex actually keeps track of all those things as members, I think a decent amount of memory would be wasted. I'm sure there is a cleaner way to implement this though (i.e. Flyweight pattern perhaps), just never got around to fixing it.
Problem 2) Vertex.class is directly referenced as the base class in places like BspUtils, etc.

We never expected to have another base Vertex class. =)
So some refactoring may be necessary to even try out this Object -> primitive test (well, my code *compiles*, but won't run on account of problems 1 and 2 above - runtime exceptions galore), even after I "faked" a SortedMap by implementing a slimmed down delegator class.

Most likely the easiest solution would be to pull out a AbstractVertex class, which holds this ugly static state, and make said state protected, instead of private, and have Vertex and LongDoubleFloatDoubleVertex both subclass it, and change reflection references from setClass(VERTEX_CLASS, vertexClass, Vertex.class); to setClass(VERTEX_CLASS, vertexClass, AbstractVertex.class);

That sounds reasonable. One thing we have thought about is compressing the vertices in memory at the cost of some CPU. While this is orthogonal, they are both efforts to conserve memory.
But the *right* solution is to remove that static stuff. Why is it there, anyways? Why doesn't GraphMapper just call
            Vertex.setSuperstep(superstep);
            Vertex.setNumVertices(serviceWorker.getTotalVertices());
            Vertex.setNumEdges(serviceWorker.getTotalEdges());
directly on each of the BasicVertex's in serviceWorker.getVertexRangeMap()?

Memory consumption (see above), these are aggregate members for all the vertices.
  -jake

Reply via email to