On 9/6/11 10:05 PM, Jake Mannix wrote:
On Wed, Sep 7, 2011 at 1:02 AM, Avery Ching <[email protected]
<mailto:[email protected]>> wrote:
Yeah, one edge is pretty silly. To get some real numbers, I
should try it out with a more realistic (power-law distributed)
bit of synthetic data.
Agreed.
I'll see if I can write up some simple extensions of the PageRank
benchmark to test this out. Added
https://issues.apache.org/jira/browse/GIRAPH-26 to track it.
Lots of questions here, I'll try my best. =) Basically the idea
is that as a sorted map, the user would be able to know that the
edges are sorted by their destination ids. Range based edge
queries are possible (i.e. the edges with destinations from
com.yahoo.www to com.zillow.www). Basically this would give the
users a bit more functionality than a basic map. A list would
require a full scan to find/remove an edge.
I see, that makes sense. I wonder if inverting the API a bit would
make sense, instead of exposing such a concrete inner domain object
like the actual SortedMap? By this, I mean: if you want to search /
subselect edges by some criterion, then instead of having the caller do
for(Edge<E,V> edge : getOutEdgeMap().values()) {
// select edges you want to do something with
}
you expose to the caller:
public Collection<Edge<E,V>> filterEdges(Predicate<Edge<E,V>>
filter), which the caller can use like
for(Edge<E,V> edge : filterEdges(new Predicate<Edge<E,V>>() {
boolean apply(Edge<E,V> e) {
return e.getVertexId().compareTo("com.zillow.www") >= 0 &&
e.getVertexId().compareTo("com.yahoo.www") < 0;
}
}) {
// do stuff on the edges you wanted to get
}
I'm not sure that this is precisely the right API, but exposing the
inner SortedMap definitely has a "leaky abstraction" smell to it to
me, especially where there are no examples or algorithm
implementations in the codebase which currently *need* this to be
exposed in such a way.
We could expose just the Map and provide helper methods as you suggest.
It's a thought. You're right that it's not used in this way in any
examples.
As far as why not (Sorted)Map<I, E>, I believe we actually used to
have that interface and I changed it use the Edge class. This was
primarily done since Edge objects would be used for edge requests
(i.e. void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
stores the add edge request with the Edge object in a list. I
think I also changed the user interfaces to use the Edge object
since I initially thought it might make usage it a little clearer
for users (i.e. edge.getDestVertexid() and some of the
serialization a little simpler, but looking at it now, that might
not be the case. We can probably go back to Map<I, E> or
SortedMap<I, E> to save some memory and internally use the Edge
object to store the add edge requests.
I can see how the Edge object is nice, it does seem like a better api
then basically just a Map<I,E>. Why not a sorted List<Edge<I,E>>,
then, where the Comparator ignores the edge value, but only sorts by
the dest vertexId? Because the contract of List doesn't *require* the
sort order to remain fixed, I guess? This yet again underscores for
me the "spidey-sense" wrongness of exposing the SortedMap exactly as
it is. Maybe just expose the methods on a map that are expected to be
needed, and methods can be added as time goes on?
I think Map over List makes sense for the edges to find them quickly (my
personal opinion).
For my present case, I can probably hack around it by making a
virtual
SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which
implements that interface, yet is backed by a
primitive map, but I'm trying to understand what the API is
trying to support...
Hope my explanation helped somewhat.
Yes, definitely. But I've run into some ugly ugliness when I tried to
run my code:
Problem 1) static mutable state (ew!) - lots of references to
Vertex.setSuperStep(), Vertex.setNumVertices(), Vertex.setNumEdges(),
Vertex.setGraphMapper(), etc... which means that implemeting
BasicVertex / MutableVertex isn't good enough, you have to actually be
a subclass of Vertex! :\
Oh, you found our dirty little secret. =) Yes, we do have static
mutable state. The concern is that if every vertex actually keeps track
of all those things as members, I think a decent amount of memory would
be wasted. I'm sure there is a cleaner way to implement this though
(i.e. Flyweight pattern perhaps), just never got around to fixing it.
Problem 2) Vertex.class is directly referenced as the base class in
places like BspUtils, etc.
We never expected to have another base Vertex class. =)
So some refactoring may be necessary to even try out this Object ->
primitive test (well, my code *compiles*, but won't run on account of
problems 1 and 2 above - runtime exceptions galore), even after I
"faked" a SortedMap by implementing a slimmed down delegator class.
Most likely the easiest solution would be to pull out a AbstractVertex
class, which holds this ugly static state, and make said state
protected, instead of private, and have Vertex and
LongDoubleFloatDoubleVertex both subclass it, and change reflection
references from setClass(VERTEX_CLASS, vertexClass, Vertex.class);
to setClass(VERTEX_CLASS, vertexClass, AbstractVertex.class);
That sounds reasonable. One thing we have thought about is compressing
the vertices in memory at the cost of some CPU. While this is
orthogonal, they are both efforts to conserve memory.
But the *right* solution is to remove that static stuff. Why is it
there, anyways? Why doesn't GraphMapper just call
Vertex.setSuperstep(superstep);
Vertex.setNumVertices(serviceWorker.getTotalVertices());
Vertex.setNumEdges(serviceWorker.getTotalEdges());
directly on each of the BasicVertex's
in serviceWorker.getVertexRangeMap()?
Memory consumption (see above), these are aggregate members for all the
vertices.
-jake