Haha, this really is turning into a movie =). I'll start warming up the
popcorn.
On 9/7/11 12:51 PM, Jake Mannix wrote:
On Wed, Sep 7, 2011 at 6:33 AM, Avery Ching <ach...@apache.org
<mailto:ach...@apache.org>> wrote:
I'm not sure that this is precisely the right API, but exposing
the inner SortedMap definitely has a "leaky abstraction" smell to
it to me, especially where there are no examples or algorithm
implementations in the codebase which currently *need* this to be
exposed in such a way.
We could expose just the Map and provide helper methods as you
suggest. It's a thought. You're right that it's not used in this
way in any examples.
Maybe a few more examples would help? Cases where you want to do a
BSP computation where the total sort (both the vertexes, and the edges
for each vertex) is required, as is the random access nature of the Map?
I think the range based examples are the ones that immediately come to
mind. BSP for graph processing is still pretty new, and I have no idea
what kind of interesting algorithms will be tried out on this platform.
We are still exploring many possible algorithms to run.
I can see how the Edge object is nice, it does seem like a better
api then basically just a Map<I,E>. Why not a sorted
List<Edge<I,E>>, then, where the Comparator ignores the edge
value, but only sorts by the dest vertexId? Because the contract
of List doesn't *require* the sort order to remain fixed, I
guess? This yet again underscores for me the "spidey-sense"
wrongness of exposing the SortedMap exactly as it is. Maybe just
expose the methods on a map that are expected to be needed, and
methods can be added as time goes on?
I think Map over List makes sense for the edges to find them
quickly (my personal opinion).
Yeah, if there are cases where you're doing searches for paths, this
is probably true, some form of random access is probably best.
What about something as simple as:
/*
* @return an immutable, sorted view of the target vertices
*/
public ImmutableList<I> getTargetVertices();
/*
* @return E edge for the supplied vertex (or null if there isn't an
edge to this vertex from the current one)
*/
public <E> Edge<I,E> getEdge(I vertex);
in conjunction with the already existing
public boolean addEdge(Edge<I,E> edge);
you have all the actions of a sorted, random-access map, but without
exposing the way it's stored internally. Why is there addEdge() but
not removeEdge()?
I think the idea was that after returning the map, users could directly
manipulate the map of edges or use the interfaces, there should probably
be a removeEdge() too. I'm starting to feel that we should remove Edge
from the user perspective, just keep it internally only for the add edge
requests. It just makes things a little more complex to the user (too
many ways to do the same thing). Perhaps the interfaces you specified
could hit most of the use cases (getTargetVertices(), getEdgeValue(),
addEdge(), and removeEdge()). If there turns out to be a big need, we
can always change it back to a SortedMap or something else more appropriate.
Problem 1) static mutable state (ew!) - lots of references to
Vertex.setSuperStep(), Vertex.setNumVertices(),
Vertex.setNumEdges(), Vertex.setGraphMapper(), etc... which means
that implemeting BasicVertex / MutableVertex isn't good enough,
you have to actually be a subclass of Vertex! :\
Oh, you found our dirty little secret. =) Yes, we do have static
mutable state. The concern is that if every vertex actually keeps
track of all those things as members, I think a decent amount of
memory would be wasted. I'm sure there is a cleaner way to
implement this though (i.e. Flyweight pattern perhaps), just never
got around to fixing it.
Yeah, keeping one extra 4 or 8 bytes of a reference to say, the
GraphMapper inside of each Vertex, and having the GraphMapper itself
centralize all of these generic Vertex attributes might be the way to
go. I'll see if I can whip that up as part of my primitive patch.
Cool.
Problem 2) Vertex.class is directly referenced as the base
class in places like BspUtils, etc.
We never expected to have another base Vertex class. =)
True, but in many of these places, it's just the interface type which
is required, right? BaseVertex is what they all implement, is there a
reason why this can't be what's referred to in these cases?
Most likely the easiest solution would be to pull out a
AbstractVertex class, which holds this ugly static state, and
make said state protected, instead of private, and have Vertex
and LongDoubleFloatDoubleVertex both subclass it, and change
reflection references from setClass(VERTEX_CLASS, vertexClass,
Vertex.class); to setClass(VERTEX_CLASS, vertexClass,
AbstractVertex.class);
That sounds reasonable. One thing we have thought about is
compressing the vertices in memory at the cost of some CPU. While
this is orthogonal, they are both efforts to conserve memory.
Storing the edges as primitive arrays could be viewed as a fairly
maximally efficient compression, in cases where you have
primitive-able types. I'm not sure how easy it would be to have a
generically compressible SortedMap<I, Edge<E,I>>, in a way which
allowed mutations of the map...
Still only in the thought phase, but I think the idea would be some kind
of block compression of vertices (compress say 10k vertices at a time
and keep an index).
Memory consumption (see above), these are aggregate members for
all the vertices.
Ok, I'll see what it looks like if this data is moved to something
like a VertexState object attached to the GraphMapper, which all the
vertexes can have a reference to.
As I've thought more about primitives vs objects, I think the object
flexibility is quite important. The page rank example could probably
get away with primitives, but other algorithms will likely require
objects for edge values, message values, and vertex values (i.e. maybe
storing the inlinks, or a bunch of different values i.e. multiple
personalized page ranks run simultaneously). I guess you're thinking
that Giraph will have two separate implementations? One that is
primitives based and the other that is object based?