On 9/6/11 3:27 PM, Jake Mannix wrote:
(changing thread title to reflect current discussion topic)
On Tue, Sep 6, 2011 at 8:49 AM, Avery Ching <[email protected]
<mailto:[email protected]>> wrote:
Answers are inlined. No vacation for you this weekend I guess =).
It was a good "vacation" :)
Which JIRAs?
https://issues.apache.org/jira/browse/GIRAPH-11 - Balancing memory
consumption among workers
https://issues.apache.org/jira/browse/GIRAPH-12 - Communication
threads consuming too much memory
GIRAPH-12 is interesting. That communication could also be possibly
handled by something like Finagle
<https://github.com/twitter/finagle> ("A fault tolerant,
protocol-agnostic RPC system" based on Netty [which I see is already
under consideration], written in scala, but with very mature java
bindings too). We use it internally at Twitter for clusters of
mid-tier servers which have many dozens of machines talking to
hundreds of other machines, without blowing up on thread-stack or
using a gazillion threads. It's mavenized, so it's easy to try out.
We should definitely look at it, thanks for the suggestion. I've added
it to the JIRA.
In the current state, I'd definitely be worried about running with
hundreds of mappers, given the number of threads which would be used...
I agree, although I still have gotten up to 500 workers to run
simultaneously with testing and a reduced thread stack size. Can't wait
to see how big we can go with this issue resolved.
I was able to run the
org.apache.giraph.benchmark.PageRankBenchmark with 300
workers and 1 billion vertices with a single edge and 20
supersteps. Here's the parameters I used for our configuration:
1B vertices with just *one* edge? What kind of graph would this be??
Very lame, I agree. Just for numbers =). Once some of the memory
issues are worked out, we'll test more connected graphs.
Yeah, one edge is pretty silly. To get some real numbers, I should
try it out with a more realistic (power-law distributed) bit of
synthetic data.
Agreed.
That JIRA ticket seems to be talking about sorted inputs - is
this a requirement (that vertex ids are sorted in each split), or
does this just make some things run much better? What does this
have to do with balanced input splits?
Yes, currently inputs must be sorted (requirement) until GIRAPH-12
is finished. Balancing input splits (memory consumed per input
split) will help keep the amount of memory similar on all workers
and assuming a homogenous worker set, this will allow for larger
data sets to be fit into memory.
I see, since you're effectively limited by the size of the biggest
split in this case, if you're not careful.
I guess I'm not sure whether you *need* to give up the OO API,
it's really nice to have completely strongly typed graph
primitives. I'll just see if I can implement the same API using
internal data structures which are nice and compact (as a
performance improvement only) which in the case where all of the
type parameters are primitive-able. So for example, something
like PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable,
DoubleWritable> implements MutableVertex<LongWritable,
DoubleWritable, FloatWritable, DoubleWritable>. Still
API-compatible, but internally takes advantage of the fact that
all types are wrapped primitives.
Interesting idea, would like to see the results of this experiment.
So I've started coding it up, and I ran into an API stumbling block,
which might be avoidable:
in the interface BasicVertex:
SortedMap<I, Edge<I, E>> getOutEdgeMap();
Where is this needed, in this form?
Lots of questions here, I'll try my best. =) Basically the idea is that
as a sorted map, the user would be able to know that the edges are
sorted by their destination ids. Range based edge queries are possible
(i.e. the edges with destinations from com.yahoo.www to
com.zillow.www). Basically this would give the users a bit more
functionality than a basic map. A list would require a full scan to
find/remove an edge.
The only calls I see to this method are of the form
"getOutEdgeMap().size()", and
"for(Edge<> edge : getOutEdgeMap().values()) {...}".
What exactly is the intended use of this map? It's a map from target
vertex id to that exact vertex and it's edge value.
Why is it not just a List<Edge<I,E>>?
Because in theory you could be modifying the structure of the graph on
the fly,
and you don't want to walk around on that list? Then why not just
Map<I, E> ? Why is the vertex type need to be specified
twice, and why does the map need to be sorted?
As far as why not (Sorted)Map<I, E>, I believe we actually used to have
that interface and I changed it use the Edge class. This was primarily
done since Edge objects would be used for edge requests (i.e. void
addEdgeRequest(I sourceVertexId, Edge<I, E> edge) stores the add edge
request with the Edge object in a list. I think I also changed the user
interfaces to use the Edge object since I initially thought it might
make usage it a little clearer for users (i.e. edge.getDestVertexid()
and some of the serialization a little simpler, but looking at it now,
that might not be the case. We can probably go back to Map<I, E> or
SortedMap<I, E> to save some memory and internally use the Edge object
to store the add edge requests.
For my present case, I can probably hack around it by making a virtual
SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which
implements that interface, yet is backed by a
primitive map, but I'm trying to understand what the API is trying to
support...
Hope my explanation helped somewhat.
-jake