(changing thread title to reflect current discussion topic)

On Tue, Sep 6, 2011 at 8:49 AM, Avery Ching <ach...@yahoo-inc.com> wrote:

> Answers are inlined.  No vacation for you this weekend I guess  =).
>

It was a good "vacation" :)


>   Which JIRAs?
>
> https://issues.apache.org/jira/browse/GIRAPH-11 - Balancing memory
> consumption among workers
> https://issues.apache.org/jira/browse/GIRAPH-12 - Communication threads
> consuming too much memory
>

GIRAPH-12 is interesting.  That communication could also be possibly handled
by something like Finagle <https://github.com/twitter/finagle> ("A fault
tolerant, protocol-agnostic RPC system" based on Netty [which I see is
already under consideration], written in scala, but with very mature java
bindings too).  We use it internally at Twitter for clusters of mid-tier
servers which have many dozens of machines talking to hundreds of other
machines, without blowing up on thread-stack or using a gazillion threads.
 It's mavenized, so it's easy to try out.

In the current state, I'd definitely be worried about running with hundreds
of mappers, given the number of threads which would be used...

 I was able to run the org.apache.giraph.benchmark.PageRankBenchmark with
>> 300 workers and 1 billion vertices with a single edge and 20 supersteps.
>>  Here's the parameters I used for our configuration:
>>
>
> 1B vertices with just *one* edge?  What kind of graph would this be??
>
> Very lame, I agree.  Just for numbers =).  Once some of the memory issues
> are worked out, we'll test more connected graphs.
>

Yeah, one edge is pretty silly.  To get some real numbers, I should try it
out with a more realistic (power-law distributed) bit of synthetic data.

> That JIRA ticket seems to be talking about sorted inputs - is this a
> requirement (that vertex ids are sorted in each split), or does this just
> make some things run much better?  What does this have to do with balanced
> input splits?
>
>
> Yes, currently inputs must be sorted (requirement) until GIRAPH-12 is
> finished.  Balancing input splits (memory consumed per input split) will
> help keep the amount of memory similar on all workers and assuming a
> homogenous worker set, this will allow for larger data sets to be fit into
> memory.
>

I see, since you're effectively limited by the size of the biggest split in
this case, if you're not careful.

> I guess I'm not sure whether you *need* to give up the OO API, it's really
> nice to have completely strongly typed graph primitives.  I'll just see if I
> can implement the same API using internal data structures which are nice and
> compact (as a performance improvement only) which in the case where all of
> the type parameters are primitive-able.  So for example, something like
> PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable>
> implements MutableVertex<LongWritable, DoubleWritable, FloatWritable,
> DoubleWritable>.  Still API-compatible, but internally takes advantage of
> the fact that all types are wrapped primitives.
>
> Interesting idea, would like to see the results of this experiment.
>

So I've started coding it up, and I ran into an API stumbling block, which
might be avoidable:

in the interface BasicVertex:

    SortedMap<I, Edge<I, E>> getOutEdgeMap();

Where is this needed, in this form?  The only calls I see to this method are
of the form "getOutEdgeMap().size()", and
"for(Edge<> edge : getOutEdgeMap().values()) {...}".

What exactly is the intended use of this map?  It's a map from target vertex
id to that exact vertex and it's edge value.
Why is it not just a List<Edge<I,E>>?   Because in theory you could be
modifying the structure of the graph on the fly,
and you don't want to walk around on that list?  Then why not just Map<I, E>
?  Why is the vertex type need to be specified
twice, and why does the map need to be sorted?

For my present case, I can probably hack around it by making a virtual
SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which implements
that interface, yet is backed by a
primitive map, but I'm trying to understand what the API is trying to
support...

  -jake

Reply via email to