On 9/6/11 3:27 PM, Jake Mannix wrote:
(changing thread title to reflect current discussion topic)

On Tue, Sep 6, 2011 at 8:49 AM, Avery Ching <ach...@yahoo-inc.com <mailto:ach...@yahoo-inc.com>> wrote:

    Answers are inlined.  No vacation for you this weekend I guess  =).

It was a good "vacation" :)

      Which JIRAs?
    https://issues.apache.org/jira/browse/GIRAPH-11 - Balancing memory
    consumption among workers
    https://issues.apache.org/jira/browse/GIRAPH-12 - Communication
    threads consuming too much memory

GIRAPH-12 is interesting. That communication could also be possibly handled by something like Finagle <https://github.com/twitter/finagle> ("A fault tolerant, protocol-agnostic RPC system" based on Netty [which I see is already under consideration], written in scala, but with very mature java bindings too). We use it internally at Twitter for clusters of mid-tier servers which have many dozens of machines talking to hundreds of other machines, without blowing up on thread-stack or using a gazillion threads. It's mavenized, so it's easy to try out.

We should definitely look at it, thanks for the suggestion. I've added it to the JIRA.
In the current state, I'd definitely be worried about running with hundreds of mappers, given the number of threads which would be used...

I agree, although I still have gotten up to 500 workers to run simultaneously with testing and a reduced thread stack size. Can't wait to see how big we can go with this issue resolved.

        I was able to run the
        org.apache.giraph.benchmark.PageRankBenchmark with 300
        workers and 1 billion vertices with a single edge and 20
        supersteps.  Here's the parameters I used for our configuration:

    1B vertices with just *one* edge?  What kind of graph would this be??
    Very lame, I agree.  Just for numbers =).  Once some of the memory
    issues are worked out, we'll test more connected graphs.

Yeah, one edge is pretty silly. To get some real numbers, I should try it out with a more realistic (power-law distributed) bit of synthetic data.


    That JIRA ticket seems to be talking about sorted inputs - is
    this a requirement (that vertex ids are sorted in each split), or
    does this just make some things run much better?  What does this
    have to do with balanced input splits?
    Yes, currently inputs must be sorted (requirement) until GIRAPH-12
    is finished.  Balancing input splits (memory consumed per input
    split) will help keep the amount of memory similar on all workers
    and assuming a homogenous worker set, this will allow for larger
    data sets to be fit into memory.

I see, since you're effectively limited by the size of the biggest split in this case, if you're not careful.

    I guess I'm not sure whether you *need* to give up the OO API,
    it's really nice to have completely strongly typed graph
    primitives.  I'll just see if I can implement the same API using
    internal data structures which are nice and compact (as a
    performance improvement only) which in the case where all of the
    type parameters are primitive-able.  So for example, something
    like PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable,
    DoubleWritable> implements MutableVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable>.  Still
    API-compatible, but internally takes advantage of the fact that
    all types are wrapped primitives.
    Interesting idea, would like to see the results of this experiment.

So I've started coding it up, and I ran into an API stumbling block, which might be avoidable:

in the interface BasicVertex:

    SortedMap<I, Edge<I, E>> getOutEdgeMap();
Where is this needed, in this form?
Lots of questions here, I'll try my best. =) Basically the idea is that as a sorted map, the user would be able to know that the edges are sorted by their destination ids. Range based edge queries are possible (i.e. the edges with destinations from com.yahoo.www to com.zillow.www). Basically this would give the users a bit more functionality than a basic map. A list would require a full scan to find/remove an edge.
The only calls I see to this method are of the form "getOutEdgeMap().size()", and
"for(Edge<> edge : getOutEdgeMap().values()) {...}".

What exactly is the intended use of this map? It's a map from target vertex id to that exact vertex and it's edge value.
Why is it not just a List<Edge<I,E>>?
Because in theory you could be modifying the structure of the graph on the fly, and you don't want to walk around on that list? Then why not just Map<I, E> ? Why is the vertex type need to be specified
twice, and why does the map need to be sorted?
As far as why not (Sorted)Map<I, E>, I believe we actually used to have that interface and I changed it use the Edge class. This was primarily done since Edge objects would be used for edge requests (i.e. void addEdgeRequest(I sourceVertexId, Edge<I, E> edge) stores the add edge request with the Edge object in a list. I think I also changed the user interfaces to use the Edge object since I initially thought it might make usage it a little clearer for users (i.e. edge.getDestVertexid() and some of the serialization a little simpler, but looking at it now, that might not be the case. We can probably go back to Map<I, E> or SortedMap<I, E> to save some memory and internally use the Edge object to store the add edge requests.
For my present case, I can probably hack around it by making a virtual
SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> which implements that interface, yet is backed by a primitive map, but I'm trying to understand what the API is trying to support...

Hope my explanation helped somewhat.

Reply via email to