Re: Primitives vs Objects (the Movie!)

Avery Ching Tue, 06 Sep 2011 18:03:02 -0700

On 9/6/11 3:27 PM, Jake Mannix wrote:

(changing thread title to reflect current discussion topic)
On Tue, Sep 6, 2011 at 8:49 AM, Avery Ching <[email protected]<mailto:[email protected]>> wrote:
    Answers are inlined.  No vacation for you this weekend I guess  =).


It was a good "vacation" :)
      Which JIRAs?
    https://issues.apache.org/jira/browse/GIRAPH-11 - Balancing memory
    consumption among workers
    https://issues.apache.org/jira/browse/GIRAPH-12 - Communication
    threads consuming too much memory
GIRAPH-12 is interesting. That communication could also be possiblyhandled by something like Finagle<https://github.com/twitter/finagle> ("A fault tolerant,protocol-agnostic RPC system" based on Netty [which I see is alreadyunder consideration], written in scala, but with very mature javabindings too). We use it internally at Twitter for clusters ofmid-tier servers which have many dozens of machines talking tohundreds of other machines, without blowing up on thread-stack orusing a gazillion threads. It's mavenized, so it's easy to try out.

We should definitely look at it, thanks for the suggestion. I've addedit to the JIRA.

In the current state, I'd definitely be worried about running withhundreds of mappers, given the number of threads which would be used...

I agree, although I still have gotten up to 500 workers to runsimultaneously with testing and a reduced thread stack size. Can't waitto see how big we can go with this issue resolved.

        I was able to run the
        org.apache.giraph.benchmark.PageRankBenchmark with 300
        workers and 1 billion vertices with a single edge and 20
        supersteps.  Here's the parameters I used for our configuration:


    1B vertices with just *one* edge?  What kind of graph would this be??

    Very lame, I agree.  Just for numbers =).  Once some of the memory
    issues are worked out, we'll test more connected graphs.

Yeah, one edge is pretty silly. To get some real numbers, I shouldtry it out with a more realistic (power-law distributed) bit ofsynthetic data.


Agreed.

    That JIRA ticket seems to be talking about sorted inputs - is
    this a requirement (that vertex ids are sorted in each split), or
    does this just make some things run much better?  What does this
    have to do with balanced input splits?

    Yes, currently inputs must be sorted (requirement) until GIRAPH-12
    is finished.  Balancing input splits (memory consumed per input
    split) will help keep the amount of memory similar on all workers
    and assuming a homogenous worker set, this will allow for larger
    data sets to be fit into memory.

I see, since you're effectively limited by the size of the biggestsplit in this case, if you're not careful.

    I guess I'm not sure whether you *need* to give up the OO API,
    it's really nice to have completely strongly typed graph
    primitives.  I'll just see if I can implement the same API using
    internal data structures which are nice and compact (as a
    performance improvement only) which in the case where all of the
    type parameters are primitive-able.  So for example, something
    like PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable,
    DoubleWritable> implements MutableVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable>.  Still
    API-compatible, but internally takes advantage of the fact that
    all types are wrapped primitives.

    Interesting idea, would like to see the results of this experiment.

So I've started coding it up, and I ran into an API stumbling block,which might be avoidable:


in the interface BasicVertex:

    SortedMap<I, Edge<I, E>> getOutEdgeMap();
Where is this needed, in this form?

Lots of questions here, I'll try my best. =) Basically the idea is thatas a sorted map, the user would be able to know that the edges aresorted by their destination ids. Range based edge queries are possible(i.e. the edges with destinations from com.yahoo.www tocom.zillow.www). Basically this would give the users a bit morefunctionality than a basic map. A list would require a full scan tofind/remove an edge.

The only calls I see to this method are of the form"getOutEdgeMap().size()", and
"for(Edge<> edge : getOutEdgeMap().values()) {...}".
What exactly is the intended use of this map? It's a map from targetvertex id to that exact vertex and it's edge value.
Why is it not just a List<Edge<I,E>>?
Because in theory you could be modifying the structure of the graph onthe fly,and you don't want to walk around on that list? Then why not justMap<I, E> ? Why is the vertex type need to be specified
twice, and why does the map need to be sorted?

As far as why not (Sorted)Map<I, E>, I believe we actually used to havethat interface and I changed it use the Edge class. This was primarilydone since Edge objects would be used for edge requests (i.e. voidaddEdgeRequest(I sourceVertexId, Edge<I, E> edge) stores the add edgerequest with the Edge object in a list. I think I also changed the userinterfaces to use the Edge object since I initially thought it mightmake usage it a little clearer for users (i.e. edge.getDestVertexid()and some of the serialization a little simpler, but looking at it now,that might not be the case. We can probably go back to Map<I, E> orSortedMap<I, E> to save some memory and internally use the Edge objectto store the add edge requests.

For my present case, I can probably hack around it by making a virtual
SortedMap<LongWritable, Edge<LongWritable,FloatWritable>> whichimplements that interface, yet is backed by aprimitive map, but I'm trying to understand what the API is trying tosupport...

Hope my explanation helped somewhat.

  -jake

Re: Primitives vs Objects (the Movie!)

Reply via email to