Hi,

Over the last couple of weeks Daniel Kuppitz and I have been running repeated 
Gremlin traversals over the Friendster graph using SparkGraphComputer.

        
        http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#sparkgraphcomputer

The Friendster graph has 124 million vertices and 2.5 billion edges. We are 
using 3 blades, each with 39 gigs of memory and 24 cores.

        
        https://ia600500.us.archive.org/27/items/friendster-dataset-201107/README-friends.txt

We use ScriptInputFormat to read the text-based representation of the Friendster 
data and generate VertexWritables, which are (when on the wire) Gryo adjacency 
list serializations.

        http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#gremlin-kryo

Here is the amount of data that is generated:

        raw text-based input: 22.6 gigs
        Gryo graph generated from the parsed text data: 94.6 gigs
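For scale, the Gryo graph is roughly four times the size of the raw text input 
-- a quick sanity check on the figures above (illustrative arithmetic only):

```python
# Back-of-the-envelope: how much the Gryo adjacency-list format
# expands the raw text input (sizes in gigs, from the figures above).
raw_text_gigs = 22.6
gryo_gigs = 94.6

expansion = gryo_gigs / raw_text_gigs
print(round(expansion, 1))  # ~4.2x expansion from text to Gryo
```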
        
Next, here are the runtimes for the various stages:

        g.V() -- 2.5 hours
        g.V().out() -- 6 hours (+3.5 hours)
        g.V().out().out() -- 9.4 hours (+3.4 hours)
        g.V().out().out().out() -- 12.8 hours (+3.4 hours) --- SPECULATED 
(running right now -- but on target for that amount of time)
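The deltas are the point -- each additional out() adds a near-constant ~3.4 
hours, which is where the 12.8 hour estimate for the fourth stage comes from 
(illustrative arithmetic only):

```python
# Cumulative runtimes (hours) for the measured traversal stages:
# g.V(), g.V().out(), g.V().out().out().
runtimes = [2.5, 6.0, 9.4]

# The cost added by each extra out() step.
deltas = [round(b - a, 1) for a, b in zip(runtimes, runtimes[1:])]
print(deltas)  # [3.5, 3.4] -- a near-constant per-join cost

# Extrapolate one more out() at ~3.4 hours.
print(round(runtimes[-1] + 3.4, 1))  # 12.8
```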

The first computation has no incoming messages, so there is no join() involved 
and thus it's just the raw processing of VertexProgram.execute(). The others 
each incur a join for each out(). The cost of each join is the same, as this is 
the nature of the Gremlin OLAP algorithm.

        
        http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#traversalvertexprogram
                
        http://thinkaurelius.com/2012/11/11/faunus-provides-big-graph-data-analytics/
        (read the section called "The Traversal Mechanics of Faunus")
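To make those mechanics concrete, here is a toy Python sketch (invented names, 
not TinkerPop code) of how an out() step becomes a message pass: every 
traverser on a vertex sends a message over each outgoing edge, and the join 
merges the incoming traversers at their destination vertices:

```python
# Toy sketch of the Gremlin OLAP message pass behind out().
# A tiny adjacency list: vertex id -> list of out-neighbor ids.
graph = {1: [2, 3], 2: [3], 3: [1]}

def out_step(traversers):
    """One out() step: each (vertex, traverser-count) entry emits a
    message per outgoing edge; the join sums messages at each target."""
    inbox = {}
    for vertex, count in traversers.items():
        for neighbor in graph[vertex]:   # one message per out-edge
            inbox[neighbor] = inbox.get(neighbor, 0) + count
    return inbox

# g.V(): a single traverser sitting on every vertex.
traversers = {v: 1 for v in graph}
traversers = out_step(traversers)  # g.V().out()
traversers = out_step(traversers)  # g.V().out().out()
print(traversers)
```

On the Friendster graph, each such step emits one message per edge -- 2.5 
billion messages -- which is the shuffle cost discussed next.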

Here is the amount of data being shuffled on each join:

        393.5 gigs

That is how much data is created by a single message pass. That is a lot -- 
but it makes sense: an out() will send a message over each edge in the graph 
and there are 2.5 billion edges. What is the message being sent? It's a 
Traverser<Vertex>, which has a respective Gryo representation (a 
DetachedVertex).
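Dividing the shuffle volume by the edge count gives a rough per-message size 
(taking 1 gig as 10^9 bytes -- an assumption, not a measured figure):

```python
# Rough size of one serialized Traverser<Vertex> on the wire:
# shuffle volume per join divided by the number of messages (edges).
shuffle_bytes = 393.5e9  # 393.5 gigs, assuming 1 gig = 10^9 bytes
edges = 2.5e9            # an out() sends one message per edge

bytes_per_message = shuffle_bytes / edges
print(round(bytes_per_message, 1))  # ~157.4 bytes per traverser
```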

I believe that if we can get our Gryo serialization size down, we can expect 
much better runtimes. An off-the-cuff calculation estimates that we can shrink 
the Gryo file by ~8x.

        https://issues.apache.org/jira/browse/TINKERPOP3-609

That will not yield an 8x increase in speed, as VertexProgram.execute() still 
takes a prescribed amount of time that is irrespective of the serialization 
size -- but perhaps 5x+? Or perhaps more than 8x, as more data can be held in 
memory by Spark and partitions will not be constantly written to and read from 
disk.
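Hypothetically applying that ~8x factor to the figures above (a sketch, not a 
measurement):

```python
# What an ~8x Gryo size reduction would mean for the figures above.
gryo_gigs = 94.6       # current on-disk Gryo graph
shuffle_gigs = 393.5   # current data shuffled per join
factor = 8             # the estimated reduction

print(round(gryo_gigs / factor, 1))     # ~11.8 gigs on disk
print(round(shuffle_gigs / factor, 1))  # ~49.2 gigs shuffled per join
```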

Anywho -- I was IMing about this with Stephen and Daniel and decided to make 
it an email to the groups in case people have thoughts or wish to help on this 
problem.

Enjoy!,
Marko.

http://markorodriguez.com
