Hi,
Over the last couple of weeks Daniel Kuppitz and I have been running repeated
Gremlin traversals over the Friendster graph using SparkGraphComputer.
http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#sparkgraphcomputer
The Friendster graph has 124 million vertices and 2.5 billion edges. We are
using three blades, each with 39 gigs of memory and 24 cores.
https://ia600500.us.archive.org/27/items/friendster-dataset-201107/README-friends.txt
We use ScriptInputFormat to read the text-based representation of the Friendster
data and generate VertexWritables, which are (when on the wire) Gryo adjacency
list serializations.
http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#gremlin-kryo
Here is the amount of data that is generated:
raw text-based input: 22.6 gigs
Gryo graph generated from the parsed text data: 94.6 gigs
Next, here are the runtimes for the various stages:
g.V() -- 2.5 hours
g.V().out() -- 6 hours (+3.5 hours)
g.V().out().out() -- 9.4 hours (+3.4 hours)
g.V().out().out().out() -- 12.8 hours (+3.4 hours) --- SPECULATED
(running right now -- but on target for that amount of time)
The first computation has no incoming messages, so there is no join() involved
and thus it's just the raw processing of VertexProgram.execute(). The others
each incur a join per out(). Each join costs about the same, as that is the
nature of the Gremlin OLAP algorithm.
http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#traversalvertexprogram
http://thinkaurelius.com/2012/11/11/faunus-provides-big-graph-data-analytics/
(read the section called The Traversal Mechanics of Faunus)
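As a rough illustration of those mechanics (this is a toy sketch, not the actual TinkerPop code -- the function and variable names are hypothetical), each out() step sends every traverser over the vertex's outgoing edges and then joins/aggregates the messages by destination vertex:

```python
# Hypothetical sketch of one out() step in the Gremlin OLAP model:
# every vertex sends its traversers over its out-edges, and a join
# groups the resulting messages by destination vertex.
from collections import defaultdict

def out_step(adjacency, traversers):
    """adjacency: vertex -> list of out-neighbors.
       traversers: vertex -> number of traversers currently at that vertex."""
    messages = defaultdict(int)
    for v, count in traversers.items():
        for neighbor in adjacency.get(v, []):  # one message per out-edge
            messages[neighbor] += count        # the "join" by destination
    return dict(messages)

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3
adjacency = {1: [2, 3], 2: [3]}
start = {1: 1, 2: 1, 3: 1}          # g.V(): one traverser per vertex
hop1 = out_step(adjacency, start)   # g.V().out() -> {2: 1, 3: 2}
```

The key point the sketch shows is that message volume scales with the edge count, not the vertex count -- which is why the shuffle numbers below are so large.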
Here is the amount of data shuffled on each join:
393.5 gigs
That is how much data is created by a single message pass. That is a lot --
well, an out() sends a message over every edge in the graph, and there are
2.5 billion edges. What is the message being sent? It's a
Traverser<Vertex>, which has a corresponding Gryo representation (a
DetachedVertex).
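For a rough sense of scale (assuming "gigs" means GiB here), that shuffle volume works out to about 170 bytes per serialized message:

```python
# Back-of-envelope: bytes per Traverser<Vertex> message on the wire.
shuffle_bytes = 393.5 * 1024**3  # 393.5 GiB shuffled per join
edges = 2.5e9                    # one message per edge for an out() step
bytes_per_message = shuffle_bytes / edges
print(round(bytes_per_message))  # -> 169
```

So each serialized DetachedVertex plus traverser overhead is on the order of 169 bytes -- which is the number the serialization work below is trying to shrink.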
I believe that if we can get our Gryo serialization size down, we can expect much
better runtimes. An "off the cuff" calculation estimates that we can shrink the
Gryo file by ~8x.
https://issues.apache.org/jira/browse/TINKERPOP3-609
That will not yield an 8x increase in speed, as VertexProgram.execute() still
takes a fixed amount of time irrespective of the serialization size -- but
perhaps 5x+? Or perhaps more than 8x?!, as more data can be held in memory by
Spark and partitions will not be constantly written to and read from disk.
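One way to frame the "perhaps 5x+" guess is Amdahl's law: if a fraction f of the runtime is serialization/shuffle-bound and that part shrinks 8x, the overall speedup is 1 / ((1 - f) + f/8). The 90% fraction below is purely an illustrative assumption, not a measurement:

```python
def speedup(f, factor=8):
    """Amdahl's law: f is the fraction of runtime that the 'factor'x
    improvement applies to (here, serialization/shuffle time)."""
    return 1.0 / ((1.0 - f) + f / factor)

# If ~90% of each stage were shuffle-bound (an assumption):
print(round(speedup(0.9), 1))  # -> 4.7
```

Anything held in memory rather than spilled to disk would push the real number above this simple model, which is where the "more than 8x" scenario comes from.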
Anywho -- I was IMing about this with Stephen and Daniel and decided to make it
an email to the groups in case people have thoughts or wish to help on this
problem.
Enjoy!,
Marko.
http://markorodriguez.com