Giraph currently uses a lot of memory, but we're working on it in a few JIRAs.
That being said, there are a few things that you can do to get some fairly
large data sets going.
A 64-bit JVM for your task trackers is much better if you have one; otherwise
you are limited to 4 GB (like me).
I was able to run the org.apache.giraph.benchmark.PageRankBenchmark with 300
workers and 1 billion vertices, each with a single edge, for 20 supersteps.
Here are the parameters I used for our configuration:
hadoop jar giraph-0.70-jar-with-dependencies.jar
ss64k" -Dmapred.job.map.memory.mb=4608 -Dgiraph.checkpointFrequency=0
-Dgiraph.pollAttempts=20 -e 1 -s 20 -v -V 1000000000 -w 300
Your parameters will likely vary based on how much memory you have and your
Hadoop configuration. Our machines have 16 GB I think, but I only get 4 GB per
JVM due to the 32-bit limit. Setting mapred.job.map.memory.mb lets each map
task claim more than one map slot on a node, which gives it more memory than a
single slot would. Using -Xss to reduce the thread stack size will help a LOT.
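
To make the arithmetic behind those two settings concrete, here is a rough
sketch. The per-slot memory (1536 MB) and the thread count (600) are made-up
numbers for illustration, not values from our cluster, and the 512 KB "before"
stack size is also just an assumption. The 64 KB size is what the -Xss fragment
in the command above suggests.

```java
/** Back-of-the-envelope memory arithmetic; all inputs in main() are illustrative assumptions. */
final class MemoryBudget {
    /** Map slots a task occupies when it requests more memory than one slot provides. */
    static int slotsNeeded(int requestedMb, int slotMb) {
        return (requestedMb + slotMb - 1) / slotMb; // ceiling division
    }

    /** Bytes reserved for thread stacks: one stack of stackKb kilobytes per thread. */
    static long stackBytes(int threads, int stackKb) {
        return (long) threads * stackKb * 1024L;
    }

    public static void main(String[] args) {
        int slotMb = 1536;  // hypothetical memory per map slot
        int threads = 600;  // hypothetical number of threads in one worker JVM
        System.out.println("slots per task: " + slotsNeeded(4608, slotMb));
        for (int kb : new int[] {512, 64}) {
            double mb = stackBytes(threads, kb) / (1024.0 * 1024.0);
            System.out.printf("-Xss%dk: %.1f MB reserved for stacks%n", kb, mb);
        }
    }
}
```

With these made-up inputs, 600 threads reserve 300 MB of stack at 512 KB each
but only 37.5 MB at 64 KB each, which is a big difference when the whole
process has to fit under a 32-bit limit.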
Another thing that could cause memory issues is an imbalance in the input data
across the input splits (until JIRA
https://issues.apache.org/jira/browse/GIRAPH-11 is resolved). Hopefully your
input splits are fairly balanced; if not, you might want to rebalance them for
now.
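
One quick way to check for this kind of imbalance is to compare the largest
split against the average. The sketch below is only an illustration: the
SplitBalance class and its skew method are hypothetical helpers, not part of
Giraph or Hadoop, and you would have to collect the split sizes yourself.

```java
import java.util.Arrays;

/** Hypothetical helper: measures how uneven a set of input split sizes is. */
final class SplitBalance {
    /**
     * Ratio of the largest split to the mean split size.
     * 1.0 means perfectly balanced; larger values mean one worker gets
     * proportionally more data than the average worker.
     */
    static double skew(long[] splitBytes) {
        if (splitBytes.length == 0) {
            throw new IllegalArgumentException("need at least one split");
        }
        long max = Arrays.stream(splitBytes).max().getAsLong();
        double mean = Arrays.stream(splitBytes).average().getAsDouble();
        return mean == 0.0 ? 1.0 : max / mean; // all-empty splits count as balanced
    }

    public static void main(String[] args) {
        long[] sizes = {256L << 20, 256L << 20, 1024L << 20, 256L << 20}; // made-up sizes
        System.out.println("skew = " + skew(sizes));
    }
}
```

A skew well above 1.0 would suggest that the worker holding the largest split
is the one most likely to run out of memory first.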
We haven't investigated memory improvements from using primitives versus
objects. I'm curious myself to see how much extra memory we are using as the
cost of flexibility. That being said, I think that flexibility is pretty
important for users, and I'm not sure how to maintain both choices nicely.
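
As a rough illustration of the trade-off, here is a minimal sketch of what a
primitive-backed edge list could look like next to a boxed TreeMap. The
IntDoubleEdges class is hypothetical and is not Giraph code. It assumes int
vertex ids and double edge weights, and it gives up the type flexibility in
exchange for two flat arrays.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical primitive edge list: sorted int targets with parallel double weights. */
final class IntDoubleEdges {
    private final int[] targets;    // destination vertex ids, ascending
    private final double[] weights; // weights[i] is the weight of the edge to targets[i]

    /** Copies a boxed map into two flat arrays; TreeMap iterates in ascending key order. */
    IntDoubleEdges(TreeMap<Integer, Double> boxed) {
        targets = new int[boxed.size()];
        weights = new double[boxed.size()];
        int i = 0;
        for (Map.Entry<Integer, Double> e : boxed.entrySet()) {
            targets[i] = e.getKey();
            weights[i] = e.getValue();
            i++;
        }
    }

    int size() {
        return targets.length;
    }

    /** Returns the weight of the edge to target, or NaN if there is no such edge. */
    double weight(int target) {
        int idx = Arrays.binarySearch(targets, target);
        return idx >= 0 ? weights[idx] : Double.NaN;
    }

    /** Array payload only: 4 bytes per int id plus 8 bytes per double weight. */
    long payloadBytes() {
        return targets.length * 12L;
    }
}
```

The arrays hold 12 bytes of payload per edge, while the boxed map keeps a map
entry plus a boxed key and a boxed value per edge, each with its own object
header. The downside is that this version is read-only and tied to one pair of
types, which is exactly the flexibility question above.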
I'm glad to hear you're trying out Giraph at Twitter. Please keep us aware of
any problems you run into and we'll try to help.
On Sep 5, 2011, at 10:49 PM, Jake Mannix wrote:
> Greetings Giraphians!
> I'm trying out some simple pagerank tests of Giraph on our cluster
> here at Twitter, and I'm wondering what the data-size blow-up is usually
> expected to be for the on-disk to in-memory graph representation. I tried
> running a pretty tiny (a single part-file, 2GB big, which had 8 splits)
> SequenceFile of my own binary data (if you're curious, it's a Mahout
> SequenceFile<IntWritable, VectorWritable>), which stores the data pretty
> minimally - on-disk primitive int "vertex id", target vertex id also just an
> int, and the edges have only an 8byte double as payload.
> But we've got 3GB of RAM for our mappers, and some of my 8 workers are
> running out of memory. Even if the *entire* part file was in one split, it's
> only 2GB on disk, so I'm wondering how much attention has been paid to memory
> usage in the abstract base class org.apache.giraph.graph.Vertex? It looks
> like, on account of being very flexible in terms of types for the vertices
> and edges, keeping a big TreeMap means each int-double pair (dest vertex id +
> edge weight) is getting turned into a bunch of java objects, and this is
> where the blow-up is coming from?
> I wonder if a few special purpose java primitive MutableVertex
> implementations would be useful for me to contribute to conserve a bit of
> memory? If I'm mistaken in my assumptions here (or there is already work
> done on this), just let me know. But if not, I'd love to help get Giraph
> running on some nice beefy data sets (with simplistic data models: vertex ids
> being simply ints / longs, and edge weights and messages to pass being
> similarly just booleans, floats, or doubles), because I've got some stuff I'd
> love to throw in memory and crank some distributed computations on. :)
> - jake / @pbrane