> Giraph currently uses a lot of memory, but we're working on it in a few
> JIRAs.  That being said, there are a few things that you can do to get some
> fairly large data sets going.

  Which JIRAs?

> If you have a 64-bit JVM for your task trackers, that is much better,
> otherwise you are limited to 4 GB (like me).

  Limiting each mapper to 4GB is fine, because in theory, most clusters run
with totalRAMperBox / numCores < 4GB anyways (certainly true for ours).

What happens in Giraph when multiple mappers are on the same physical box,
do they still communicate via RPC?

> I was able to run the org.apache.giraph.benchmark.PageRankBenchmark with 300
> workers and 1 billion vertices with a single edge and 20 supersteps.  Here's
> the parameters I used for our configuration:

1B vertices with just *one* edge?  What kind of graph would this be???

The PageRankBenchmark code runs with synthetic graph data it generates on
the fly, right?

I'm having it read the graph data from HDFS, where I can see how big it is
on disk, into RAM, by subclassing SimplePageRankVertex.  So my graph may be
a bit poorly balanced (I'll add some logging to see).

> hadoop jar giraph-0.70-jar-with-dependencies.jar
> org.apache.giraph.benchmark.PageRankBenchmark
> -Dgiraph.totalInputSplitMultiplier=0.001
> -Dmapred.child.java.opts="-Xms1800m -Xmx1800m -Xss64k"
> -Dmapred.job.map.memory.mb=4608 -Dgiraph.checkpointFrequency=0
> -Dgiraph.pollAttempts=20 -e 1 -s 20 -v -V 1000000000 -w 300
>
> Your parameters will likely vary based on how much memory you have and your
> Hadoop configuration.  Our machines have 16 GB I think, but I only have 4 GB
> due to the 32-bit limit.  Using mapred.job.map.memory.mb allows me to steal
> more map slots per node to give me more memory per map slot.  -Xss to reduce
> the thread stack size will help a LOT.

I'll try to see if -Xss64k helps, thanks.  We typically run with 3GB heap
per mapper, but they're beefy machines, so this is really what everyone gets
(a bit overkill, probably, but we have some folk running pretty memory
intensive tasks...)
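
For a rough feel of why -Xss could matter on a 32-bit JVM, here is a
back-of-the-envelope sketch.  The thread count and the "default" stack size
are my own assumptions (hypothetical numbers, not measured from Giraph), and
stack space is reserved address space rather than memory that is necessarily
touched, so treat the result as an upper bound:

```java
// Sketch only: estimates address space reserved for thread stacks.
// The thread count and default stack size below are assumptions.
final class StackReservation {

    /** One stack of stackBytes is reserved per thread. */
    static long reservedBytes(int threads, long stackBytes) {
        return (long) threads * stackBytes;
    }

    public static void main(String[] args) {
        int threads = 300;                // assumption: about one thread per peer worker
        long assumedDefault = 512 * 1024L; // assumption: real defaults vary by JVM and platform
        long reduced = 64 * 1024L;         // -Xss64k, as in the quoted command

        long mb = 1024 * 1024L;
        System.out.println("assumed default: "
                + reservedBytes(threads, assumedDefault) / mb + " MB");
        System.out.println("-Xss64k:         "
                + reservedBytes(threads, reduced) / mb + " MB");
    }
}
```

With only 4 GB of address space to go around, anything reserved for stacks
is space the heap can't have, which is presumably why the smaller -Xss helps.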

> Another thing that could cause memory issues is an imbalance in the input
> data across the input splits (until JIRA
> https://issues.apache.org/jira/browse/GIRAPH-11 is resolved).  Hopefully
> each input split is fairly balanced; otherwise, you might want to
> rebalance the input splits for now.

That JIRA ticket seems to be talking about sorted inputs - is this a
requirement (that vertex ids are sorted in each split), or does this just
make some things run much better?  What does this have to do with balanced
input splits?

> We haven't investigated memory improvements using primitives versus
> objects, I'm curious myself to see how much extra memory we are using at the
> cost of flexibility.  That being said, I think that flexibility is pretty
> important for users and I'm not sure how to maintain both choices nicely.

In Mahout, we've had to spend a fair amount of time early on to trim down
all of our java objects, and live in a world where a lot of the time, all we
have are arrays of primitives.  It's helped quite a bit with performance,
but it's not really that limiting, actually, as long as you follow the "one
additional layer of indirection" tactic: translate all of your static state
into "ids" of some sort (i.e. normalize your data), so things like Strings
get turned into int termIds, ditto for various other Feature objects.  It
does require keeping track of a dictionary at the end of the day, to
translate all of your internal ids back into User objects, or Documents, etc.
But this is what is done in Lucene and databases anyways.
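
To make the "one additional layer of indirection" idea concrete, here is a
minimal sketch of such a dictionary.  TermDictionary and its method names are
hypothetical, made up for this mail; this is not Mahout's or Lucene's actual
class, just the general shape of the technique:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a hypothetical two-way dictionary between Strings and dense int ids.
final class TermDictionary {
    private final Map<String, Integer> idsByTerm = new HashMap<>();
    private final List<String> termsById = new ArrayList<>();

    /** Returns the id already assigned to term, or assigns the next dense id. */
    int idFor(String term) {
        Integer id = idsByTerm.get(term);
        if (id == null) {
            id = termsById.size();
            idsByTerm.put(term, id);
            termsById.add(term);
        }
        return id;
    }

    /** Translates an internal id back into the original term. */
    String termFor(int id) {
        return termsById.get(id);
    }

    /** Encodes a sequence of terms as a primitive int array. */
    int[] encode(String... terms) {
        int[] ids = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            ids[i] = idFor(terms[i]);
        }
        return ids;
    }

    public static void main(String[] args) {
        TermDictionary dict = new TermDictionary();
        int[] ids = dict.encode("graph", "vertex", "graph");
        // The heavy computation works on ids only; Strings come back at the very end.
        for (int id : ids) {
            System.out.println(id + " -> " + dict.termFor(id));
        }
    }
}
```

The dictionary is the only place that holds the objects; everything in the
inner loops is an int[].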

I guess I'm not sure whether you *need* to give up the OO API; it's really
nice to have completely strongly typed graph primitives.  I'll just see if I
can implement the same API using internal data structures which are nice and
compact (as a performance improvement only) in the case where all of
the type parameters are primitive-able.  So for example, something like
PrimitiveVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable>
implements MutableVertex<LongWritable, DoubleWritable, FloatWritable,
DoubleWritable>.  Still API-compatible, but internally takes advantage of
the fact that all types are wrapped primitives.
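
Roughly what I have in mind for the internals, as a sketch: parallel
primitive arrays kept sorted by target id, instead of a map full of objects.
PrimitiveEdges is a hypothetical name, and I've left out the Giraph and
Hadoop types entirely (a real PrimitiveVertex would wrap values into
LongWritable / FloatWritable only at the API boundary):

```java
import java.util.Arrays;
import java.util.NoSuchElementException;

// Sketch only: out-edges of one vertex as sorted parallel primitive arrays.
final class PrimitiveEdges {
    private long[] targets = new long[4];
    private float[] weights = new float[4];
    private int size = 0;

    /** Adds the edge to target, or overwrites its weight if it already exists. */
    void put(long target, float weight) {
        int pos = Arrays.binarySearch(targets, 0, size, target);
        if (pos >= 0) {
            weights[pos] = weight;
            return;
        }
        int insertAt = -(pos + 1);
        if (size == targets.length) {
            // Grow both arrays together so indexes stay aligned.
            targets = Arrays.copyOf(targets, size * 2);
            weights = Arrays.copyOf(weights, size * 2);
        }
        // Shift the tail right by one to keep targets sorted.
        System.arraycopy(targets, insertAt, targets, insertAt + 1, size - insertAt);
        System.arraycopy(weights, insertAt, weights, insertAt + 1, size - insertAt);
        targets[insertAt] = target;
        weights[insertAt] = weight;
        size++;
    }

    boolean hasEdge(long target) {
        return Arrays.binarySearch(targets, 0, size, target) >= 0;
    }

    float weight(long target) {
        int pos = Arrays.binarySearch(targets, 0, size, target);
        if (pos < 0) {
            throw new NoSuchElementException("no edge to " + target);
        }
        return weights[pos];
    }

    int size() {
        return size;
    }
}
```

Inserts are O(n) because of the shift, so this trades mutation speed for
compactness; that seems fine when the graph is loaded once and then mostly
read, but it is a trade-off to measure rather than assume.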

> I'm glad to hear you're trying out Giraph at Twitter.  Please keep us aware
> of any problems you run into and we'll try to help.

Definitely, thanks.  We've got some relatively big graphs, I'd be happy to
report our "stress-testing" of this project. :)


> > Greetings Giraphians!
> >
> >   I'm trying out some simple pagerank tests of Giraph on our cluster
> > here at Twitter, and I'm wondering what the data-size blow-up is usually
> > expected to be for the on-disk to in-memory graph representation.  I tried
> > running a pretty tiny (a single part-file, 2GB big, which had 8 splits)
> > SequenceFile of my own binary data (if you're curious, it's a Mahout
> > SequenceFile<IntWritable, VectorWritable>), which stores the data pretty
> > minimally - on-disk primitive int "vertex id",  target vertex id also just
> > an int, and the edges have only an 8byte double as payload.
> >
> >   But we've got 3GB of RAM for our mappers, and some of my 8 workers are
> > running out of memory.  Even if the *entire* part file was in one split,
> > it's only 2GB on disk, so I'm wondering how much attention has been paid to
> > memory usage in the abstract base class org.apache.giraph.graph.Vertex?  It
> > looks like, on account of being very flexible in terms of types for the
> > vertices and edges, keeping a big TreeMap means each int-double pair (dest
> > vertex id + edge weight) is getting turned into a bunch of java objects, and
> > this is where the blow-up is coming from?
> >
> >   I wonder if a few special purpose java primitive MutableVertex
> > implementations would be useful for me to contribute to conserve a bit of
> > memory?  If I'm mistaken in my assumptions here (or there is already work
> > done on this), just let me know.  But if not, I'd love to help get Giraph
> > running on some nice beefy data sets (with simplistic data models: vertex
> > ids being simply ints / longs, and edge weights and messages to pass being
> > similarly just booleans, floats, or doubles), because I've got some stuff
> > I'd love to throw in memory and crank some distributed computations on. :)
> >
> >   - jake / @pbrane

