Greetings Giraphians! I'm trying out some simple PageRank tests of Giraph on our cluster here at Twitter, and I'm wondering what the data-size blow-up usually is going from the on-disk to the in-memory graph representation. I tried running a pretty tiny input: a single 2GB part-file (8 splits) of my own binary data (if you're curious, it's a Mahout SequenceFile<IntWritable, VectorWritable>), which stores the data pretty minimally - the vertex id is a primitive int, each target vertex id is also just an int, and each edge carries only an 8-byte double as its payload.
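For context, here's roughly the little harness I've been using to eyeball the input - nothing Giraph-specific, just plain Hadoop + Mahout, and the class name / path handling here are made up for illustration, so treat it as a sketch:

// Sketch: dump a few records from the input SequenceFile<IntWritable, VectorWritable>.
// Path and filesystem setup are from my local test harness; adjust for your config.
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DumpAdjacency {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);  // e.g. the 2GB part-file (local or HDFS, per your conf)
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable vertexId = new IntWritable();
    VectorWritable row = new VectorWritable();
    while (reader.next(vertexId, row)) {
      Vector edges = row.get();
      // each nonzero entry is one out-edge: index = target vertex id, value = edge weight
      Iterator<Vector.Element> it = edges.iterateNonzero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        System.out.println(vertexId.get() + " -> " + e.index() + " : " + e.get());
      }
    }
    reader.close();
  }
}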
But we've got 3GB of RAM for our mappers, and some of my 8 workers are running out of memory. Even if the *entire* part-file landed in one split, it's only 2GB on disk, so I'm wondering how much attention has been paid to memory usage in the abstract base class org.apache.giraph.graph.Vertex. It looks like, because it stays fully generic over the vertex and edge types, it keeps the out-edges in a big TreeMap, so each int-double pair (destination vertex id + edge weight) gets turned into a handful of Java objects, and that's where the blow-up is coming from - is that right?

I wonder if a few special-purpose, primitive-typed MutableVertex implementations would be useful for me to contribute, to conserve a bit of memory (rough sketch of what I have in mind below). If I'm mistaken in my assumptions here (or there is already work done on this), just let me know. But if not, I'd love to help get Giraph running on some nice beefy data sets with simplistic data models - vertex ids being simply ints / longs, and edge weights and messages to pass being similarly just booleans, floats, or doubles - because I've got some stuff I'd love to throw in memory and crank some distributed computations on. :)

- jake / @pbrane
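P.S. To make that concrete, here's the rough shape of the edge storage I have in mind. This is just a sketch of the data layout, not written against the actual Vertex/MutableVertex API, and all the names are made up:

// Sketch of primitive int -> double edge storage, instead of a
// TreeMap<IntWritable, DoubleWritable>. Hypothetical class, not Giraph API.
// Very rough back-of-envelope on a 64-bit JVM: a TreeMap.Entry plus a boxed
// IntWritable key and DoubleWritable value is on the order of 70-90 bytes per
// edge, versus 12 bytes (one int + one double) in parallel arrays like these.
import java.util.Arrays;

public class IntDoubleEdges {
  private int[] targetIds = new int[16];
  private double[] weights = new double[16];
  private int size = 0;

  public void addEdge(int targetId, double weight) {
    if (size == targetIds.length) {
      // grow both arrays together; amortized O(1) per added edge
      targetIds = Arrays.copyOf(targetIds, size * 2);
      weights = Arrays.copyOf(weights, size * 2);
    }
    targetIds[size] = targetId;
    weights[size] = weight;
    size++;
  }

  public int numEdges() { return size; }
  public int targetId(int i) { return targetIds[i]; }
  public double weight(int i) { return weights[i]; }
}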
