Greetings Giraphians!

  I'm trying out some simple PageRank tests of Giraph on our cluster
here at Twitter, and I'm wondering what data-size blow-up is usually
expected going from the on-disk to the in-memory graph representation.  I
tried running a pretty tiny input (a single 2GB part-file, which had 8
splits): a SequenceFile of my own binary data (if you're curious, it's a
Mahout SequenceFile<IntWritable, VectorWritable>), which stores the data
pretty minimally - on disk the vertex id is a primitive int, each target
vertex id is also just an int, and each edge carries only an 8-byte double
as payload.
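
  For concreteness, here's roughly how I read that data today - one
vertex id per record, plus a sparse vector whose non-zero entries are the
(target id, weight) pairs.  Nothing Giraph-specific, and the class name is
just made up:

// Sketch only: dump the adjacency data out of one Mahout part-file.
// "DumpAdjacency" is a made-up name, not anything in Giraph or Mahout.
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DumpAdjacency {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    IntWritable vertexId = new IntWritable();   // 4-byte source vertex id
    VectorWritable row = new VectorWritable();  // sparse row: target id -> weight
    while (reader.next(vertexId, row)) {
      Iterator<Vector.Element> edges = row.get().iterateNonZero();
      while (edges.hasNext()) {
        Vector.Element e = edges.next();
        // e.index() is the 4-byte target id, e.get() the 8-byte double weight
        System.out.println(vertexId.get() + " -> " + e.index() + " = " + e.get());
      }
    }
    reader.close();
  }
}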

  But we've got 3GB of RAM for our mappers, and some of my 8 workers are
running out of memory.  Even if the *entire* part file were in one split,
it's only 2GB on disk, so I'm wondering how much attention has been paid to
memory usage in the abstract base class org.apache.giraph.graph.Vertex?  It
looks like, because it's very flexible about the types of the vertices and
edges, it keeps the out-edges in a big TreeMap, so each int-double pair
(dest vertex id + edge weight) gets turned into a bunch of Java objects,
and that's where the blow-up is coming from?
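
  Back-of-the-envelope (rough guesses for a 64-bit JVM, not measured
numbers): on disk an edge is ~12 bytes (4-byte target id + 8-byte weight),
but boxed up it's at least a TreeMap.Entry plus an IntWritable plus a
DoubleWritable, each with its own object header and pointers - easily
80-100 bytes per edge, i.e. a 5-10x blow-up before messages are even
counted.  Roughly the difference between these two layouts:

// Not Giraph code - just illustrating the two per-edge representations.
// Byte counts in the comments are rough 64-bit JVM guesses, not measurements.
import java.util.TreeMap;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

public class EdgeMemorySketch {
  public static void main(String[] args) {
    int numEdges = 1000000;

    // Boxed layout: per edge ~ TreeMap.Entry (~40-64 bytes) + IntWritable (~16)
    // + DoubleWritable (~24), ignoring any wrapper objects Giraph itself adds.
    TreeMap<IntWritable, DoubleWritable> boxed =
        new TreeMap<IntWritable, DoubleWritable>();
    for (int i = 0; i < numEdges; i++) {
      boxed.put(new IntWritable(i), new DoubleWritable(1.0 / numEdges));
    }
    System.out.println("boxed edges: " + boxed.size());

    // Primitive layout: 4 + 8 = 12 bytes per edge, same as on disk
    // (plus two array headers), if the target ids are kept sorted for lookup.
    int[] targetIds = new int[numEdges];
    double[] weights = new double[numEdges];
    for (int i = 0; i < numEdges; i++) {
      targetIds[i] = i;
      weights[i] = 1.0 / numEdges;
    }
    System.out.println("primitive edges: " + targetIds.length
        + " (weights: " + weights.length + ")");
  }
}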

  I wonder if a few special-purpose Java-primitive MutableVertex
implementations would be useful for me to contribute, to conserve a bit of
memory?  If I'm mistaken in my assumptions here (or there is already work
done on this), just let me know.  But if not, I'd love to help get Giraph
running on some nice beefy data sets (with simple data models: vertex ids
being just ints / longs, and edge weights and messages to pass being
similarly just booleans, floats, or doubles), because I've got some stuff
I'd love to throw in memory and crank some distributed computations on. :)
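
  To be concrete, the kind of layout I have in mind is just parallel
primitive arrays (purely a sketch - the class and method names below are
made up, not actual Giraph API):

// Hypothetical edge storage for an int-id / double-weight specialized vertex.
// Names are made up; this is not an existing Giraph class.
import java.util.Arrays;

public class IntDoubleEdgeList {
  // Parallel primitive arrays: ~12 bytes per edge instead of a pile of objects.
  private int[] targetIds = new int[0];   // kept sorted for binary-search lookup
  private double[] weights = new double[0];
  private int size = 0;

  public void addEdge(int targetId, double weight) {
    if (size == targetIds.length) {
      int newCapacity = Math.max(4, targetIds.length * 2);
      targetIds = Arrays.copyOf(targetIds, newCapacity);
      weights = Arrays.copyOf(weights, newCapacity);
    }
    // Sketch assumes edges are added in increasing target-id order, which is
    // true when reading a row of my row-major Mahout matrix.
    targetIds[size] = targetId;
    weights[size] = weight;
    size++;
  }

  public double getEdgeWeight(int targetId) {
    int pos = Arrays.binarySearch(targetIds, 0, size, targetId);
    return pos >= 0 ? weights[pos] : 0.0;
  }

  public int numEdges() {
    return size;
  }
}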

  - jake / @pbrane
