Jake Mannix commented on GIRAPH-28:

Yeah, I don't know what's eating up the space in my Primitive class, but I 
*bet* it has something to do with my "wrapper" SortedMap I return when you call 
getOutEdgeMap().  This object *should* be lazily instantiated, but as it has a 
"get***" name to it, I wonder if the ObjectSizeCalculator is considering the 
returned value to be part of the object's memory?  If so, then it *should* be 
what we see.  Dmitriy, can you try re-running your *original* test, on the real 
LongDoubleFloatDoubleVertex, but with getOutEdgeMap() implemented as just 
"return null;", and similarly with getMsgList()?  If you test doesn't use those 
methods.  My guess is that the measured size will drop down to close to what 
your results for "Primitive Map" are.

Regarding long[] and float[], it's actually not a totally crazy implementation, 
for the "final state" of the in-memory representation: most graph algorithms 
leave the graph structure alone (ie don't add or delete edges during 
iteration), in which case compacting the Vertex down to a pair of parallel 
sorted arrays after loading from the VertexReader is not such a unreasonable 
situation.  Of course, it then requires that random-access reads to the 
outbound edges do a log(numOutEdges) binary search, but not all graph 
algorithms require lots of random access to specific edges (as opposed to bulk 
access to all edges).  

I could imagine two separate primitive implementations - one based on a 
map-like structure (a la OpenXxxYyyHashMap), which gives random access to the 
edges (but then *not* sorted access!  Note that the current code in my 
primitive vertex throws UnsupportedOperationException for lots of SortedMap 
methods, which happen to be non exercised in the unit tests), and one based on 
parallel arrays, which allows for very fast sequential read-only access, but 
slower random read-only access, and very much slower mutation speed.  I wrote 
implementations like this for the Mahout Vector interface, and named them 
almost exactly like that: RandomAccessSparseVector, and 
SequentialAccessSparseVector (although I should have called the latter 
something like ReadOnlySequentialAccessSparseVector, but it would have been a 
lie, as you *can* modify it, you just shouldn't).

These numbers now make a lot of sense - Objects Are Big, that's the moral of 
the story.  Any place you need to hang onto gazillions of things in memory on 
the JVM, you should do your best to stuff them into primitive arrays.  

> Introduce new primitive-specific MutableVertex subclasses
> ---------------------------------------------------------
>                 Key: GIRAPH-28
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-28
>             Project: Giraph
>          Issue Type: New Feature
>          Components: graph
>    Affects Versions: 0.70.0
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>         Attachments: GIRAPH-28.diff
> As discussed on the list, 
> MutableVertex<LongWritable,DoubleWritable,FloatWritable,DoubleWritable> (for 
> example) could be highly optimized in its memory footprint if the vertex and 
> edge data were held in a form which minimized Java object usage.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to