[ https://issues.apache.org/jira/browse/HAMA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580454#comment-13580454 ]

Thomas Jungblut edited comment on HAMA-704 at 2/18/13 6:46 AM:
---------------------------------------------------------------

Edward,

a single vertex is composed of multiple objects:
 - Vertex ID
 - Vertex Value
 - A List for the outgoing edges (backed by a default array of 10 object slots)
 - Plus 3 objects per Edge (the Edge itself, its ID and its Value).

Let's sum up just the Java object overhead (8 bytes per object [1]).

5 * 8 (for the private references, including the reference to the vertex itself) 
+ n * 8 * 3 (for the edges) = 24 * n + 40 bytes.

So you have 40 bytes of pure object overhead per vertex and 24 bytes per edge. 
If you assume that everything is an IntWritable (4 bytes of payload), you can 
calculate for yourself how much memory all vertices occupy in a sparse graph 
with 4 outlinks per vertex. (Tip: it is nearly a KB per vertex(!)).
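
Just to make the arithmetic concrete, here is a tiny back-of-the-envelope sketch 
of that estimate (the 8 byte header and 4 byte IntWritable payload are the 
assumptions from above, not measured values from Hama):

    // Back-of-the-envelope estimate of the formula above.
    public class VertexOverheadEstimate {
      static final int HEADER = 8; // assumed Java object overhead [1]

      // 5 * 8 for the vertex-side references + 3 objects per edge = 40 + 24 * n
      static long overheadBytes(int n) {
        return 5 * HEADER + n * 3 * HEADER;
      }

      public static void main(String[] args) {
        // 4 outlinks: 40 + 24 * 4 = 136 bytes of pure object overhead per vertex,
        // before the IntWritable payloads, the edge list's backing array of 10
        // slots and any alignment padding are counted.
        System.out.println(overheadBytes(4) + " bytes overhead per vertex");
      }
    }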

bq. I don't know why you think "disk-based vertices" is helpful?

It is helpful because you can hold just a single vertex in memory and constantly 
refill it with data from disk. The problem Pregel has is that vertex/edge values 
can (and should) change during the computation. So you would either have to 
rewrite the whole file in every iteration (that is MapReduce without sorting), 
or keep multiple files and constantly rewrite a smaller "value file" that 
contains only the vertex values. On SSD systems (rarely used, but your BDA has 
them?) you can use a random access file to change the value directly (that is 
what GraphChi does); this file can be mmap'ed by the OS into main memory 
whenever it is needed and enough memory is available.
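
Just as a rough sketch of that last idea (not what Hama does today; the class 
name and the fixed 4-byte int value per vertex are assumptions), such a value 
file could be mmap'ed and updated in place like this:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Fixed-width "value file": the value of vertex i lives at byte offset i * 4.
    public class ValueFile implements AutoCloseable {
      private final RandomAccessFile file;
      private final MappedByteBuffer values;

      public ValueFile(String path, int numVertices) throws IOException {
        file = new RandomAccessFile(path, "rw");
        // Map the whole region; the OS pages it into memory as needed.
        values = file.getChannel().map(
            FileChannel.MapMode.READ_WRITE, 0, 4L * numVertices);
      }

      public int getValue(int vertexIndex) {
        return values.getInt(vertexIndex * 4);
      }

      public void setValue(int vertexIndex, int value) {
        // Writes hit the mapped page; the OS flushes them back to disk.
        values.putInt(vertexIndex * 4, value);
      }

      @Override
      public void close() throws IOException {
        file.close();
      }
    }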

The question is also whether memory should only be allocated for fast messaging, 
with the computation done completely on disk plus a cache for frequently active 
vertices, or whether everything should be kept in memory first and spilled to 
disk when needed.
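
If the cache route is taken, a minimal sketch of an LRU cache for the active 
vertices could look like this (the spill hook and the vertex type are 
hypothetical, not Hama's actual API):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Keeps the most recently active vertices in memory and evicts the rest.
    public class ActiveVertexCache<ID, VERTEX> extends LinkedHashMap<ID, VERTEX> {
      private final int capacity;

      public ActiveVertexCache(int capacity) {
        super(capacity, 0.75f, true); // access order = LRU eviction
        this.capacity = capacity;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<ID, VERTEX> eldest) {
        if (size() > capacity) {
          spillToDisk(eldest.getKey(), eldest.getValue()); // hypothetical hook
          return true;
        }
        return false;
      }

      private void spillToDisk(ID id, VERTEX vertex) {
        // Would rewrite this vertex's slot in the on-disk vertex file; omitted.
      }
    }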

Good luck with that.

[1] 
http://stackoverflow.com/questions/258120/what-is-the-memory-consumption-of-an-object-in-java
                
> Optimization of memory usage during message processing
> ------------------------------------------------------
>
>                 Key: HAMA-704
>                 URL: https://issues.apache.org/jira/browse/HAMA-704
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>            Reporter: Edward J. Yoon
>            Assignee: Edward J. Yoon
>            Priority: Critical
>             Fix For: 0.6.1
>
>         Attachments: hama-704_v05.patch, localdisk.patch, mytest.patch, 
> patch.txt, patch.txt, removeMsgMap.patch
>
>
> The <vertex, message> map seems to consume a lot of memory. We should figure out 
> an efficient way to reduce memory usage.
