[
https://issues.apache.org/jira/browse/HAMA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13580454#comment-13580454
]
Thomas Jungblut edited comment on HAMA-704 at 2/18/13 6:46 AM:
---------------------------------------------------------------
Edward,
a single vertex is composed of multiple objects (see the sketch after this list):
- Vertex ID
- Vertex Value
- A List for the outgoing edges (backed by a default array of 10 object slots)
- Plus 3 objects per Edge (Edge itself, ID and Value).
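For illustration, here is a minimal sketch of what such a vertex looks like as plain Java objects. The class and field names are hypothetical (not the actual Hama API), but the object count matches the list above: five objects for the vertex itself plus three per edge.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the objects behind one in-memory vertex.
// One vertex instance drags along: itself, its ID, its value, the edge
// list and the list's backing array (5 objects), plus 3 objects per edge.
public class SketchVertex<ID, VALUE, EDGE_VALUE> {

  public static class SketchEdge<ID, EDGE_VALUE> {   // object 1 per edge
    ID destinationId;                                 // object 2 per edge
    EDGE_VALUE cost;                                  // object 3 per edge
  }

  ID vertexId;                                        // object: vertex ID
  VALUE value;                                        // object: vertex value
  List<SketchEdge<ID, EDGE_VALUE>> edges =            // object: the list itself
      new ArrayList<SketchEdge<ID, EDGE_VALUE>>(10);  // ... plus its default backing array of 10 slots
}
{code}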
Let's sum up the memory that is pure Java object overhead (8 bytes per object [1]):
5 * 8 (for the vertex's five objects, including the vertex instance itself) + n * 3 * 8 (for the
edges) = 24 * n + 40 bytes.
So you have 40 bytes of pure object overhead for each vertex and 24 bytes for each edge.
If you assume that everything is an IntWritable (4 bytes of payload), you can calculate
for yourself how much memory is occupied by all vertices in a sparse graph with 4 outlinks
per vertex. (Tip: it is nearly a KB per vertex(!)).
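As a quick back-of-the-envelope check of the formula above, plugging in n = 4 outlinks; the 8-byte header figure is the one from [1], and the numbers are rough estimates of pure object overhead, not measured heap sizes.
{code:java}
public class OverheadEstimate {
  public static void main(String[] args) {
    final int headerBytes = 8; // assumed per-object overhead, see [1]
    final int outlinks = 4;    // sparse graph: 4 outgoing edges per vertex

    long perVertex = 5L * headerBytes;            // 40 bytes (vertex, ID, value, list, array)
    long perEdges = 3L * headerBytes * outlinks;  // 96 bytes (edge, ID, value for each edge)

    System.out.println("pure object overhead per vertex: "
        + (perVertex + perEdges) + " bytes");     // prints 136
    // The actual payloads (4 bytes per IntWritable), the unused slots of the
    // default 10-element backing array and JVM padding/alignment come on top
    // of this, so the real per-vertex footprint is considerably larger.
  }
}
{code}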
bq. I don't know why you think "disk-based vertices" is helpful?
It is helpful because you only have to hold a single vertex in memory and constantly refill
it with data from disk. The problem Pregel has is that vertex/edge values can (and should)
change during computation. So you would either have to rewrite the whole file in every
iteration (that is MapReduce without sorting), or use multiple files and constantly rewrite
a smaller "value file" that contains only the vertex values. On SSD systems (rarely used,
but your BDA has them?) you can use a random access file to change a value in place (that is
what GraphChi does); the OS can mmap this file into main memory whenever it is needed and
enough memory is available.
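As a rough sketch of the random-access value-file idea (not GraphChi's actual code), updating the value of vertex i could look like this, assuming a hypothetical fixed-width file that stores one 4-byte int value per vertex in vertex-ID order:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical in-place update of a "value file": one 4-byte int value per
// vertex, laid out contiguously, so vertex i's value lives at offset i * 4.
public class ValueFileUpdater {
  private static final int VALUE_SIZE = 4; // bytes per vertex value (int)

  public static void writeValue(RandomAccessFile file, long vertexIndex, int newValue)
      throws IOException {
    file.seek(vertexIndex * VALUE_SIZE); // jump directly to this vertex's slot
    file.writeInt(newValue);             // overwrite the value in place
  }

  public static int readValue(RandomAccessFile file, long vertexIndex)
      throws IOException {
    file.seek(vertexIndex * VALUE_SIZE);
    return file.readInt();
  }
}
{code}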
The question is also whether memory should only be allocated for fast messaging, with the
computation done completely on disk plus a cache for frequently active vertices, or whether
everything should be kept in memory first and spilled to disk when needed.
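If one went the disk-plus-cache route, a minimal cache for frequently active vertices could be as simple as an access-ordered LinkedHashMap that evicts the least recently used vertex; the types and capacity below are placeholders, not a proposal for the actual Hama implementation.
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: keeps at most `capacity` vertices in memory,
// evicting the least recently accessed one when a new vertex is loaded.
public class VertexCache<ID, V> extends LinkedHashMap<ID, V> {
  private final int capacity;

  public VertexCache(int capacity) {
    super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<ID, V> eldest) {
    // When this returns true, the eldest (least recently used) entry is
    // dropped; a real implementation would spill it to disk here first.
    return size() > capacity;
  }
}
{code}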
Good luck with that.
[1]
http://stackoverflow.com/questions/258120/what-is-the-memory-consumption-of-an-object-in-java
> Optimization of memory usage during message processing
> ------------------------------------------------------
>
> Key: HAMA-704
> URL: https://issues.apache.org/jira/browse/HAMA-704
> Project: Hama
> Issue Type: Improvement
> Components: graph
> Reporter: Edward J. Yoon
> Assignee: Edward J. Yoon
> Priority: Critical
> Fix For: 0.6.1
>
> Attachments: hama-704_v05.patch, localdisk.patch, mytest.patch,
> patch.txt, patch.txt, removeMsgMap.patch
>
>
> The <vertex, message> map seems to consume a lot of memory. We should figure out an
> efficient way to reduce memory usage.