[
https://issues.apache.org/jira/browse/HAMA-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456630#comment-13456630
]
Thomas Jungblut commented on HAMA-642:
--------------------------------------
Okay, I failed, mainly because of the caching of a single instance.
I replaced this with normal serialization that creates new objects every time. This
is quite sad, because it deserializes full buckets at runtime.
I also removed the internal locking system, because we don't need multithreaded
access.
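To illustrate the trade-off between refilling one instance and creating new objects, here is a minimal sketch. It is not the actual Hama or storage code: `SimpleVertex`, `readReusing` and `readFresh` are hypothetical names, and it assumes the problem is that anything holding a reference to the single reused instance (a cache, for example) sees it overwritten by the next read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseVsFresh {

  // Hypothetical stand-in for a vertex; not Hama's Vertex class.
  static final class SimpleVertex {
    long id;
    double value;

    void write(DataOutputStream out) throws IOException {
      out.writeLong(id);
      out.writeDouble(value);
    }

    void readFields(DataInputStream in) throws IOException {
      id = in.readLong();
      value = in.readDouble();
    }
  }

  // Serializes three vertices with ids 0, 1, 2 into a byte array.
  static byte[] sampleData() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (long i = 0; i < 3; i++) {
        SimpleVertex v = new SimpleVertex();
        v.id = i;
        v.value = i * 0.5;
        v.write(out);
      }
    }
    return bytes.toByteArray();
  }

  // Refills one shared instance: no allocation per record,
  // but every reference that was kept points at the same object.
  static List<SimpleVertex> readReusing(byte[] data, int count) throws IOException {
    List<SimpleVertex> held = new ArrayList<>();
    SimpleVertex shared = new SimpleVertex();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      for (int i = 0; i < count; i++) {
        shared.readFields(in);
        held.add(shared);
      }
    }
    return held;
  }

  // Allocates a new object per record: more garbage,
  // but each kept reference stays valid.
  static List<SimpleVertex> readFresh(byte[] data, int count) throws IOException {
    List<SimpleVertex> held = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      for (int i = 0; i < count; i++) {
        SimpleVertex v = new SimpleVertex();
        v.readFields(in);
        held.add(v);
      }
    }
    return held;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = sampleData();
    // With reuse, the first kept reference now shows the last record read.
    System.out.println("reused: first id = " + readReusing(data, 3).get(0).id);
    System.out.println("fresh:  first id = " + readFresh(data, 3).get(0).id);
  }
}
```

In this sketch the reused variant reports id 2 for the first kept reference, while the fresh variant reports id 0. That is the kind of aliasing that makes a single refilled instance hard to combine with anything that retains objects.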
Here is my benchmark with a 10 GB graph file: write 12 GB, then read 10k keys
randomly (4 GHz processor + 7.2k RPM disk):
{noformat}
Writing of ~12GB took 447s!
Null entries? 0
Random read of 10.000 keys took 155s!
PAGES:
3008231 used pages with size 12GB
7220 record translation pages with size 30MB
0 free (unused) pages with size 0B
31 free (phys) pages with size 124KB
0 free (logical) pages with size 0B
Total number of pages is 3015482 with size 12GB
RECORDS:
Contains 4909036 records and 564 free slots.
Total space occupied by data is 12GB
Average data size in record is 2492B
Maximal data size in record is 526KB
Space wasted in record fragmentation is 25MB
Maximal space wasted in single record fragmentation is 253B
{noformat}
Same with 100k key lookups:
{noformat}
Writing of ~12GB took 367s!
Null entries? 0
Random read of 100.000 keys took 352s!
PAGES:
3008231 used pages with size 12GB
7220 record translation pages with size 30MB
0 free (unused) pages with size 0B
31 free (phys) pages with size 124KB
0 free (logical) pages with size 0B
Total number of pages is 3015482 with size 12GB
RECORDS:
Contains 4909036 records and 564 free slots.
Total space occupied by data is 12GB
Average data size in record is 2492B
Maximal data size in record is 526KB
Space wasted in record fragmentation is 25MB
Maximal space wasted in single record fragmentation is 253B
{noformat}
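To put the two runs side by side, the following small sketch derives per-lookup latency and write throughput from the numbers reported above. The class and method names are made up for illustration, and it assumes "~12GB" means 12 * 1024 MB, so the throughput figures are only approximate.

```java
public class BenchmarkMath {

  // Average wall-clock time per lookup in milliseconds.
  static double millisPerLookup(double totalSeconds, long lookups) {
    return totalSeconds * 1000.0 / lookups;
  }

  // Average write throughput in MB/s, treating 1 GB as 1024 MB (assumption).
  static double writeMegabytesPerSecond(double gigabytes, double seconds) {
    return gigabytes * 1024.0 / seconds;
  }

  public static void main(String[] args) {
    // Figures taken from the two benchmark outputs in this comment.
    System.out.printf("10k run:  %.2f ms/lookup, %.1f MB/s write%n",
        millisPerLookup(155, 10_000), writeMegabytesPerSecond(12, 447));
    System.out.printf("100k run: %.2f ms/lookup, %.1f MB/s write%n",
        millisPerLookup(352, 100_000), writeMegabytesPerSecond(12, 367));
  }
}
```

This works out to roughly 15.5 ms per lookup for the 10k run and about 3.5 ms per lookup for the 100k run, with write throughput of roughly 27 MB/s and 33 MB/s respectively. The output does not say why the average lookup cost drops in the larger run.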
I will set up a GraphJobRunner prototype and see if I can crunch my 10 GB file
without any big RAM problems.
> Make GraphRunner disk based
> ---------------------------
>
> Key: HAMA-642
> URL: https://issues.apache.org/jira/browse/HAMA-642
> Project: Hama
> Issue Type: Improvement
> Components: graph
> Reporter: Thomas Jungblut
>
> To improve scalability we can improve the graph runner to be disk based.
> Which basically means:
> - We have just a single Vertex instance that gets refilled.
> - We directly write vertices to disk after partitioning
> - In every superstep we iterate over the vertices on disk, fill the vertex
> instance and call the users compute functions
> Problems:
> - State other than vertex value can't be stored easily
> - How do we deal with random access after messages have arrived?
> So I think we should make the graph runner more hybrid, like using the queues
> we have implemented in the messaging. So the graphrunner can be configured to
> run completely on disk, in cached mode or in in-memory mode.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira