[jira] [Commented] (HAMA-642) Make GraphRunner disk based

Thomas Jungblut (JIRA) Sun, 16 Sep 2012 12:00:09 -0700

    [ https://issues.apache.org/jira/browse/HAMA-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456630#comment-13456630 ]


Thomas Jungblut commented on HAMA-642:
--------------------------------------

Okay, I failed, mainly because of the caching of a single instance.
I replaced it with normal serialization that creates new objects every time. This 
is quite sad, because it deserializes full buckets at runtime.
I also removed the internal locking system, because we don't need multithreaded 
access.

Here is my benchmark with a 10 GB graph file: write ~12 GB, then read 10k keys 
randomly (4 GHz processor + 7.2k RPM disk):

{noformat}
Writing of ~12GB took 447s!
Null entries? 0
Random read of 10.000 keys took 155s!
PAGES:
  3008231 used pages with size 12GB
  7220 record translation pages with size 30MB
  0 free (unused) pages with size 0B
  31 free (phys) pages with size 124KB
  0 free (logical) pages with size 0B
  Total number of pages is 3015482 with size 12GB
RECORDS:
  Contains 4909036 records and 564 free slots.
  Total space occupied by data is 12GB
  Average data size in record is 2492B
  Maximal data size in record is 526KB
  Space wasted in record fragmentation is 25MB
  Maximal space wasted in single record fragmentation is 253B

{noformat}


Same with 100k key lookups:

{noformat}

Writing of ~12GB took 367s!
Null entries? 0
Random read of 100.000 keys took 352s!
PAGES:
  3008231 used pages with size 12GB
  7220 record translation pages with size 30MB
  0 free (unused) pages with size 0B
  31 free (phys) pages with size 124KB
  0 free (logical) pages with size 0B
  Total number of pages is 3015482 with size 12GB
RECORDS:
  Contains 4909036 records and 564 free slots.
  Total space occupied by data is 12GB
  Average data size in record is 2492B
  Maximal data size in record is 526KB
  Space wasted in record fragmentation is 25MB
  Maximal space wasted in single record fragmentation is 253B

{noformat}
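
To put the two runs side by side, the totals above can be turned into per-lookup and per-second figures. This is only a small arithmetic sketch: the class and method names are made up for illustration, and the write throughput is approximate because the output reports "~12GB" and the sketch assumes 1 GB = 1024 MB.

```java
public class BenchmarkMath {

    /** Average milliseconds per lookup, given total seconds and number of lookups. */
    public static double millisPerLookup(double totalSeconds, long lookups) {
        return totalSeconds * 1000.0 / lookups;
    }

    /** Approximate write throughput in MB/s, assuming 1 GB = 1024 MB. */
    public static double writeMegabytesPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1024.0 / seconds;
    }

    public static void main(String[] args) {
        // Numbers taken from the two benchmark outputs above.
        System.out.println("10k reads, ms per lookup:  " + millisPerLookup(155, 10_000));
        System.out.println("100k reads, ms per lookup: " + millisPerLookup(352, 100_000));
        System.out.println("first write, MB/s:  " + writeMegabytesPerSecond(12, 447));
        System.out.println("second write, MB/s: " + writeMegabytesPerSecond(12, 367));
    }
}
```

That works out to about 15.5 ms per lookup in the 10k run and about 3.5 ms per lookup in the 100k run, with writes at roughly 27 MB/s and 33 MB/s.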

I will set up a GraphJobRunner prototype and see if I can crunch my 10 GB file 
without any big RAM problems.
                
> Make GraphRunner disk based
> ---------------------------
>
>                 Key: HAMA-642
>                 URL: https://issues.apache.org/jira/browse/HAMA-642
>             Project: Hama
>          Issue Type: Improvement
>          Components: graph
>            Reporter: Thomas Jungblut
>
> To improve scalability, we can make the graph runner disk based, which 
> basically means:
> - We have just a single Vertex instance that gets refilled.
> - We directly write vertices to disk after partitioning.
> - In every superstep we iterate over the vertices on disk, fill the vertex 
> instance and call the user's compute function.
> Problems:
> - State other than the vertex value can't be stored easily.
> - How do we deal with random access after messages have arrived?
> So I think we should make the graph runner more of a hybrid, like the queues 
> we have implemented in the messaging, so that it can be configured to run 
> completely on disk, in cached mode, or in in-memory mode.
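
The "single Vertex instance that gets refilled" idea from the description could look roughly like the sketch below. It is not Hama code: `ReusableVertex`, `writeVertices`, `superstep` and `runDemo` are hypothetical names, the record layout (a long id plus a double value) is an assumption, and summing the values merely stands in for the user's compute function.

```java
import java.io.*;
import java.nio.file.*;

public class DiskVertexSketch {

    /** Hypothetical vertex with a fixed layout; not Hama's Vertex class. */
    static final class ReusableVertex {
        long id;
        double value;

        void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeDouble(value);
        }

        /** Overwrites this instance's fields with the next record on disk. */
        void readFields(DataInput in) throws IOException {
            id = in.readLong();
            value = in.readDouble();
        }
    }

    /** Writes n vertices sequentially, as would happen after partitioning. */
    static void writeVertices(Path file, int n) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            ReusableVertex v = new ReusableVertex();
            for (int i = 0; i < n; i++) {
                v.id = i;
                v.value = i * 0.5;
                v.write(out);
            }
        }
    }

    /** One superstep: stream the vertices from disk through a single instance. */
    static double superstep(Path file, int n) throws IOException {
        double sum = 0;
        ReusableVertex v = new ReusableVertex(); // the only instance allocated here
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < n; i++) {
                v.readFields(in);
                sum += v.value; // stand-in for calling the user's compute function
            }
        }
        return sum;
    }

    /** Writes n vertices to a temp file, runs one superstep, and cleans up. */
    public static double runDemo(int n) {
        try {
            Path file = Files.createTempFile("vertices", ".bin");
            try {
                writeVertices(file, n);
                return superstep(file, n);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4));
    }
}
```

Because `readFields` overwrites the same object on every iteration, anything that keeps a reference to the vertex across iterations sees its contents change. The sketch also only reads sequentially, so it does not address the random-access question raised under "Problems".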

