Re: [jira] [Commented] (HAMA-642) Make GraphRunner disk based

Edward J. Yoon Fri, 28 Sep 2012 15:38:52 -0700

> - Does this fail always or just sometimes?

Always.

> - When it finishes, is the result wrong? Just curious, how do you compare 20gb 
> of text files?;D

Never finishes.

> - In case it is really the combiner, does pagerank work without problems?

Never finishes if input is large.


On Sep 29, 2012, at 5:07 AM, "Thomas Jungblut (JIRA)" <[email protected]> wrote:

> 
>    [ 
> https://issues.apache.org/jira/browse/HAMA-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465866#comment-13465866
>  ] 
> 
> Thomas Jungblut commented on HAMA-642:
> --------------------------------------
> 
> A race is not good. We have to investigate a bit deeper, I guess. I don't 
> think that there is a concurrency problem inside of jdbm, but I will have a 
> look; maybe there are some resources that are static. However, each task has 
> its own mutually exclusive "database", so I don't see a problem there. 
> 
> My first guess was the use of the combiner. So here are my questions:
> - Does this fail always or just sometimes?
> - When it finishes, is the result wrong? Just curious, how do you compare 20gb 
> of text files?;D
> - In case it is really the combiner, does pagerank work without problems?
> 
> I will build a smaller cluster in the near future to test these things more 
> efficiently.
> 
>> Make GraphRunner disk based
>> ---------------------------
>> 
>>                Key: HAMA-642
>>                URL: https://issues.apache.org/jira/browse/HAMA-642
>>            Project: Hama
>>         Issue Type: Improvement
>>         Components: graph
>>   Affects Versions: 0.5.0
>>           Reporter: Thomas Jungblut
>>           Assignee: Edward J. Yoon
>>        Attachments: HAMA-642_unix_1.patch, HAMA-642_unix_2.patch, 
>> HAMA-scale_1.patch, HAMA-scale_2.patch, HAMA-scale_3.patch, 
>> HAMA-scale_4.patch
>> 
>> 
>> To improve scalability we can make the graph runner disk based, which 
>> basically means:
>> - We have just a single Vertex instance that gets refilled.
>> - We directly write vertices to disk after partitioning.
>> - In every superstep we iterate over the vertices on disk, fill the vertex 
>> instance and call the user's compute function.
>> Problems:
>> - State other than the vertex value can't be stored easily.
>> - How do we deal with random access after messages have arrived?
>> So I think we should make the graph runner more hybrid, like using the 
>> queues we have implemented in the messaging. So the graph runner can be 
>> configured to run completely on disk, in cached mode or in in-memory mode.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
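
For readers following the issue description quoted above, here is a rough sketch of the disk-based loop it outlines: vertices are written to disk once, and each superstep streams them back through a single reusable vertex instance. Everything in it is an assumption made for illustration (the class DiskVertexSketch, ReusableVertex, writePartition, superstep, demo and the record layout are hypothetical); it is not Hama's GraphJobRunner or Vertex API, and it does not address the open problems named in the issue, such as random access once messages have arrived.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names are Hama's actual API. */
public class DiskVertexSketch {

    /** The single vertex instance that gets refilled for every record on disk. */
    static final class ReusableVertex {
        long id;
        double value;
        long[] edges = new long[0];

        void readFields(DataInputStream in) throws IOException {
            id = in.readLong();
            value = in.readDouble();
            edges = new long[in.readInt()];
            for (int i = 0; i < edges.length; i++) edges[i] = in.readLong();
        }

        void write(DataOutputStream out) throws IOException {
            out.writeLong(id);
            out.writeDouble(value);
            out.writeInt(edges.length);
            for (long e : edges) out.writeLong(e);
        }
    }

    /** Writes vertices straight to disk, as would happen after partitioning. */
    static void writePartition(Path file, long[] ids, double[] values, long[][] edges) {
        ReusableVertex v = new ReusableVertex();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int i = 0; i < ids.length; i++) {
                v.id = ids[i];
                v.value = values[i];
                v.edges = edges[i];
                v.write(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One superstep: stream from in, refill the vertex, compute, write to out. */
    static int superstep(Path in, Path out, Consumer<ReusableVertex> compute) {
        ReusableVertex v = new ReusableVertex();
        int processed = 0;
        try (DataInputStream dis = new DataInputStream(
                 new BufferedInputStream(Files.newInputStream(in)));
             DataOutputStream dos = new DataOutputStream(
                 new BufferedOutputStream(Files.newOutputStream(out)))) {
            while (true) {
                try {
                    v.readFields(dis);   // refill the same instance
                } catch (EOFException endOfFile) {
                    break;               // no more records in this partition
                }
                compute.accept(v);       // the user's compute function
                v.write(dos);            // persist the updated vertex state
                processed++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return processed;
    }

    /** Runs one superstep over three vertices and returns the sum of the new values. */
    public static double demo() {
        try {
            Path a = Files.createTempFile("vertices", ".bin");
            Path b = Files.createTempFile("vertices", ".bin");
            writePartition(a, new long[] {1, 2, 3}, new double[] {1.0, 2.0, 3.0},
                    new long[][] {{2, 3}, {3}, {}});
            superstep(a, b, v -> v.value = v.value * 2 + v.edges.length);
            double[] sum = {0.0};
            superstep(b, a, v -> sum[0] += v.value);  // second pass only reads values
            Files.delete(a);
            Files.delete(b);
            return sum[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only one vertex object is alive per task at any time, so memory use stays flat regardless of partition size. The cost is a full sequential read and write of the partition in every superstep, which is the trade-off the proposed hybrid (disk, cached, in-memory) modes are meant to tune.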
