So it always fails if the input is large? ^^ Do you have stack traces? If there are filesystem problems, this isn't unexpected... Maybe a disk filled up.

2012/9/29 Edward J. Yoon <[email protected]>

> > - Does this fail always or just sometimes?
>
> Always
>
> > - When it finishes, is the result wrong? Just curios, how do you compare 20gb of text files?;D
>
> Never finishes.
>
> > - In case it is really the combiner, does pagerank work without problems?
>
> Never finishes if input is large.
>
> Sent from my iPad
>
> On Sep 29, 2012, at 5:07 AM, "Thomas Jungblut (JIRA)" <[email protected]> wrote:
>
> >
> > [ https://issues.apache.org/jira/browse/HAMA-642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465866#comment-13465866 ]
> >
> > Thomas Jungblut commented on HAMA-642:
> > --------------------------------------
> >
> > A race is not good. We have to investigate a bit deeper I guess. I don't think that there is a concurrency problem inside of jdbm, but I will have a look, maybe there is some resources that is static, however each task has its own mutal exclusive "database". so I don't see a problem there.
> >
> > My first guess was the use of the combiner. So here my questions:
> > - Does this fail always or just sometimes?
> > - When it finishes, is the result wrong? Just curios, how do you compare 20gb of text files?;D
> > - In case it is really the combiner, does pagerank work without problems?
> >
> > I will build a smaller cluster in near future to test these things more efficiently.
> >
> >> Make GraphRunner disk based
> >> ---------------------------
> >>
> >> Key: HAMA-642
> >> URL: https://issues.apache.org/jira/browse/HAMA-642
> >> Project: Hama
> >> Issue Type: Improvement
> >> Components: graph
> >> Affects Versions: 0.5.0
> >> Reporter: Thomas Jungblut
> >> Assignee: Edward J. Yoon
> >> Attachments: HAMA-642_unix_1.patch, HAMA-642_unix_2.patch, HAMA-scale_1.patch, HAMA-scale_2.patch, HAMA-scale_3.patch, HAMA-scale_4.patch
> >>
> >>
> >> To improve scalability we can improve the graph runner to be disk based.
> >> Which basically means:
> >> - We have just a single Vertex instance that get's refilled.
> >> - We directly write vertices to disk after partitioning
> >> - In every superstep we iterate over the vertices on disk, fill the vertex instance and call the users compute functions
> >> Problems:
> >> - State other than vertex value can't be stored easy
> >> - How do we deal with random access after messages have arrived?
> >> So I think we should make the graph runner more hybrid, like using the queues we have implemented in the messaging. So the graphrunner can be configured to run completely on disk, in cached mode or in in-memory mode.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA administrators
> > For more information on JIRA, see: http://www.atlassian.com/software/jira
>
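
For reference, the disk-based idea from the quoted issue description (write vertices to disk once, then refill a single vertex instance in every superstep) could look roughly like the sketch below. This is only an illustration under my own assumptions: DiskVertexSketch, ReusableVertex, writeVertex, superstep and compute are hypothetical names, the record layout is made up, and none of it is taken from the HAMA-642 patches or from Hama's actual Vertex API.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskVertexSketch {

    // Hypothetical stand-in for a vertex; NOT Hama's Vertex class.
    static final class ReusableVertex {
        long id;
        double value;
        long[] edges = new long[0];
        int edgeCount;

        // Overwrite this instance with the next record from the stream.
        void readFields(DataInputStream in) throws IOException {
            id = in.readLong();
            value = in.readDouble();
            edgeCount = in.readInt();
            if (edges.length < edgeCount) {
                edges = new long[edgeCount]; // grow only when needed
            }
            for (int i = 0; i < edgeCount; i++) {
                edges[i] = in.readLong();
            }
        }
    }

    // Assumed record layout: id, value, edge count, edge target ids.
    static void writeVertex(DataOutputStream out, long id, double value,
                            long[] edges) throws IOException {
        out.writeLong(id);
        out.writeDouble(value);
        out.writeInt(edges.length);
        for (long e : edges) {
            out.writeLong(e);
        }
    }

    // Placeholder for the user's compute function.
    static double compute(ReusableVertex v) {
        return v.value * v.edgeCount;
    }

    // One superstep: stream over the file and refill the single instance.
    static double superstep(Path file, int numVertices) throws IOException {
        ReusableVertex v = new ReusableVertex();
        double sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < numVertices; i++) {
                v.readFields(in);
                sum += compute(v);
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("vertices", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            writeVertex(out, 1L, 0.5, new long[] {2L, 3L});
            writeVertex(out, 2L, 0.25, new long[] {1L});
            writeVertex(out, 3L, 0.25, new long[] {});
        }
        System.out.println(superstep(file, 3));
        Files.delete(file);
    }
}
```

The sketch only reads. Writing updated vertex values back and looking up a vertex when a message arrives for it are exactly the two open problems listed in the issue, and they are left out here.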
