Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hama Wiki" for change notification.
The "WriteHamaGraphFile" page has been changed by thomasjungblut:
http://wiki.apache.org/hama/WriteHamaGraphFile?action=diff&rev1=3&rev2=4

  For this example, the Wikipedia link dataset is used (http://haselgrove.id.au/wikipedia.htm) / (http://users.on.net/~henry/pagerank/links-simple-sorted.zip).
- The dataset contains 5,716,808 pages and 130,160,392 links and is unzipped ~1gb large. You should use a smallish cluster to crunch this dataset with Hama, based on the blocksize of HDFS a slot number of 8-32 is required. We tell you later how to fine tune this to use fewer slots if you don't have them currently.
+ The dataset contains 5,716,808 pages and 130,160,392 links and is ~1 GB unzipped. You should use a smallish cluster to crunch this dataset with Hama; based on the HDFS block size, a slot count of 16-32 is required.
  The file is formatted like this

@@ -218, +218 @@

  '''Troubleshooting'''
  If your job does not execute, your cluster may not have enough resources (task slots).
- You can either increase them, or decrease the minimum split size by setting:
+ Symptoms may look like this in the BSP master log:
  {{{
- pageJob.set("bsp.min.split.size", (512 * 1024 * 1024) + "");
+ 2012-05-27 20:00:51,228 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+ 2012-05-27 20:00:51,288 INFO org.apache.hama.bsp.JobInProgress: num BSPTasks: 16
+ 2012-05-27 20:00:51,305 INFO org.apache.hama.bsp.JobInProgress: Job is initialized.
+ 2012-05-27 20:00:51,313 ERROR org.apache.hama.bsp.SimpleTaskScheduler: Scheduling of job Pagerank could not be done successfully. Killing it!
+ 2012-05-27 20:01:08,334 INFO org.apache.hama.bsp.JobInProgress: num BSPTasks: 16
+ 2012-05-27 20:01:08,339 INFO org.apache.hama.bsp.JobInProgress: Job is initialized.
+ 2012-05-27 20:01:08,340 ERROR org.apache.hama.bsp.SimpleTaskScheduler: Scheduling of job Pagerank could not be done successfully. Killing it!
  }}}
- This will set the split size to 512mb, thus having 2 tasks and not 32 or 16.
+ This was run on an 8-slot cluster, but 16 slots were required because of the 64 MB HDFS block size.
+ You can either re-upload the file with a larger block size so the number of splits matches your slots, or increase the number of slots in your Hama cluster.

  If you sort the result descending by pagerank, you can see the following top 10 sites:
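As a sanity check of the slot arithmetic in the troubleshooting note above: the number of input splits (and hence BSP tasks Hama tries to schedule) is roughly the file size divided by the split size, rounded up. The sketch below is not part of the wiki page; it is plain Java with no Hama dependency, using the figures from the diff (~1 GB file, 64 MB blocks vs. a 512 MB split size).

```java
public class SplitMath {

    /** ceil(fileSizeBytes / splitSizeBytes) using integer arithmetic. */
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneGb    = 1024L * 1024 * 1024;  // the unzipped dataset is ~1 GB
        long block64m = 64L  * 1024 * 1024;   // HDFS block size in the example
        long split512m = 512L * 1024 * 1024;  // the larger split size from the removed snippet

        // 16 splits -> 16 task slots needed, which overwhelms an 8-slot cluster
        System.out.println(numSplits(oneGb, block64m));
        // 2 splits -> fits comfortably on an 8-slot cluster
        System.out.println(numSplits(oneGb, split512m));
    }
}
```

This is why the job in the log above demanded 16 BSPTasks: a ~1 GB input stored in 64 MB blocks yields 16 splits, and the scheduler kills the job when fewer slots are available.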
