[
https://issues.apache.org/jira/browse/HAMA-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Edward J. Yoon resolved HAMA-973.
---------------------------------
Resolution: Fixed
Fix Version/s: 0.7.1
> GraphJob and RandBench example works incorrectly when FT is enabled.
> --------------------------------------------------------------------
>
> Key: HAMA-973
> URL: https://issues.apache.org/jira/browse/HAMA-973
> Project: Hama
> Issue Type: Bug
> Components: bsp core
> Affects Versions: 0.7.0
> Reporter: Edward J. Yoon
> Assignee: Edward J. Yoon
> Priority: Critical
> Fix For: 0.7.1
>
> Attachments: patch.txt
>
>
> Today I tested the fault tolerance (FT) function with RandBench. FT itself
> works fine, but I found a bug in the RandBench program.
> {code}
> [root@cluster-0 hama-0.7.0]# bin/hama jar hama-examples-0.7.0.jar bench 100
> 100 100
> 15/09/03 12:59:57 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 15/09/03 12:59:58 INFO Configuration.deprecation: user.name is deprecated.
> Instead, use mapreduce.job.user.name
> 15/09/03 12:59:58 INFO bsp.BSPJobClient: Running job: job_201509031258_0002
> 15/09/03 13:00:01 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:00:22 INFO bsp.BSPJobClient: Current supersteps number: 2
> 15/09/03 13:00:26 INFO bsp.BSPJobClient: Current supersteps number: 5
> 15/09/03 13:00:29 INFO bsp.BSPJobClient: Current supersteps number: 11
> 15/09/03 13:00:32 INFO bsp.BSPJobClient: Current supersteps number: 16
> 15/09/03 13:00:35 INFO bsp.BSPJobClient: Current supersteps number: 21
> 15/09/03 13:00:38 INFO bsp.BSPJobClient: Current supersteps number: 28
> 15/09/03 13:00:41 INFO bsp.BSPJobClient: Current supersteps number: 35
> 15/09/03 13:00:44 INFO bsp.BSPJobClient: Current supersteps number: 42
> 15/09/03 13:00:47 INFO bsp.BSPJobClient: Current supersteps number: 49
> 15/09/03 13:00:50 INFO bsp.BSPJobClient: Current supersteps number: 56
> 15/09/03 13:02:05 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:02:08 INFO bsp.BSPJobClient: Current supersteps number: 56
> 15/09/03 13:02:11 INFO bsp.BSPJobClient: Current supersteps number: 0
> 15/09/03 13:02:20 INFO bsp.BSPJobClient: Current supersteps number: 57
> 15/09/03 13:02:23 INFO bsp.BSPJobClient: Current supersteps number: 61
> 15/09/03 13:02:26 INFO bsp.BSPJobClient: Current supersteps number: 67
> 15/09/03 13:02:29 INFO bsp.BSPJobClient: Current supersteps number: 72
> 15/09/03 13:02:32 INFO bsp.BSPJobClient: Current supersteps number: 77
> 15/09/03 13:02:35 INFO bsp.BSPJobClient: Current supersteps number: 84
> 15/09/03 13:02:38 INFO bsp.BSPJobClient: Current supersteps number: 91
> 15/09/03 13:02:41 INFO bsp.BSPJobClient: Current supersteps number: 97
> 15/09/03 13:02:44 INFO bsp.BSPJobClient: Current supersteps number: 106
> 15/09/03 13:02:47 INFO bsp.BSPJobClient: Current supersteps number: 113
> 15/09/03 13:02:50 INFO bsp.BSPJobClient: Current supersteps number: 125
> 15/09/03 13:02:53 INFO bsp.BSPJobClient: Current supersteps number: 134
> 15/09/03 13:02:56 INFO bsp.BSPJobClient: Current supersteps number: 144
> 15/09/03 13:02:59 INFO bsp.BSPJobClient: Current supersteps number: 152
> 15/09/03 13:03:02 INFO bsp.BSPJobClient: Current supersteps number: 156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: The total number of supersteps: 156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: Counters: 6
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:
> org.apache.hama.bsp.JobInProgress$JobCounter
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEPS=156
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: LAUNCHED_TASKS=160
> 15/09/03 13:03:05 INFO bsp.BSPJobClient:
> org.apache.hama.bsp.BSPPeerImpl$PeerCounter
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: SUPERSTEP_SUM=24960
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=1943366
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_SENT=1600000
> 15/09/03 13:03:05 INFO bsp.BSPJobClient: TOTAL_MESSAGES_RECEIVED=1600000
> Job Finished in 187.453 seconds
> {code}
> I ran it with the maximum number of iterations set to 100. At superstep 56, I
> killed one task manually and confirmed that the failed task was recovered
> automatically. However, the total number of supersteps was 156, not 100.
> The reason is simple: the loop counter i always starts from 0, so the recovered
> job ran another 100 supersteps on top of the 56 already completed. To fix this
> issue, we have to initialize i to (int) peer.getSuperstepCount().
> {code}
> public void bsp(
>     BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable,
>         BytesWritable> peer)
>     throws IOException, SyncException, InterruptedException {
>   byte[] dummyData = new byte[sizeOfMsg];
>   String[] peers = peer.getAllPeerNames();
>   for (int i = 0; i < nSupersteps; i++) {
> {code}
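To illustrate the counter problem described above, here is a minimal, self-contained sketch. It does not use the Hama API: the class ResumeLoopSketch and both methods are hypothetical, and the long parameter only stands in for the value that peer.getSuperstepCount() would report after recovery. It is not the attached patch.

```java
public class ResumeLoopSketch {

  /** Buggy variant: the counter restarts at 0 after recovery. */
  public static long runFromZero(long superstepCount, int nSupersteps) {
    for (int i = 0; i < nSupersteps; i++) {
      superstepCount++; // stands in for one superstep ending with peer.sync()
    }
    return superstepCount;
  }

  /** Proposed variant: the counter resumes from the recovered superstep. */
  public static long runFromRecoveredCount(long superstepCount, int nSupersteps) {
    for (int i = (int) superstepCount; i < nSupersteps; i++) {
      superstepCount++; // stands in for one superstep ending with peer.sync()
    }
    return superstepCount;
  }

  public static void main(String[] args) {
    // A task killed at superstep 56 with nSupersteps = 100.
    System.out.println(runFromZero(56, 100));           // prints 156
    System.out.println(runFromRecoveredCount(56, 100)); // prints 100
  }
}
```

With the counter starting at 0, the simulated total matches the 156 supersteps seen in the log. Starting from the recovered count stops at 100 as intended.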
> GraphJobRunner also has a similar problem. When a task is relaunched, the
> setup() method is called again. The code below should run only in the initial phase.
> {code}
> long startTime = System.currentTimeMillis();
> loadVertices(peer);
> LOG.info("Total time spent for loading vertices: "
>     + (System.currentTimeMillis() - startTime) + " ms");
> startTime = System.currentTimeMillis();
> countGlobalVertexCount(peer);
> LOG.info("Total time spent for broadcasting global vertex count: "
>     + (System.currentTimeMillis() - startTime) + " ms");
> startTime = System.currentTimeMillis();
> doInitialSuperstep(peer);
> LOG.info("Total time spent for initial superstep: "
>     + (System.currentTimeMillis() - startTime) + " ms");
> {code}
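One possible way to restrict these calls to the initial phase is to guard them with the superstep count. The sketch below is an assumption, not the attached patch: the class SetupGuardSketch is hypothetical, the long parameter stands in for peer.getSuperstepCount(), and the recorded strings stand in for the three GraphJobRunner calls. How a relaunched task restores its vertices is outside the scope of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class SetupGuardSketch {

  /**
   * Returns the names of the initialization steps that would run.
   * Assumption: a superstep count of 0 means a fresh start, and any
   * other value means the task was relaunched after a failure.
   */
  public static List<String> setup(long superstepCount) {
    List<String> executed = new ArrayList<>();
    if (superstepCount == 0) {
      executed.add("loadVertices");
      executed.add("countGlobalVertexCount");
      executed.add("doInitialSuperstep");
    }
    return executed;
  }

  public static void main(String[] args) {
    System.out.println(setup(0));  // fresh start: all three steps run
    System.out.println(setup(56)); // relaunched at superstep 56: nothing runs
  }
}
```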
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)