[ https://issues.apache.org/jira/browse/HAMA-387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107570#comment-13107570 ]
Edward J. Yoon commented on HAMA-387:
-------------------------------------
The job hangs again when testing the patch.
{code}
root@Cnode1:/usr/local/src/hama-trunk# core/bin/hama jar examples/target/hama-exampleSNAPSHOT.jar bench 160 10000 64
11/09/19 09:34:31 DEBUG bsp.BSPJobClient: BSPJobClient.submitJobDir: hdfs://hnode15:9/bsp/system/submit_z5c7vt
11/09/19 09:34:31 INFO bsp.BSPJobClient: Running job: job_201109190912_0005
11/09/19 09:34:34 INFO bsp.BSPJobClient: Current supersteps number: 0
11/09/19 09:34:40 INFO bsp.BSPJobClient: Current supersteps number: 1
11/09/19 09:34:43 INFO bsp.BSPJobClient: Current supersteps number: 3
11/09/19 09:34:46 INFO bsp.BSPJobClient: Current supersteps number: 5
11/09/19 09:34:49 INFO bsp.BSPJobClient: Current supersteps number: 6
11/09/19 09:34:52 INFO bsp.BSPJobClient: Current supersteps number: 8
11/09/19 09:34:55 INFO bsp.BSPJobClient: Current supersteps number: 10
11/09/19 09:34:58 INFO bsp.BSPJobClient: Current supersteps number: 12
11/09/19 09:35:01 INFO bsp.BSPJobClient: Current supersteps number: 13
11/09/19 09:35:04 INFO bsp.BSPJobClient: Current supersteps number: 14
----
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx
enterBarrier() list.size():45 children in the
list:[attempt_201109190912_0005_000020_0, attempt_201109190912_0005_000005_0,
attempt_201109190912_0005_000030_0, attempt_201109190912_0005_000021_0,
attempt_201109190912_0005_000023_0, attempt_201109190912_0005_000004_0,
attempt_201109190912_0005_000010_0, attempt_201109190912_0005_000014_0,
attempt_201109190912_0005_000015_0, attempt_201109190912_0005_000039_0,
attempt_201109190912_0005_000006_0, attempt_201109190912_0005_000007_0,
attempt_201109190912_0005_000019_0, attempt_201109190912_0005_000044_0,
attempt_201109190912_0005_000024_0, attempt_201109190912_0005_000013_0,
attempt_201109190912_0005_000025_0, attempt_201109190912_0005_000016_0,
attempt_201109190912_0005_000034_0, attempt_201109190912_0005_000042_0,
attempt_201109190912_0005_000026_0, attempt_201109190912_0005_000035_0,
attempt_201109190912_0005_000008_0, attempt_201109190912_0005_000018_0,
attempt_201109190912_0005_000033_0, attempt_201109190912_0005_000009_0,
attempt_201109190912_0005_000002_0, attempt_201109190912_0005_000041_0,
attempt_201109190912_0005_000036_0, attempt_201109190912_0005_000012_0,
attempt_201109190912_0005_000003_0, attempt_201109190912_0005_000011_0,
attempt_201109190912_0005_000038_0, attempt_201109190912_0005_000029_0,
attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000040_0,
attempt_201109190912_0005_000017_0, attempt_201109190912_0005_000043_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0, attempt_201109190912_0005_000031_0,
attempt_201109190912_0005_000037_0, attempt_201109190912_0005_000022_0,
attempt_201109190912_0005_000032_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====>
jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0 after
enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====>
jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0 after
enterBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====>
jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000005_0 before
leaveBarrier()
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx
leaveBarrier() list.size:11 children in the
list[attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000044_0,
attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0,
attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0,
attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000028_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,480 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx
enterBarrier() list.size():11 children in the
list:[attempt_201109190912_0005_000007_0, attempt_201109190912_0005_000044_0,
attempt_201109190912_0005_000018_0, attempt_201109190912_0005_000009_0,
attempt_201109190912_0005_000041_0, attempt_201109190912_0005_000003_0,
attempt_201109190912_0005_000011_0, attempt_201109190912_0005_000028_0,
attempt_201109190912_0005_000027_0, attempt_201109190912_0005_000000_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,617 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: =====>
jobid:job_201109190912_0005 taskid:attempt_201109190912_0005_000003_0 before
leaveBarrier()
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx
leaveBarrier() list.size:3 children in the
list[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx
enterBarrier() list.size():3 children in the
list:[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,661 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx
leaveBarrier() list.size:3 children in the
list[attempt_201109190912_0005_000028_0, attempt_201109190912_0005_000027_0,
attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000003_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx
leaveBarrier() list.size:1 children in the
list[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000001_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxx
enterBarrier() list.size():1 children in the
list:[attempt_201109190912_0005_000001_0]
2011-09-19 09:35:07,836 INFO org.apache.hama.bsp.TaskRunner:
attempt_201109190912_0005_000005_0 11/09/19 09:35:07 INFO bsp.BSPPeer: xxxxx
leaveBarrier() list.size:1 children in the
list[attempt_201109190912_0005_000001_0]
{code}
> Advanced Barrier Synchronization
> --------------------------------
>
> Key: HAMA-387
> URL: https://issues.apache.org/jira/browse/HAMA-387
> Project: Hama
> Issue Type: Improvement
> Components: bsp
> Affects Versions: 0.3.0
> Reporter: Edward J. Yoon
> Assignee: Edward J. Yoon
> Fix For: 0.4.0
>
> Attachments: HAMA-387_v02.patch, HAMA-387_v03.patch,
> HAMA-387_v04.patch, new.patch, sleepless.patch, x.PNG, x.patch
>
>
> I think the lock file must include:
> * the job ID
> * the task ID of the lock file owner
> * the current superstep count
> so that ownership and validity can be checked.
> Currently lock files are named by hostname, but multiple tasks may run on a
> single groomserver in the future.
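
As a rough illustration of the naming idea in the description above, here is a minimal sketch that encodes the job ID, owner task ID, and superstep count in a lock name and checks ownership and validity from it. The class `BarrierLockName`, the `-` separator, and the field order are assumptions made for this example only; they are not taken from Hama or from any of the attached patches.

```java
/**
 * Hypothetical helper (not part of Hama): a lock name that carries the
 * job ID, the owner's task ID, and the superstep count.
 * Assumed layout: <jobId>-<superstep>-<taskId>
 */
final class BarrierLockName {
    // Assumed separator; job and task IDs in the log use '_' but never '-'.
    private static final String SEP = "-";

    final String jobId;
    final String taskId;
    final long superstep;

    BarrierLockName(String jobId, String taskId, long superstep) {
        if (jobId.contains(SEP) || taskId.contains(SEP)) {
            throw new IllegalArgumentException("IDs must not contain '" + SEP + "'");
        }
        if (superstep < 0) {
            throw new IllegalArgumentException("superstep must be >= 0");
        }
        this.jobId = jobId;
        this.taskId = taskId;
        this.superstep = superstep;
    }

    /** Builds the lock name string. */
    String encode() {
        return jobId + SEP + superstep + SEP + taskId;
    }

    /** Parses a name produced by encode(); rejects anything else. */
    static BarrierLockName parse(String name) {
        String[] parts = name.split(SEP, -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Malformed lock name: " + name);
        }
        return new BarrierLockName(parts[0], parts[2], Long.parseLong(parts[1]));
    }

    /** Ownership check: same job and same task. */
    boolean isOwnedBy(String job, String task) {
        return jobId.equals(job) && taskId.equals(task);
    }

    /** Validity check: same job and the superstep the caller is currently in. */
    boolean isValidFor(String job, long currentSuperstep) {
        return jobId.equals(job) && superstep == currentSuperstep;
    }

    public static void main(String[] args) {
        BarrierLockName lock = new BarrierLockName(
            "job_201109190912_0005", "attempt_201109190912_0005_000001_0", 14);
        System.out.println(lock.encode());
        // A lock left over from superstep 14 is not valid for superstep 15.
        System.out.println(lock.isValidFor("job_201109190912_0005", 15));
    }
}
```

With names like this, two tasks on the same groomserver get distinct locks, and a task can tell a lock from an earlier superstep apart from one belonging to the current barrier.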