Is there an "Ignore because znode may be deleted." line just above the NoNodeException? That exception is logged as a warning and should not stop the computation.
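To illustrate why that warning should be harmless, here is a minimal sketch, not Hama's actual code. `FakeZk`, its `NoNodeException`, `deleteReady` and the znode path are hypothetical stand-ins that only mimic the race where two peers try to delete the same /ready znode; the real code uses the ZooKeeper client.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ReadyZnodeDemo {
    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a ZooKeeper client. */
    static class FakeZk {
        private final Set<String> znodes = ConcurrentHashMap.newKeySet();

        void create(String path) {
            znodes.add(path);
        }

        void delete(String path) throws NoNodeException {
            // remove() returns false when the znode is already gone.
            if (!znodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }
    }

    /** Returns true if this peer deleted the znode, false if another peer was faster. */
    static boolean deleteReady(FakeZk zk, String path) {
        try {
            zk.delete(path);
            return true;
        } catch (NoNodeException nne) {
            // Same idea as the warning in BSPPeer: log it and carry on.
            System.err.println("WARN Ignore because znode may be deleted. " + nne.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String ready = "/bsp/job_0001/0/ready"; // hypothetical path
        zk.create(ready);
        System.out.println(deleteReady(zk, ready)); // first peer removes the znode
        System.out.println(deleteReady(zk, ready)); // second peer only logs the warning
    }
}
```

In this sketch the second delete ends in the catch block and the caller continues, which is why the exception on its own would not explain a hang.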
I also tested in pseudo-distributed mode as below:

for((i=0;i<20;i++)) ; do hama jar hama-examples-0.4.0-incubating-SNAPSHOT.jar pi; done

It works OK. http://pastebin.com/CxGSfzHN

The log does contain the exception, but it doesn't cause the computation to hang. http://pastebin.com/5HVwx6A1

attempt_201109221848_0020_000000_0 11/09/22 18:57:37 WARN bsp.BSPPeer: Ignore because znode may be deleted.
2011-09-22 18:57:37,331 INFO org.apache.hama.bsp.TaskRunner: attempt_201109221848_0020_000000_0 org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /bsp/job_201109221848_0020/0/ready

Could you post the full log, and describe how the job is executed, the environment, etc.? Maybe the problem stems from somewhere else.

-----Original message-----
From:Thomas Jungblut <[email protected]>
To:[email protected],[email protected]
Date:Thu, 22 Sep 2011 10:43:13 +0200
Subject:Re: Awesome bench results after removing Thread.sleep in sync() method.

I think when just changing the log level, log4j will take care of the if(isEnabled) stuff, so we don't need to fragment our code.
Yes the current rev in trunk contains this snippet.
I give you the rest of the exception:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /bsp/job_201109220959_0001/224/ready
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
> at org.apache.hama.bsp.BSPPeer$1.process(BSPPeer.java:396)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Here is the part of the log of our zookeeper daemon:
> 2011-09-22 09:59:59,435 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
> KeeperException when processing sessionid:0x1329025208e0003 type:delete
> cxid:0xc01 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/bsp/job_201109220959_0001/222/ready Error:KeeperErrorCode = NoNode for
> /bsp/job_201109220959_0001/222/ready
> 2011-09-22 09:59:59,499 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
> KeeperException when processing sessionid:0x1329025208e0003 type:create
> cxid:0xc0e zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/bsp/job_201109220959_0001/223/ready Error:KeeperErrorCode = NodeExists
> for /bsp/job_201109220959_0001/223/ready
> 2011-09-22 09:59:59,627 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Got user-level
> KeeperException when processing sessionid:0x1329025208e0004 type:delete
> cxid:0xc22 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/bsp/job_201109220959_0001/224/ready Error:KeeperErrorCode = NoNode for
> /bsp/job_201109220959_0001/224/ready
> 2011/9/22 ChiaHung Lin <[email protected]>
> We might need to change log method by adding
>
> if(LOG.isInfoEnabled()){
> ...
> }
>
> at least it can prevent string concatenation for performance optimization.
> (debug can be changed to if(LOG.isDebugEnabled()){} for performance
> optimization, too.)
>
> In addition, can you help check if enterBarrier() contains the following
> code snippet?
>
> ...
> zk.exists(pathToSuperstepZnode+"/ready", new Watcher() {
>   @Override
>   public void process(WatchedEvent event) {
>     // check if /ready znode exists, then delete it.
>     ...
>     } catch(KeeperException.NoNodeException nne) {
>       LOG.warn("Ignore because znode may be deleted.", nne);
>     }...
>   }
> });
> zk.create(getNodeName(), null, Ids.OPEN_ACL_UNSAFE,
>     CreateMode.EPHEMERAL);
> ...
>
> It looks like bsp peer is trying to remove /ready znode which may have
> already been removed by other bsp peer. Or stack trace in log would be
> helpful.
>
>
> -----Original message-----
> From:Thomas Jungblut <[email protected]>
> To:[email protected]
> Date:Thu, 22 Sep 2011 10:05:52 +0200
> Subject:Re: Awesome bench results after removing Thread.sleep in sync()
> method.
>
> You're going to laugh, but we spend 80% of the time, logging the messages.
> Let's change the log level to debug or remove the logging in the bench
> example.
>
> Sadly I still receive
>
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> > NoNode for /bsp/job_201109220959_0001/224/ready
> >
>
> and it hangs forever. Current version is after you committed ChiaHung's
> patch.
> I'm in pseudo-distributed mode with 3 tasks.
>
> Are you going to bench this without the logging? That would be interesting
> though ;D
>
> 2011/9/22 Thomas Jungblut <[email protected]>
>
> > That is great. I think we can push this under 200s.
> > I attach a profiler and send you a list of hotspots.
> >
> > lg.
> >
> > 2011/9/22 Edward J. Yoon <[email protected]>
> >
> > By ChiaHung's HAMA-387.patch, hang problem is fixed.
> >>
> >> And also, on same environment (1 rack, 256 cores), a bench example
> >> result is dramatically improved. (184.076 seconds from 307.129
> >> seconds)
> >>
> >> ----
> >> # core/bin/hama jar
> >> examples/target/hama-examples-0.4.0-incubating-SNAPSHOT.jar bench 16
> >> 1000 512
> >> ..
> >> 11/09/22 10:27:32 INFO bsp.BSPJobClient: Current supersteps number: 504
> >> 11/09/22 10:27:35 INFO bsp.BSPJobClient: Current supersteps number: 508
> >> 11/09/22 10:27:38 INFO bsp.BSPJobClient: Current supersteps number: 512
> >> 11/09/22 10:27:38 INFO bsp.BSPJobClient: The total number of supersteps:
> >> 512
> >> Job Finished in 184.076 seconds
> >>
> >> Hama 0.4 (r.1163903) was:
> >>
> >> 16 bytes | 1000 | 512 | 307.129 seconds
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin
> >
> > mobile: 0170-3081070
> >
> > business: [email protected]
> > private: [email protected]
> >
> >
>
> --
> Thomas Jungblut
> Berlin
>
> mobile: 0170-3081070
>
> business: [email protected]
> private: [email protected]
>
>
> --
> ChiaHung Lin
> Department of Information Management
> National University of Kaohsiung
> Taiwan
>

--
Thomas Jungblut
Berlin

mobile: 0170-3081070

business: [email protected]
private: [email protected]

--
ChiaHung Lin
Department of Information Management
National University of Kaohsiung
Taiwan
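
On the if(LOG.isInfoEnabled()) question raised in this thread, a minimal sketch of what the guard buys. It is an assumption-laden illustration: it uses java.util.logging from the JDK instead of the commons-logging/log4j API in Hama, so the guard is spelled isLoggable(Level.INFO) rather than isInfoEnabled(), and `expensiveMessage`, `built`, `guarded` and `unguarded` are hypothetical names. The point it tries to show is that a disabled level suppresses the output, but the message argument is still built unless the call is guarded.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLoggingDemo {
    static final Logger LOG = Logger.getLogger("GuardedLoggingDemo");

    // Counts how often the message string is actually built.
    static int built = 0;

    static String expensiveMessage(int superstep) {
        built++;
        return "Current supersteps number: " + superstep;
    }

    static void unguarded(int superstep) {
        // The argument is evaluated before info() gets a chance to check the level.
        LOG.info(expensiveMessage(superstep));
    }

    static void guarded(int superstep) {
        // The level check comes first, so the string is only built when needed.
        if (LOG.isLoggable(Level.INFO)) {
            LOG.info(expensiveMessage(superstep));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.WARNING); // INFO is disabled from here on
        unguarded(1);
        System.out.println("messages built after unguarded call: " + built);
        guarded(2);
        System.out.println("messages built after guarded call: " + built);
    }
}
```

With INFO disabled, neither call prints a log line, but only the guarded one avoids building the string. Whether that cost matters for the bench example would still need the profiler to confirm.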
