Hi Eric,

Can you take a look at GIRAPH-747? I think it is related. We had some issues in the original pre-Hadoop-2.2 GA YARN profile that I had to shim. I think something changed in the switchover to 2.2, or on the non-YARN side, that made this an issue again. On YARN we produce an ApplicationMaster _and_ a master node, while non-YARN only needs an extra master node, and as I recall this is the underlying issue. The current proposed fix (as I read it) breaks non-YARN Giraph.
Mohammed, any input on this? If any non-YARN committers could take a peek at this email thread, at Eric's error, and at GIRAPH-747, and confirm my suspicion that this solution won't work as-is, that would be great.

Thanks all,
Eli

On Thu, Jan 30, 2014 at 9:49 AM, Eric Kimbrel <[email protected]> wrote:

> Hello, I am currently not a contributor to this project but have noticed
> an issue I wanted to report here instead of on the users mailing list.
>
> Using 1.1.0-SNAPSHOT built for PURE YARN and cdh5.0.0.
>
> I have an intermittent problem that, when it occurs, causes the job to
> stall after completion (but prior to vertices writing their output).
> Looking into the logs (posted below) I see that I go from 7 of 8 workers
> reporting completion to 9 of 8. The code in BspServiceMaster:1740 uses
> cleanedUpChildrenList.size() == maxTasks inside of a while true loop, so
> the job gets stuck here forever and will never progress again.
>
> I plan on changing this locally to a >= for my own use to prevent this
> problem, but I don't know how 9 of 8 is being reported and how this problem
> is really happening.
>
> Thanks for any ideas,
> Eric
>
>
> 14/01/30 09:35:37 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:37 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 1 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 2 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 5 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 5 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 6 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 6 nodes.
> 14/01/30 09:35:38 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanUpZooKeeper: Got 9 of 8 desired children from /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir
> 14/01/30 09:35:38 INFO master.BspServiceMaster: cleanedUpZooKeeper: Waiting for the children of /_hadoopBsp/giraph_yarn_application_1390861968364_0050/_cleanedUpDir to change since only got 9 nodes.
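
For anyone following the thread, here is a minimal sketch of why the wait loop Eric describes can stall. It is not the actual BspServiceMaster code: the class name CleanupWaitSketch and the method firstExit are hypothetical, and the list of counts is simply the sequence of "Got N of 8" values from the log above (1, 2, 5, 6, 9). It only compares the two exit conditions mentioned in the thread, `==` and `>=`, against that sequence. It says nothing about why a ninth child appears, and it is not an endorsement of `>=` as the right fix, since the concern about non-YARN Giraph still stands.

```java
import java.util.Arrays;
import java.util.List;

public class CleanupWaitSketch {
    /**
     * Hypothetical helper: walks the observed child counts in order and
     * returns the index of the first one that satisfies the exit condition,
     * or -1 if none does (where a real "while (true)" loop would keep waiting).
     */
    static int firstExit(List<Integer> observedCounts, int maxTasks, boolean exactMatch) {
        for (int i = 0; i < observedCounts.size(); i++) {
            int seen = observedCounts.get(i);
            // exactMatch mirrors the "==" comparison described in the thread;
            // otherwise use the ">=" comparison Eric plans to try locally.
            boolean done = exactMatch ? seen == maxTasks : seen >= maxTasks;
            if (done) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Counts taken from the "Got N of 8 desired children" log lines.
        List<Integer> counts = Arrays.asList(1, 2, 5, 6, 9);
        int maxTasks = 8;

        System.out.println("== exits at index: " + firstExit(counts, maxTasks, true));
        System.out.println(">= exits at index: " + firstExit(counts, maxTasks, false));
    }
}
```

Because the observed count jumps from 6 straight to 9, the exact-match condition never sees 8 and never exits, while the `>=` condition exits on the last observation. Whether 9 is a legitimate count on YARN (ApplicationMaster plus master plus workers) is the open question in GIRAPH-747.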
