[ 
https://issues.apache.org/jira/browse/GIRAPH-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843640#comment-13843640
 ] 

Alexandre Fonseca commented on GIRAPH-806:
------------------------------------------

Yes GIRAPH-811 covers a different problem entirely. This problem appears to be 
related with Netty. In particular the application seems to get stuck waiting 
for an opened channel to close in line 360 of NettyServer.java:

{code}ProgressableUtils.awaitChannelGroupFuture(accepted.close(), 
progressable);{code}

Unfortunately, I'm not familiar with Netty nor can I replicate the problem 
under a pure Yarn compilation. Were there no indications of other errors/client 
crashes that might have kept a channel indefinitely open?

Another thing that seems strange to me is, in ProgressableUtils.java's 
waitFor(...) function, the following flow:

{code}
    int currentWaitMsecs;
    while (true) {
      progressable.progress();
      currentWaitMsecs = Math.min(msecs, msecsPeriod);
      try {
        waitable.waitFor(currentWaitMsecs);
        if (waitable.isFinished()) {
          return waitable.getResult();
        }
      } 
      // (...)
      msecs = Math.max(0, msecs - currentWaitMsecs);
    }
{code}

Given that the function waitForever in line 136 of the same file is the one 
that calls waitFor() and supplies msecs = msecsPeriod = DEFUALT_MSEC_PERIOD = 
60 * 1000, then we'll have the following:

* First iteration
** currentWaitMsecs = msecs = 60 * 1000
** Wait for 60 seconds 
** msecs = Math.max(0, msecs - currentWaitMsecs) = Math.max(0, msecs - msecs) = 0
* Second (and following) iterations
** currentWaitMsecs = msecs = 0
** Wait for 0 seconds
** msecs = 0

So after 1 minute, if the Waitable we're waiting for still hasn't finished, we 
start effectively hammering it thousands of times per second? Not only is this 
a waste of CPU cycles, it might incur on some starvation issues which might 
keep the Waitable from finishing somehow?

> Application goes into infinite loop after success, waits for: Waiting for 
> org.apache.giraph.utils.ProgressableUtils
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-806
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-806
>             Project: Giraph
>          Issue Type: Bug
>          Components: examples, mapreduce
>    Affects Versions: 1.1.0
>            Reporter: Rafal Wojdyla
>
> Run example:
> {noformat}
> hadoop jar 
> giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2.0.2.0.6.0-76-jar-with-dependencies.jar
>  org.apache.giraph.GiraphRunner -D giraph.zkList="zkNode.net:2181" -D 
> yarn.app.mapreduce.am.log.level=DEBUG -D mapreduce.map.log.level=DEBUG -D 
> mapreduce.reduce.log.level=DEBUG 
> org.apache.giraph.examples.SimplePageRankComputation -w 1 -yh 1024 -yj 
> ./giraph-examples-1.1.0-SNAPSHOT-for-hadoop-2.2.0.2.0.6.0-76-jar-with-dependencies.jar
>  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat 
> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -mc 
> org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
>  -vip /user/rav/giraph/input/giraph.in -op /user/rav/giraph/output
> {noformat}
> After application successfully computer result it goes to infinite loop of:
> {noformat}
> 2013-11-28 20:09:51,764 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: setJobState: 
> {"_stateKey":"FINISHED","_applicationAttemptKey":-1,"_superstepKey":-1} on 
> superstep 31
> 2013-11-28 20:09:51,821 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanup: Notifying master its okay 
> to cleanup with /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir/0_master
> 2013-11-28 20:09:51,829 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Node 
> /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir already exists, no need to 
> create.
> 2013-11-28 20:09:51,831 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Got 1 of 2 
> desired children from /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir
> 2013-11-28 20:09:51,831 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanedUpZooKeeper: Waiting for 
> the children of /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir to change 
> since only got 1 nodes.
> 2013-11-28 20:09:51,909 INFO [main-EventThread] 
> org.apache.giraph.bsp.BspService: process: cleanedUpChildrenChanged signaled
> 2013-11-28 20:09:51,910 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanUpZooKeeper: Got 2 of 2 
> desired children from /_hadoopBsp/job_1385571160481_11413/_cleanedUpDir
> 2013-11-28 20:09:51,911 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanupZooKeeper: Removing the 
> following path and all children - /_hadoopBsp/job_1385571160481_11413 from 
> ZooKeeper list zkNode.net:2181
> 2013-11-28 20:09:52,329 INFO [main-EventThread] 
> org.apache.giraph.bsp.BspService: process: masterElectionChildrenChanged 
> signaled
> 2013-11-28 20:09:52,379 INFO [main-EventThread] 
> org.apache.giraph.bsp.BspService: process: cleanedUpChildrenChanged signaled
> 2013-11-28 20:09:52,447 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.master.BspServiceMaster: cleanup: Removed HDFS checkpoint 
> directory (_bsp/_checkpoints//job_1385571160481_11413) with return = false 
> since the job Giraph: org.apache.giraph.examples.SimplePageRankComputation 
> succeeded 
> 2013-11-28 20:09:52,449 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.comm.netty.NettyClient: stop: reached wait threshold, 1 
> connections closed, releasing NettyClient.bootstrap resources now.
> 2013-11-28 20:09:52,454 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.comm.netty.NettyServer: stop: Halting netty server
> 2013-11-28 20:10:52,463 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:11:52,463 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:12:52,463 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:13:52,464 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:14:52,464 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:15:52,464 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:16:52,465 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:17:52,465 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:18:52,465 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:19:52,466 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:20:52,466 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:21:52,466 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> 2013-11-28 20:22:52,467 INFO [org.apache.giraph.master.MasterThread] 
> org.apache.giraph.utils.ProgressableUtils: waitFor: Waiting for 
> org.apache.giraph.utils.ProgressableUtils$ExecutorServiceWaitable@2f368c5d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Reply via email to