We get a lot of these ChannelClosedExceptions as well, and we also have trouble reproducing them. So far "it just happens". What we have seen is that it happens a lot less often (number of incidents per build) on masters with fewer than 240 slaves connected at once. But there is also less building going on in general on those masters, so the number-of-slaves theory is not yet supported by solid evidence. We are starting by splitting the cluster up with one more master, and we are also working towards automatically disconnecting idle slaves and reconnecting them when needed. (That setting is very easy to enable in Jenkins already, but we have some auto-maintenance scripts that need to adapt to that kind of setup first.) We'll see if that helps.
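As an aside on the diagnosability gap described in the quoted message below: since InputStream.read() can only signal the end of the stream by returning -1, one possible approach is a wrapper that remembers why the transport ended, so that the reader can chain that reason as the cause of its EOFException. The following is only a minimal sketch of that idea, not code from hudson.remoting. All class and method names in it (CauseRecordingInputStream, recordTerminationCause, eofWithCause, readOrFail) are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical wrapper (NOT part of hudson.remoting): remembers why the
 * underlying transport ended so a higher layer can chain it as a cause.
 */
class CauseRecordingInputStream extends FilterInputStream {
    private volatile Throwable terminationCause;

    CauseRecordingInputStream(InputStream in) {
        super(in);
    }

    /** Assumed hook: a transport layer calls this when it learns why the link died. */
    void recordTerminationCause(Throwable cause) {
        terminationCause = cause;
    }

    Throwable getTerminationCause() {
        return terminationCause;
    }

    @Override
    public int read() throws IOException {
        try {
            return super.read();
        } catch (IOException e) {
            rememberIfFirst(e);
            throw e;
        }
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        try {
            return super.read(b, off, len);
        } catch (IOException e) {
            rememberIfFirst(e);
            throw e;
        }
    }

    /** Keeps the earliest failure; later ones are usually just consequences of it. */
    private void rememberIfFirst(IOException e) {
        if (terminationCause == null) {
            terminationCause = e;
        }
    }

    /** Builds the exception a reader would throw on seeing -1, chaining the recorded cause if any. */
    EOFException eofWithCause() {
        EOFException eof = new EOFException("unexpected end of stream");
        if (terminationCause != null) {
            eof.initCause(terminationCause);
        }
        return eof;
    }
}

class CauseRecordingDemo {
    /** Reads one byte; on end of stream, throws an EOFException carrying the recorded cause. */
    static int readOrFail(CauseRecordingInputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw in.eofWithCause();
        }
        return b;
    }

    public static void main(String[] args) throws IOException {
        // An empty stream stands in for a connection that has already been lost.
        CauseRecordingInputStream in =
                new CauseRecordingInputStream(new ByteArrayInputStream(new byte[0]));
        in.recordTerminationCause(new IOException("simulated transport failure"));
        try {
            readOrFail(in);
        } catch (EOFException e) {
            System.out.println(e + ", cause: " + e.getCause());
        }
    }
}
```

Whether something like this could be fitted into the real Channel or launcher code is an open question; the sketch only shows that the -1 return value and a recorded cause can be combined at the point where the EOFException is created.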
Robert Sandell
Software Tools Engineer - SW Environment and Product Configuration
Sony Mobile Communications

> -----Original Message-----
> From: [email protected] [mailto:jenkinsci-[email protected]]
> On Behalf Of Kohsuke Kawaguchi
> Sent: den 18 april 2013 01:49
> To: [email protected]
> Cc: hajush
> Subject: Re: Any ideas how to fix JENKINS-12235
>
>
> "hudson.remoting.ChannelClosedException: channel is already closed"
> indicates an unexpected loss of connection to the slave. The nested
> "Caused by: java.io.EOFException" indicates that the slave side has
> shut down the communication with the slave.
>
> The thing is, the communication to the slave (InputStream that Channel
> reads) is tunneled over several layers, and the way this part of the
> code discovers the problem is by InputStream.read() returning -1.
>
> This design of InputStream does not allow us to report the underlying
> cause of the communication problem through a chained exception, so we
> really can't properly report the root cause.
>
> The slave console log does normally capture the last dying message from
> the slave JVM or a transport level errors, but this gets rotated
> quickly as soon as the next connection attempt starts, and while on
> $JENKINS_HOME this file is still available, there's no way to look at
> this from the web UI. Jenkins does pretty aggressively auto-reconnect
> slaves that fail, and it takes some time for someone to notice a build
> failure by ChannelClosedException and try to understand what's going on,
> so that makes the trouble-shooting even more tricky.
>
> I was just sweeping the ssh-slaves plugin ticket backlog, and there are
> many reports of this same issue, so this clearly is a gap in the
> diagnosability of the slave connectivity.
>
> If anyone has a good idea of how to capture the errors, that'd be
> greatly appreciated.
>
> One approach that I think about is to introduce a proper log rotation
> mechanism (that handles LargeText.doProgressText() correctly), and
> somehow use that to let people scroll back the slave console log.
>
> Perhaps another possibility is to let the ComputerLauncher record a
> connection loss as an Exception on a failing Channel.
>
>
> On 04/17/2013 02:41 PM, hajush wrote:
> > The intermittent failure of slave jobs due to issue 12235
> > <https://issues.jenkins-ci.org/browse/JENKINS-12235> looks like it might
> > start undoing progress in getting my work teams to adopt Jenkins.
> >
> > Has anyone given any thought to the issue and how to address it? Some folks
> > had luck by increasing the ClientInterval on unix masters - but others did
> > not.
> >
> > I see that late last month Kohsuke increased the pipe window size in
> > hudson.remoting.Channel - though I'm not sure that would address this - and
> > since it's intermittent - it's hard to test. Here's what our stack trace
> > failure looks like.
> >
> > FATAL: Unable to delete script file c:\temp\hudson985794291407431615.bat
> > hudson.util.IOException2: remote file operation failed: c:\temp\hudson985794291407431615.bat at hudson.remoting.Channel@e553b0:vcvmwin061
> >     at hudson.FilePath.act(FilePath.java:848)
> >     at hudson.FilePath.act(FilePath.java:825)
> >     at hudson.FilePath.delete(FilePath.java:1202)
> >     at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:101)
> >     at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:60)
> >     at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
> >     at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:810)
> >     at hudson.model.Build$BuildExecution.build(Build.java:199)
> >     at hudson.model.Build$BuildExecution.doRun(Build.java:160)
> >     at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:592)
> >     at hudson.model.Run.execute(Run.java:1543)
> >     at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
> >     at hudson.model.ResourceController.execute(ResourceController.java:88)
> >     at hudson.model.Executor.run(Executor.java:236)
> > Caused by: hudson.remoting.ChannelClosedException: channel is already closed
> >     at hudson.remoting.Channel.send(Channel.java:494)
> >     at hudson.remoting.Request.call(Request.java:129)
> >     at hudson.remoting.Channel.call(Channel.java:672)
> >     at hudson.FilePath.act(FilePath.java:841)
> >
> > --
> > View this message in context: http://jenkins.361315.n4.nabble.com/Any-ideas-how-to-fix-JENKINS-12235-tp4663279.html
> > Sent from the Jenkins dev mailing list archive at Nabble.com.
> >
>
> --
> Kohsuke Kawaguchi | CloudBees, Inc. | http://cloudbees.com/
> Try Nectar, our professional version of Jenkins
>
> --
> You received this message because you are subscribed to the Google
> Groups "Jenkins Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
