|
The Jenkins Master reports that all of its JNLP slaves are offline (all of our nodes are JNLP slaves).
The nodes themselves report that they are connected. The only way out is to restart the Master. Of the 6 (or so) times this has occurred, about half the time all the slaves also needed their slave process restarted to recover.
We also see cases where Jenkins recovers for a short time after a restart and then the problem recurs. However, if it is still running 10 minutes after a restart, we seem to be fine for 3-4 days.
We were running on version 565 when this first occurred, and we had run fine on it for 3 months. What changed for us is that we increased the number of nodes: we now have 93 nodes, up from about 50. The number of jobs also increased.
We use the vSphere Cloud Plugin. As an experiment, we changed one slave to use ssh instead of jnlp. The problem was resolved for this slave: it is not disconnected when the problem occurs. We did not see the same improvement for a vsphere/jnlp slave that we recreated as a jnlp slave without the vsphere configuration.
This seems to be similar to:
https://issues.jenkins-ci.org/browse/JENKINS-24155
https://issues.jenkins-ci.org/browse/JENKINS-24050
https://issues.jenkins-ci.org/browse/JENKINS-22714
https://issues.jenkins-ci.org/browse/JENKINS-22932
https://issues.jenkins-ci.org/browse/JENKINS-23384
We have examined the VM logs, the network logs and the firewalls. There is no obvious issue.
I'm attaching the err.log from one of the incidents. Though it is clear that there is a problem with the slave connections, the log shows no clear cause.
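For anyone trying to catch the moment the Master marks the slaves offline, a small script along these lines might help. It is only a sketch: it assumes the Master's standard `/computer/api/json` endpoint is readable without authentication, and the function names and the base URL argument are made up for illustration.

```python
import json
import urllib.request


def offline_nodes(payload):
    """Return display names of the nodes the Master reports as offline.

    `payload` is the parsed JSON from <master>/computer/api/json.
    """
    return [
        c["displayName"]
        for c in payload.get("computer", [])
        if c.get("offline")
    ]


def fetch_computers(base_url):
    """Fetch the node list from the Master.

    `base_url` is a placeholder for the Master's URL. This assumes
    anonymous read access; a secured Master would need credentials.
    """
    url = base_url.rstrip("/") + "/computer/api/json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Running `offline_nodes(fetch_computers(...))` every minute or so and logging the result with a timestamp would give a time to line up against err.log and the network logs.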
|