Hi all,

Issue #1: Jobs that need more than one hour

Of the last 30 Trove-Gate<https://rdjenkins.dyndns.org/job/Trove-Gate/> builds 
(spanning three days), 7 have failed due to a Jenkins job-level timeout (not a 
proboscis timeout). These jobs had no failed tests when the timeout occurred.

Not having access to the job config to see what the job looks like, I used the 
console output to guess what was going on. It appears that a Jenkins plugin 
named 
boot-hpcloud-vm<https://github.com/mrhoades/boot-hpcloud-vm/blob/2272770b0ce54752eabb84229dc8939d79b2be50/models/boot_vm_concurrent.rb#L181>
 is booting a VM and running the commands given, including redstack int-tests. 
From the console output, it states that it was supplied with an 
ssh_shell_timeout="7200". This is passed down to another library called 
net-ssh-simple<https://github.com/busyloop/net-ssh-simple/blob/e3834f259a47606bfb06a487ca701fc20dbad8a5/lib/net/ssh/simple.rb#L632>.
 net-ssh-simple has two timeouts: an idle timeout and an operation timeout.

In the latest 
boot-hpcloud-vm<https://github.com/mrhoades/boot-hpcloud-vm/blob/2272770b0ce54752eabb84229dc8939d79b2be50/models/boot_vm_concurrent.rb#L182>,
 ssh_shell_timeout is passed down to net-ssh-simple for both the idle timeout 
and the operation timeout. But in older versions of 
boot-hp-cloud-vm<https://github.com/mrhoades/boot-hpcloud-vm/blob/9260e957d6c54142c33dd9e9632b86e17fd5c02f/models/boot_vm_concurrent.rb#L141>,
 ssh_shell_timeout is passed down to net-ssh-simple for only the idle timeout, 
leaving a default operation timeout of 3600. This is why I believe these jobs 
are failing after exactly one hour.

FYI: Here are the jobs that failed due to the Jenkins job-level timeout (and 
had no test failures when the timeout occurred) along with their associated 
patch sets:
https://rdjenkins.dyndns.org/job/Trove-Gate/2532/console 
(http://review.openstack.org/73786)
https://rdjenkins.dyndns.org/job/Trove-Gate/2530/console 
(http://review.openstack.org/73736)
https://rdjenkins.dyndns.org/job/Trove-Gate/2517/console 
(http://review.openstack.org/63789)
https://rdjenkins.dyndns.org/job/Trove-Gate/2514/console 
(https://review.openstack.org/50944)
https://rdjenkins.dyndns.org/job/Trove-Gate/2513/console 
(https://review.openstack.org/50944)
https://rdjenkins.dyndns.org/job/Trove-Gate/2504/console 
(https://review.openstack.org/73147)
https://rdjenkins.dyndns.org/job/Trove-Gate/2503/console 
(https://review.openstack.org/73147)

Suggested action items:

  *   If it is acceptable to have jobs that run over one hour, then install the 
latest boot-hpcloud-vm plugin for Jenkins which will increase the make the 
operation timeout match the idle timeout.

Issue #2: The running time of all jobs is 1 hr 1 min

While the Jenkins job-level timeout will end the job after one hour, it also 
appears to keep every job running for a minimum of one hour.  To be more 
precise, the timeout (or minimum running time) occurs on the part of the 
Jenkins job that runs commands on the VM; the VM provision (which takes about 
one minute) is excluded from this timeout which is why the running time of all 
jobs is around 1 hr 1 
min<https://rdjenkins.dyndns.org/job/Trove-Gate/buildTimeTrend>. A sampling of 
console logs showing the time the int-tests completed and when the timeout 
kicks in:

https://rdjenkins.dyndns.org/job/Trove-Gate/2531/console (00:01:03 wasted)

04:51:12 COMMAND_0: echo refs/changes/36/73736/2

...

05:50:10     335.41     proboscis.case.MethodTest (test_instance_created)
05:50:10     194.05     proboscis.case.MethodTest 
(test_instance_returns_to_active_after_resize)
05:51:13 **************************************
05:51:13 ****** STDERR-BEGIN ******

https://rdjenkins.dyndns.org/job/Trove-Gate/2521/console (00:06:44 wasted)

21:11:44 COMMAND_0: echo refs/changes/89/63789/13

...

22:05:00     195.11     proboscis.case.MethodTest 
(test_instance_returns_to_active_after_resize)
22:05:00     186.89     proboscis.case.MethodTest (test_resize_down)
22:11:44 **************************************
22:11:44 ****** STDERR-BEGIN ******


https://rdjenkins.dyndns.org/job/Trove-Gate/2518/consoleFull (00:06:01 wasted)

17:46:59 COMMAND_0: echo refs/changes/02/64302/20

...

18:40:57     210.03     proboscis.case.MethodTest 
(test_instance_returns_to_active_after_resize)
18:40:57     187.89     proboscis.case.MethodTest (test_resize_down)
18:46:58 **************************************
18:46:58 ****** STDERR-BEGIN ******


Suggested action items:

  *

Given that the minimum running time is one hour, I assume the problem is in the 
net-ssh-simple library. Needs more investigation.


Issue #3: Jenkins console log line timestamps different between full and 
truncated views

I assume this is due to 
JENKINS-17779<https://issues.jenkins-ci.org/browse/JENKINS-17779>.

Suggested action items:

  *   Upgrade the timestamper 
plugin<https://wiki.jenkins-ci.org/display/JENKINS/Timestamper>.
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to