Hi

1. To be frank, we don't know. The VMs aren't live-migrated, and the SAN hasn't 
been proven to be down or even the bottleneck, at least on standard hardware. 
Apple hardware is another matter: there the SAN seems to either slow down or 
possibly even break down occasionally. Some Macs slow down to the point that 
reading from or writing to the virtual hard drive, which goes through the SAN, 
drops to 10 MB/s. I did once see curl fail to download a tarball, but that was 
on a buggy version of 10.12; the problem behind it was fixed in 10.12.2. I 
can't confirm or deny whether it has happened since.

We also don't over-allocate resources. If we have a 20-core server, we assign 
only 5 × 4-vCPU VMs to it, even if hyperthreading makes the host appear to 
have 40 cores. However, I realized just a week ago that provisioning doesn't 
follow this "rule". If we have maxed out a server and we launch provisioning 
for a template that is assigned to that host, the VM being provisioned is 
launched there as well. And if we launch 2 or 3 branches simultaneously, we 
may end up with 6-8 VMs on that 20-core server. It _should_ cope with that 
easily, but in theory things start to slow down, because the VMware hypervisor 
reserves all the CPU cycles the guest might need, even if the guest really 
uses only one core: the hypervisor can't know whether the guest truly needs 
all 4. But even if we doubled the number of VMs on a host and all of them ran 
at 100%, that would only halve the CPU cycles available to each. Even then, an 
action that takes a split second won't take 2 seconds, let alone 2 minutes.
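
The overcommit arithmetic above can be put in a back-of-the-envelope sketch 
(assuming a perfectly fair hypervisor scheduler and all guests busy at 100%; 
`worst_case_slowdown` is an illustrative helper, not anything our tooling 
actually has):

```python
# Back-of-the-envelope worst-case slowdown from CPU overcommit,
# using the numbers from the example above.
physical_cores = 20
vcpus_per_vm = 4

def worst_case_slowdown(num_vms):
    """Slowdown factor if every vCPU is busy at once (fair scheduler assumed)."""
    allocated_vcpus = num_vms * vcpus_per_vm
    # No contention while allocation fits within the physical cores.
    return max(1.0, allocated_vcpus / physical_cores)

print(worst_case_slowdown(5))   # normal allocation: 5 x 4 vCPUs = 20, factor 1.0
print(worst_case_slowdown(10))  # doubled VMs: 40 vCPUs on 20 cores, factor 2.0
```

Even in the doubled case a one-second task becomes roughly two seconds, 
nowhere near two minutes, which is the point being made above.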

Two years ago we debugged one of these oddities, where network tests failed 
for no good reason. It looked like vSphere's underlying virtual LAN was 
dropping packets. However, while debugging we found out that it was in fact 
Qt's network code that was buggy. It had something to do with two IP packets 
arriving so close together that both ended up in our network code's buffer at 
once, but only the first one was ever handled; the second was discarded. At 
the time no one knew how to fix it, so it was left as-is. I still don't know 
if it has been fixed. Someone said the entire network code should be 
rewritten, because the current one is a mess.
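
As a rough illustration of that failure pattern, here's a hypothetical sketch 
(not Qt's actual networking code; `BuggyReceiver` and `FixedReceiver` are 
invented names): a handler woken once per readiness notification that consumes 
only one packet loses any second packet that arrived in the same window.

```python
from collections import deque

class BuggyReceiver:
    def __init__(self):
        self.buffer = deque()
        self.handled = []

    def packets_arrive(self, packets):
        # Two packets arriving back to back produce a single notification.
        self.buffer.extend(packets)
        self.on_ready_read()

    def on_ready_read(self):
        # Bug: handle only the first buffered packet...
        if self.buffer:
            self.handled.append(self.buffer.popleft())
        # ...and throw away whatever else was buffered.
        self.buffer.clear()

class FixedReceiver(BuggyReceiver):
    def on_ready_read(self):
        # Fix: drain the buffer completely on every notification.
        while self.buffer:
            self.handled.append(self.buffer.popleft())

buggy, fixed = BuggyReceiver(), FixedReceiver()
buggy.packets_arrive(["p1", "p2"])   # back-to-back packets, one wake-up
fixed.packets_arrive(["p1", "p2"])
print(buggy.handled)  # ['p1']        -> second packet silently lost
print(fixed.handled)  # ['p1', 'p2']
```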

We've also seen GUI tests failing where sleeps seem to help. Here something 
like waitUntilExposed (or whatever it's called) didn't work as expected on 
Macs. This caused problems once or twice in thousands of runs, making it look 
like something in the hardware was causing these flaky runs. It is still 
possible that only certain servers cause these small timing differences. Even 
likely, since we have lots of servers with several hardware generations 
between them, and reproducing these failures on one VM on one host turns out 
to be very difficult once you actually start debugging the tests. But we can't 
replace all the hardware we have in one go and throw away perfectly fine 
machines just because they're not identical to the next one. In an ideal world 
we'd have controlled runs on older and newer generations etc., but in reality 
I think we have to treat the hardware as a constant, unchanging factor beneath 
the virtualization layer. Broken hardware is another story, and for that we 
need to gather metrics to see what failed and where. If it's always the same 
hardware, and not even the same generation of hardware, then it's most likely 
a broken unit.
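
If fixed sleeps are papering over timing differences, polling with a deadline 
is usually the safer pattern: it returns as soon as the condition holds and 
only spends the full timeout in the failure case. A minimal sketch 
(`wait_until` is a hypothetical stand-in for helpers such as Qt Test's 
QTRY_VERIFY or qWaitForWindowExposed):

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline

# Usage: succeeds almost immediately once the window is "exposed",
# instead of always sleeping for a pessimistic fixed amount.
exposed_at = time.monotonic() + 0.1
assert wait_until(lambda: time.monotonic() >= exposed_at, timeout=2.0)
```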

With all that said, I didn't really provide you with any answers, did I? :P

Regards,
-Tony
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Marc Mutz
Sent: Thursday, 16 March 2017 11:01
To: [email protected]
Cc: Qt CI <[email protected]>
Subject: Need advice on acceptable timeouts for autotests

Hi,

We repeatedly have the problem that timeouts that developers think are ample 
(because they exceed typical runtime by, say, two orders of magnitude) are 
found to be insufficient on the CI.

Latest example: 
http://testresults.qt.io/coin/integration/qt/qtbase/tasks/1489618366

The timeout to run update-mime-database was recently increased to 2mins. But 
that still does not seem to be enough. For a call that hardly takes a second to 
run on unloaded machines.

We can of course crank up timeouts to insane amounts like 1h, but that means 
that a test will just sit there idling for an hour in the worst case.

I have two questions:

1. Where do these huge slowdowns come from? Is the VM live-migrated? Is the
   SAN, if any, down? At this point it looks like no overcommitting of CPU/RAM
   could ever explain how update-mime-database can take 2mins to run.

2. What should we choose as timeouts? I understand that tests which are stuck
   are killed after some time (how long?). Maybe timeouts should be set to the
   same value?

Thanks,
Marc

--
Marc Mutz <[email protected]> | Senior Software Engineer KDAB (Deutschland) 
GmbH & Co.KG, a KDAB Group Company
Tel: +49-30-521325470
KDAB - The Qt, C++ and OpenGL Experts
_______________________________________________
Development mailing list
[email protected]
http://lists.qt-project.org/mailman/listinfo/development
