Re: Raise test timeouts?
Hi All, if I am parsing this thread correctly, it seems we have a number of options to attack, and some are already progressing: tmp misconfig, docker misconfig, unmatched resources in different CI envs, no definition of minimal HW requirements, etc. But so far nothing has been raised against merging CASSANDRA-17729; in fact it has a +1 already, as tests seem to indicate it may reveal legit bugs. Correct me if I am wrong, but I will assume lazy consensus and merge by the end of the week if nobody objects. Given we're in the holiday season, I will have no problem reverting (it's quite easy, in fact) if I missed something.

Regards

On 7/7/22 22:44, Mick Semb Wever wrote:
>> However, the docker space issue needs to be resolved first since we don't have the capacity to experiment with those nodes out of commission.
>
> ETA on fixing the docker space issues is this/next week. Once that lands we can take a look at the abnormal CPU usage on some nodes.
Re: Raise test timeouts?
> However, the docker space issue needs to be resolved first since we don't have the capacity to experiment with those nodes out of commission.

ETA on fixing the docker space issues is this/next week. Once that lands we can take a look at the abnormal CPU usage on some nodes.
Re: Raise test timeouts?
> Having parity between CI systems is important, no matter how we approach it.

How much does the hardware allocation (cpu, memory, disk throughput, network throughput) differ between ASF Jenkins and Circle midres? How much does the container isolation differ? I.e., why are we seeing bugged tests that flake out in ASF that don't fail in Circle midres, for example?

On Wed, Jul 6, 2022, at 1:31 PM, Mick Semb Wever wrote:
>> What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?
>
> I'd say it is a bug in the test if we can't distinguish between the test failing and the test not completing/crashing. How much time folk want to spend on the different test frameworks we have to improve such things (on a distributed system), or what the expected time saving such improvements would provide, I leave to others. I appreciate how demotivating it is.
>
> Having parity between CI systems is important, no matter how we approach it.
Re: Raise test timeouts?
> What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?

I'd say it is a bug in the test if we can't distinguish between the test failing and the test not completing/crashing. How much time folk want to spend on the different test frameworks we have to improve such things (on a distributed system), or what the expected time saving such improvements would provide, I leave to others. I appreciate how demotivating it is.

Having parity between CI systems is important, no matter how we approach it.
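Mick's distinction above, a test that *fails* versus a test that *never completes*, can be sketched outside any particular framework. The following is an illustrative, hypothetical harness (class and method names are mine, not Cassandra's): it runs a test body on a daemon thread under a deadline, so a hang surfaces as its own outcome instead of being conflated with a real assertion failure or a CI-level kill of the whole run.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutVsFailure {
    enum Outcome { PASSED, FAILED, TIMED_OUT }

    // Run a test body under a deadline so a hang is reported as TIMED_OUT
    // rather than being indistinguishable from an assertion failure.
    static Outcome run(Runnable testBody, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // don't let a hung test body keep the JVM alive
            return t;
        });
        try {
            Future<?> result = pool.submit(testBody);
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.PASSED;
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT; // never completed: environment/provisioning suspect
        } catch (ExecutionException e) {
            return Outcome.FAILED;    // completed with an error: more likely a real bug
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.TIMED_OUT;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(() -> {}, 1000));                                      // PASSED
        System.out.println(run(() -> { throw new AssertionError("boom"); }, 1000));   // FAILED
        System.out.println(run(() -> { while (true) { Thread.onSpinWait(); } }, 200)); // TIMED_OUT
    }
}
```

With three distinct outcomes, CI can treat a TIMED_OUT on an under-provisioned node differently from a FAILED, which is the crux of the raise-the-timeouts discussion.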
Re: Raise test timeouts?
Just wanted to bring up that we actually started seeing a trend pre-4.0, and it keeps showing up now on the way to 4.1: legit bugs are found more often in CircleCI when they do not pop up at all in Jenkins. So my appeal is to keep checking CircleCI thoroughly as well, even if some failures are not visible in Butler.

On Wed, 6 Jul 2022 at 11:27, Josh McKenzie wrote:
> Bringing discussion from JIRA (CASSANDRA-17729) to here:
>
> Mick said:
>> Agree with the notion that Jenkins (lower resources/more contention) is better at exposing flakies, but that there's a trade-off between encouraging flakies and creating difficult-to-deal-with noise.
>
> I come back to the question: what minimum spec of hardware do we want to support for C*, and how can we best configure our CI infrastructure to be representative of that? Given the complexity and temporal relationships w/multiple actors in a distributed system, there's *always* going to be "defects" that show up if you sufficiently under-provision a host. That doesn't necessarily mean it's a user-facing bug that needs to be fixed.
>
> What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?
>
> And if the angle is more "the test isn't deterministic and fails on under-provisioned hosts; there's a bug *in the test*", well, that's just our lives. We have a lot of technical debt in the form of brittle non-deterministic tests we'd have to target excising to get past this if we keep our container provisioning where it is.
>
> If in the lead up to 4.0 we saw a sub 20% hit rate in product defects from flaky tests vs. test environment flakes alone, we have to consider how much effort from how many engineers it's taking in the run up to a release to hammer all these "flaky due to provisioning" tests back down vs. using other methodologies of testing to uncover correctness defects in timing, schema propagation, consistency level guarantees, etc.
>
> On Wed, Jul 6, 2022, at 10:43 AM, Brandon Williams wrote:
>> I suspect there's another problem with some of the Jenkins nodes where the system CPU usage is high and drives the load much higher than other nodes, possibly causing timeouts. However, the docker space issue needs to be resolved first since we don't have the capacity to experiment with those nodes out of commission.
>>
>> On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie wrote:
>>> Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).
>>>
>>> As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.
>>>
>>> There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
>>>
>>> On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
>>> Hi All,
>>>
>>> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML for visibility as this has been a discussion point with some of you.
>>>
>>> I noticed tests timeout much more on jenkins than circle. I was wondering if legit bugs were hiding behind those timeouts and it might be the case. Feel free to jump in the ticket :-)
>>>
>>> Regards
Re: Raise test timeouts?
Bringing discussion from JIRA (CASSANDRA-17729) to here:

Mick said:
> Agree with the notion that Jenkins (lower resources/more contention) is better at exposing flakies, but that there's a trade-off between encouraging flakies and creating difficult-to-deal-with noise.

I come back to the question: what minimum spec of hardware do we want to support for C*, and how can we best configure our CI infrastructure to be representative of that? Given the complexity and temporal relationships w/multiple actors in a distributed system, there's *always* going to be "defects" that show up if you sufficiently under-provision a host. That doesn't necessarily mean it's a user-facing bug that needs to be fixed.

What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?

And if the angle is more "the test isn't deterministic and fails on under-provisioned hosts; there's a bug *in the test*", well, that's just our lives. We have a lot of technical debt in the form of brittle non-deterministic tests we'd have to target excising to get past this if we keep our container provisioning where it is.

If in the lead up to 4.0 we saw a sub 20% hit rate in product defects from flaky tests vs. test environment flakes alone, we have to consider how much effort from how many engineers it's taking in the run up to a release to hammer all these "flaky due to provisioning" tests back down vs. using other methodologies of testing to uncover correctness defects in timing, schema propagation, consistency level guarantees, etc.

On Wed, Jul 6, 2022, at 10:43 AM, Brandon Williams wrote:
> I suspect there's another problem with some of the Jenkins nodes where the system CPU usage is high and drives the load much higher than other nodes, possibly causing timeouts. However, the docker space issue needs to be resolved first since we don't have the capacity to experiment with those nodes out of commission.
>
> On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie wrote:
>> Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).
>>
>> As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.
>>
>> There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
>>
>> On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
>> Hi All,
>>
>> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML for visibility as this has been a discussion point with some of you.
>>
>> I noticed tests timeout much more on jenkins than circle. I was wondering if legit bugs were hiding behind those timeouts and it might be the case. Feel free to jump in the ticket :-)
>>
>> Regards
Re: Raise test timeouts?
I suspect there's another problem with some of the Jenkins nodes where the system CPU usage is high and drives the load much higher than other nodes, possibly causing timeouts. However, the docker space issue needs to be resolved first since we don't have the capacity to experiment with those nodes out of commission.

On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie wrote:
> Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).
>
> As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.
>
> There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
>
> On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> Hi All,
>
> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML for visibility as this has been a discussion point with some of you.
>
> I noticed tests timeout much more on jenkins than circle. I was wondering if legit bugs were hiding behind those timeouts and it might be the case. Feel free to jump in the ticket :-)
>
> Regards
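For the docker space issue mentioned above, the standard docker CLI has built-in commands for inspecting and reclaiming disk on an agent. This is only a hedged sketch of the general approach; what Infra actually does on the ASF nodes may differ.

```shell
# Inspect what is eating disk on a node, then reclaim unused data.
# Standard docker CLI. NOTE: the prune flags below delete ALL unused
# images, containers, networks, and volumes, so use with care on
# shared CI agents.
docker system df                    # per-category disk usage summary
docker system prune -af --volumes   # remove all unused data, incl. volumes
```

Scheduling something like the prune as a periodic job is a common way to keep CI agents from filling their disks between builds.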
Re: Raise test timeouts?
Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).

As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.

There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.

On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> Hi All,
>
> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML for visibility as this has been a discussion point with some of you.
>
> I noticed tests timeout much more on jenkins than circle. I was wondering if legit bugs were hiding behind those timeouts and it might be the case. Feel free to jump in the ticket :-)
>
> Regards
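The "more resources per agent, fewer agents per host" idea above maps to standard docker resource flags. A hedged sketch; the image name and the sizing numbers are illustrative assumptions, not the project's actual agent setup:

```shell
# Run fewer, beefier agent containers per host instead of many
# under-provisioned ones sharing the same CPU and disk.
# --cpus/--memory are standard docker flags; "ci-agent:latest" is a
# hypothetical image name used only for illustration.
docker run --detach \
  --cpus=4 \
  --memory=8g \
  --memory-swap=8g \
  ci-agent:latest
```

Capping memory+swap at the same value as memory avoids the container swapping itself into the kind of pathological slowness that shows up as test timeouts.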