> on our internal CI system
Some more context:

This environment adheres to the requirements we laid out in pre-commit CI on 
Cassandra 
<https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit>
 with a couple of required differences. We don't yet include the resource 
restriction detail in the test report; it's on my backlog of things to do, but 
I can confirm that each test suite is allocated less CPU than, and at most as 
much memory as, it gets in ASF CI. I also had to go the route of extracting a 
blend of what's in circle and what's in ASF CI (in terms of test suites, 
filtering, etc.) since neither represented a complete view of our CI ecosystem; 
each environment currently runs things the other does not.

I've been tracking the upstreaming of that declarative combination in 
CASSANDRA-18731, but some other priorities have taken the front seat (namely 
getting a new CI system based on it working, since neither upstream ASF CI nor 
circle is re-usable in its current form); I'll be upstreaming that ASAP. 
https://issues.apache.org/jira/browse/CASSANDRA-18731

I've left a pretty long comment on CASSANDRA-18731 about the structure of 
things and where my opinion falls; *I think we need a separate DISCUSS thread 
on the ML about CI and what we require for pre-commit smoke suites*: 
https://issues.apache.org/jira/browse/CASSANDRA-18731?focusedCommentId=17790270&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17790270

The TL;DR:
> With an *incredibly large* patch in the form of TCM (88k+ LoC, 900+ files 
> touched), we have less than a 0.002% test failure injection rate using the 
> above restricted smoke heuristic, and many of them look to be CircleCI env 
> specific and not ASF CI.

From a cursory inspection it looks like most of the breakages being tracked on 
the ticket Sam linked for TCM are likely to be circle env specific (the new 
*nix-optimized deletion having a race, OOMs, etc.). The TCM merge is actually a 
great forcing function for us to surface anything env specific in terms of 
timing and resourcing up-front; I'm glad we have this opportunity, but it's 
unfortunate that it's been interpreted as merging without passing CI as opposed 
to having some env-difference specific kinks to work out.

*This was an incredibly huge merge.* For comparison, I just did a --stat on the 
merge for CASSANDRA-8099:
> 645 files changed, 49381 insertions(+), 42227 deletions(-)

TCM from the C* repo:
>  934 files changed, 66185 insertions(+), 21669 deletions(-)
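
For anyone curious how to reproduce that kind of comparison, a rough sketch in 
Python (assumptions on my part: a local clone of the repo, git on the path, and 
that a first-parent diff of the merge commit is a fair stand-in for the --stat 
I ran; the commit hash below is a placeholder):

    # Print git's "N files changed, X insertions(+), Y deletions(-)" summary
    # for a merge commit, measured against its first parent.
    import subprocess

    def merge_shortstat(commit: str, repo: str = ".") -> str:
        result = subprocess.run(
            ["git", "-C", repo, "diff", "--shortstat", f"{commit}^1", commit],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(merge_shortstat("HEAD"))  # substitute the actual merge commit hash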

My gut tells me it's basically impossible to have a merge of this size that 
doesn't disrupt what it's merging into, or the authors just end up slowly dying 
in rebase hell. Or both. This was a massive undertaking and, compared to our 
past on this project, it has had an incredibly low impact on the target it was 
merged into, and the authors are rapidly burning down the failures.

To the authors - great work, and thanks for being so diligent on following up 
on any disruptions this body of work has caused to other contributors' 
environments.

To the folks who were disrupted - I get it. This is deeply frustrating; green 
CI has long been a white whale for many of us, and having something merge over 
a US holiday week, on an incredibly active project where we don't all have time 
to keep up with everything, can make things like this feel like a huge 
surprise. It's incredibly unfortunate that the timing of our transition to this 
new CI system, and working out its kinks, coincided with this behemoth of a 
merge needing to come through, but there's a silver lining.

We're making great strides. Let's not lose sight of our growth because of the 
pain of the moment.

~Josh

p.s. - for the record, I don't think we should hold off on merging things just 
because some folks are on holiday. :)

On Mon, Nov 27, 2023, at 3:38 PM, Sam Tunnicliffe wrote:
> I ought to clarify, we did actually have green CI modulo 3 flaky tests on our 
> internal CI system. I've attached the test artefacts to CASSANDRA-18330 
> now[1][2]: 2 of the 3 failures are upgrade dtests, with 1 other python dtest 
> failure noted. None of these were reproducible in a dev setup, so we 
> suspected them to be environmental and intended to merge before returning to 
> confirm that. The "known" failures that we mentioned in the email that 
> started this thread were ones observed by Mick running the cep-21-tcm branch 
> through Circle before merging.  
> 
> As the CEP-21 changeset was approaching 88k LoC touching over 900 files, 
> permanently rebasing as we tried to eradicate every flaky test was simply 
> unrealistic, especially as other significant patches continued to land in 
> trunk. With that in mind, we took the decision to merge so that we could 
> focus on actually removing any remaining instability.
> 
> [1] https://issues.apache.org/jira/secure/attachment/13064727/ci_summary.html
> [2] 
> https://issues.apache.org/jira/secure/attachment/13064728/result_details.tar.gz
> 
> 
>> On 27 Nov 2023, at 10:28, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> I have written this email like 10 times before sending it, and I can't 
>> manage to avoid giving it a negative spin. So pardon my English or poor 
>> choice of words in advance, and try to read it in a positive way.
>> 
>> It is really demotivating to me to see things getting merged without green 
>> CI. I had to go through a herculean effort and pain (at least to me) to 
>> keep rebasing the TTL patch continuously (a huge one imo), when it would 
>> have been altogether much easier to merge first, then fix and add 
>> downgradability afterwards, along the lines of the TCM merge.
>> 
>> If this merge-then-fix approach is a thing, I would like it clarified so we 
>> can all benefit from it and avoid the big-patch rebase pain.
>> 
>> Regards
>> 
>> On 27/11/23 10:38, Jacek Lewandowski wrote:
>>> Hi,
>>> 
>>> I'm happy to hear that the feature got merged. Though, I share Benjamin's 
>>> worries about that being a bad precedent.
>>> 
>>> I don't think it makes sense to do repeated runs in this particular case. 
>>> Detecting flaky tests would not prove anything; they can be caused by this 
>>> patch, but we would not know that for sure. We would have to have a similar 
>>> build with the same tests repeated to compare. It would take time and 
>>> resources, and in the end, we will have to fix those flaky tests regardless 
>>> of whether they were caused by this change. IMO, it makes sense to do a 
>>> repeated run of the new tests, though. Aside from that, we can also 
>>> consider making it easier and more automated for the developer to determine 
>>> whether a particular flakiness comes from a feature branch one wants to 
>>> merge.
>>> thanks,
>>> Jacek
>>> 
>>> 
>>> pon., 27 lis 2023 o 10:15 Benjamin Lerer <ble...@apache.org> napisał(a):
>>>> Hi,
>>>> 
>>>> I must admit that I have been surprised by this merge and this following 
>>>> email. We had lengthy discussions recently and the final agreement was 
>>>> that the requirement for a merge was a green CI.
>>>> I could understand that, for some reason, we as a community might wish to 
>>>> make some exceptions. In this present case, there was no official 
>>>> discussion asking for an exception.
>>>> I believe that this merge creates a bad precedent where anybody can feel 
>>>> entitled to merge without a green CI and disregard any previous community 
>>>> agreement.
>>>> 
>>>> Le sam. 25 nov. 2023 à 09:22, Mick Semb Wever <m...@apache.org> a écrit :
>>>>> 
>>>>> Great work Sam, Alex & Marcus!
>>>>>  
>>>>>> There are about 15-20 flaky or failing tests in total, spread over 
>>>>>> several test jobs[2] (i.e. single digit failures in a few of these). We 
>>>>>> have filed JIRAs for the failures and are working on getting those fixed 
>>>>>> as a top priority. CASSANDRA-19055[3] is the umbrella ticket for this 
>>>>>> follow up work.
>>>>>> 
>>>>>> There are also a number of improvements we will work on in the coming 
>>>>>> weeks, we will file JIRAs for those early next week and add them as 
>>>>>> subtasks to CASSANDRA-19055.
>>>>> 
>>>>> 
>>>>> Can we get these tests temporarily annotated as skipped while all the 
>>>>> subtickets of 19055 are being worked on? 
>>>>> 
>>>>> As we have seen from CASSANDRA-18166 and CASSANDRA-19034 there's a lot of 
>>>>> overhead now on 5.0 tickets having to navigate around these failures in 
>>>>> trunk CI runs.
>>>>> 
>>>>> Also, we're still trying to figure out how to do repeated runs for a 
>>>>> patch so big… (the list of touched tests was too long for CircleCI, I 
>>>>> need to figure out what the limit is and chunk it into separate CircleCI 
>>>>> configs) … and it probably makes sense to wait until most of 19055 is 
>>>>> done (or tests are temporarily annotated as skipped).
>>>>> 
>>>>> 
