Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-28 Thread Ekaterina Dimitrova
I hate to say it, but I was disappointed that this email thread was started
only after the TCM work had already been committed. Especially since we even
have an epic of patches spread around the codebase that are waiting on TCM to
be committed first, so as not to disturb its rebase. The latest commits in
trunk were relatively isolated: we could have committed the last outstanding
tickets that are part of the push for the next release and still made an
exception for the TCM merge after that. All it would have taken is a
constructive approach and some visibility.

On a positive note, thank you to everyone who worked and keeps working on
TCM! This is an extreme effort!! I'm super excited!

"Can we get these tests temporarily annotated as skipped while all the
subtickets to 19055 are being worked on"

I am against that. Considering I can see Sam, Alex, and Marcus working
around the clock to solve any test-related tickets, I think there is no
need to do this. Also, in general, ignoring tests carries risks: something
being forgotten, a ticket being closed by mistake, and new types of
failures being missed.

"it makes sense to do a repeated run of the new tests."
I do agree with Jacek here.

"I also had to go the route of extracting a blend of what's in circle and
what's in ASF CI (in terms of test suites, filtering, etc) since neither
represented a complete view of our CI ecosystem;"
There is a single test suite in CircleCI, and the packaging needs to be
included.

"From a cursory inspection it looks like most of the breakages being
tracked on the ticket Sam linked for TCM are likely to be circle env
specific (new *nix optimized deletion having a race, OOM's, etc). The TCM
merge is actually a great forcing function for us to surface anything env
specific in terms of timing and resourcing up-front;"
I spent my afternoon triaging the tickets and doing repeated runs on the
reported failures in CASSANDRA-19055, as I do not believe we can/should
blindly blame all new failures/flakies on TCM. As it turns out, all the
tickets I triaged were accurately assigned to TCM, and I don't think there
were more than 1 or 2 OOM tickets.

"My gut tells me it's basically impossible to have a merge of this size
that doesn't disrupt what it's merging into"
I agree with this. There is no way that everything will end smoothly and
perfectly with such extensive work. But we should all recognize the
excellent work that Sam, Alex, and Marcus are doing here to identify and
fix outstanding issues in the past two days. Thank you! I am sure everyone
appreciates that! And I would like to appeal to the community members to
keep triaging and bisecting any new test flakies/failures as we were doing
and not just blindly assign everything to the TCM follow-up epic. We should
be constructive.

"for the record, I don't think we should hold off on merging things just
because some folks are on holiday. :)"

I do not believe anyone advocates for that. I personally even often commit
disruptive patches on the weekend; that gives me time to deal with the
fallout and reduces the potential disruption to people's work during office
hours. My main concern is what Benjamin and Jacek already mentioned: the
precedent being created here after all the discussions that happened.



Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-27 Thread Josh McKenzie
> on our internal CI system
Some more context:

This environment adheres to the requirements we laid out in "pre-commit CI on 
Cassandra", with a couple of required differences. We don't yet include the 
resource restriction detail in the test report; it's on my backlog of things to 
do, but I can confirm that less CPU and at most the equivalent of ASF CI memory 
is being allocated for each test suite. I also had to go the route of extracting 
a blend of what's in circle and what's in ASF CI (in terms of test suites, 
filtering, etc.) since neither represented a complete view of our CI ecosystem; 
there are currently things executed in either environment that are not executed 
in the other.

I've been tracking the upstreaming of that declarative combination in 
CASSANDRA-18731 but have had some other priorities take the front seat (i.e. 
getting a new CI system based on it working, since neither upstream ASF CI nor 
circle is re-usable in its current form), and will be upstreaming that ASAP. 
https://issues.apache.org/jira/browse/CASSANDRA-18731

I've left a pretty long comment on CASSANDRA-18731 about the structure of 
things and where my opinion falls; *I think we need a separate DISCUSS thread 
on the ML about CI and what we require for pre-commit smoke suites*: 
https://issues.apache.org/jira/browse/CASSANDRA-18731?focusedCommentId=17790270&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17790270

The TL;DR:
> With an *incredibly large* patch in the form of TCM (88k+ LoC, 900+ files 
> touched), we have less than a .002% test failure injection rate using the 
> above restricted smoke heuristic, and many of them look to be circle ci env 
> specific and not asf ci.

From a cursory inspection it looks like most of the breakages being tracked on 
the ticket Sam linked for TCM are likely to be circle env specific (new *nix 
optimized deletion having a race, OOMs, etc). The TCM merge is actually a 
great forcing function for us to surface anything env specific in terms of 
timing and resourcing up-front; I'm glad we have this opportunity but it's 
unfortunate that it's been interpreted as merging w/out passing CI as opposed 
to having some env-difference specific kinks to work out.

*This was an incredibly huge merge.* For comparison, I just did a --stat on the 
merge for CASSANDRA-8099:
> 645 files changed, 49381 insertions(+), 42227 deletions(-)

TCM from the C* repo:
>  934 files changed, 66185 insertions(+), 21669 deletions(-)
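For anyone wanting to reproduce that kind of comparison, here is a minimal, self-contained sketch of sizing a merge by diffing it against its first parent (demonstrated on a throwaway repo; against a real apache/cassandra clone you would point the same final command at the merge commit in question):

```shell
# Sketch: measuring a merge's footprint, as in the --stat numbers above.
# Built on a throwaway repo so it is self-contained.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q -b main repo && cd repo   # -b requires git >= 2.28
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m base
git checkout -q -b feature
echo 'change' > feature.txt
git add feature.txt
git -c user.email=a@b -c user.name=demo commit -q -m feature
git checkout -q main
git -c user.email=a@b -c user.name=demo merge -q --no-ff feature -m merge
# Summary line: files changed / insertions / deletions for the merge,
# computed against the merge commit's first parent.
git diff --stat HEAD^1 HEAD | tail -1
```

On the real repo, `git diff --stat <merge>^1 <merge>` yields the same kind of one-line summary the file/insertion/deletion counts above came from.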
My gut tells me it's basically impossible to have a merge of this size that 
doesn't disrupt what it's merging into, or the authors just end up slowly dying 
in rebase hell. Or both. This was a massive undertaking and, compared to our 
past on this project, it has had an incredibly low impact on the target it was 
merged into, and the authors are rapidly burning down failures.

To the authors - great work, and thanks for being so diligent on following up 
on any disruptions this body of work has caused to other contributors' 
environments.

To the folks who were disrupted - I get it. This is deeply frustrating; green 
CI has long been a white whale for many of us, and having something merge over 
a US holiday week, on an incredibly active project where we don't all have time 
to keep up with everything, can make things like this feel like a huge surprise. 
It's unfortunate that the timing of our transition to this new CI system, and 
of working out its kinks, coincided with this behemoth of a merge needing to 
come through, but there is a silver lining.

We're making great strides. Let's not lose sight of our growth because of the 
pain in the moment of it.

~Josh

p.s. - for the record, I don't think we should hold off on merging things just 
because some folks are on holiday. :)


Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-27 Thread Sam Tunnicliffe
I ought to clarify, we did actually have green CI modulo 3 flaky tests on our 
internal CI system. I've attached the test artefacts to CASSANDRA-18330 
now[1][2]: 2 of the 3 failures are upgrade dtests, with 1 other python dtest 
failure noted. None of these were reproducible in a dev setup, so we suspected 
them to be environmental and intended to merge before returning to confirm 
that. The "known" failures that we mentioned in the email that started this 
thread were ones observed by Mick running the cep-21-tcm branch through Circle 
before merging.  

As the CEP-21 changeset was approaching 88k LoC touching over 900 files, 
permanently rebasing as we tried to eradicate every flaky test was simply 
unrealistic, especially as other significant patches continued to land in 
trunk. With that in mind, we took the decision to merge so that we could focus 
on actually removing any remaining instability.

[1] https://issues.apache.org/jira/secure/attachment/13064727/ci_summary.html
[2] 
https://issues.apache.org/jira/secure/attachment/13064728/result_details.tar.gz




Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-27 Thread Berenguer Blasi

Hi,

I have written this email like 10 times before sending it and I can't 
manage to avoid giving it a negative spin. So pardon my English or poor 
choice of words in advance, and please try to read it in a positive way.


It is really demotivating to me seeing things getting merged without 
green CI. I had to go through a herculean effort and pain (at least to 
me) to keep rebasing the TTL patch continuously (a huge one imo) when it 
would have been altogether much easier to merge, then post-fix and 
post-add downgradability, along the lines of the TCM merge.


If this merge-then-fix approach is a thing, I would like it clarified so 
we can all benefit from it and avoid the big-patch rebase pain.


Regards


Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-27 Thread Jacek Lewandowski
Hi,

I'm happy to hear that the feature got merged. Though, I share Benjamin's
worries about that being a bad precedent.

I don't think it makes sense to do repeated runs in this particular case.
Detecting flaky tests would not prove anything; they can be caused by this
patch, but we would not know that for sure. We would have to have a similar
build with the same tests repeated to compare. It would take time and
resources, and in the end, we will have to fix those flaky tests regardless
of whether they were caused by this change. IMO, it makes sense to do a
repeated run of the new tests, though. Aside from that, we can also
consider making it easier and more automated for the developer to determine
whether a particular flakiness comes from a feature branch one wants to
merge.
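The base-versus-feature comparison Jacek describes can be sketched as a small harness. Everything below is illustrative: `run_test` is a stand-in for the real invocation (a real one would check out the given ref and run the actual test target), simulated here with the "feature" ref failing every 5th run.

```shell
# Hedged sketch: estimate a test's flakiness on two refs and compare.
# run_test is a placeholder; a real version would execute the suspect
# test at the given ref. Here the "feature" ref fails every 5th run.
run_test() {  # args: <ref> <run-number>; returns nonzero on failure
    if [ "$1" = feature ] && [ $(($2 % 5)) -eq 0 ]; then return 1; fi
    return 0
}

count_failures() {  # args: <ref> <total-runs>; prints failure count
    fails=0 i=1
    while [ "$i" -le "$2" ]; do
        run_test "$1" "$i" || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

echo "base:    $(count_failures base 20) failures / 20 runs"
echo "feature: $(count_failures feature 20) failures / 20 runs"
```

A materially higher failure count on the feature ref than on base, over the same number of runs, is the signal Jacek is after; equal counts suggest pre-existing flakiness.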

thanks,
Jacek



Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-27 Thread Benjamin Lerer
Hi,

I must admit that I have been surprised by this merge and this following
email. We had lengthy discussions recently and the final agreement was that
the requirement for a merge was a green CI.
I can understand that, for various reasons, we as a community might wish to
make some exceptions. In the present case, though, there was no official
discussion to ask for an exception.
I believe that this merge creates a bad precedent where anybody can feel
entitled to merge without a green CI and disregard any previous community
agreement.



Re: CEP-21 - Transactional cluster metadata merged to trunk

2023-11-25 Thread Mick Semb Wever
Great work Sam, Alex & Marcus !



> There are about 15-20 flaky or failing tests in total, spread over several
> test jobs[2] (i.e. single digit failures in a few of these). We have filed
> JIRAs for the failures and are working on getting those fixed as a top
> priority. CASSANDRA-19055[3] is the umbrella ticket for this follow up work.
>
> There are also a number of improvements we will work on in the coming
> weeks, we will file JIRAs for those early next week and add them as
> subtasks to CASSANDRA-19055.
>


Can we get these tests temporarily annotated as skipped while all the
subtickets to 19055 are being worked on ?

As we have seen from CASSANDRA-18166 and CASSANDRA-19034 there's a lot of
overhead now on 5.0 tickets having to navigate around these failures in
trunk CI runs.

Also, we're still trying to figure out how to do repeated runs for a patch
so big… (the list of touched tests was too long for circleci, I need to
figure out what the limit is and chunk it into separate circleci configs) …
and it probably makes sense to wait until most of 19055 is done (or tests
are temporarily annotated as skipped).
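The chunking described above can be done mechanically; here is a hedged sketch where the chunk size (4), file names, and stand-in class names are all illustrative rather than CircleCI's actual limit:

```shell
# Sketch: split an over-long list of touched test classes into
# fixed-size chunks, one per generated CI config. Size and names
# are illustrative only.
set -e
dir=$(mktemp -d)
# stand-in for the real list of touched test classes
seq 1 10 | sed 's/^/org.apache.cassandra.SomeTest/' > "$dir/touched_tests.txt"
# split into chunks of at most 4 class names each
split -l 4 "$dir/touched_tests.txt" "$dir/chunk_"
# 10 classes at 4 per chunk -> 3 chunk files
ls "$dir" | grep -c '^chunk_'
```

Each resulting `chunk_*` file could then feed a separate repeated-run config, keeping every config under whatever the per-config limit turns out to be.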


CEP-21 - Transactional cluster metadata merged to trunk

2023-11-24 Thread Marcus Eriksson
Hi all

We just wanted to give a heads-up that we merged CEP-21, Transactional Cluster 
Metadata, to trunk[1] today.

There are about 15-20 flaky or failing tests in total, spread over several test 
jobs[2] (i.e. single digit failures in a few of these). We have filed JIRAs for 
the failures and are working on getting those fixed as a top priority. 
CASSANDRA-19055[3] is the umbrella ticket for this follow up work.

There are also a number of improvements we will work on in the coming weeks, we 
will file JIRAs for those early next week and add them as subtasks to 
CASSANDRA-19055.

This is a huge patch and so constant rebasing onto even small changesets is 
challenging, hence we took the decision to merge now despite CI not being 100% 
stable. Landing in trunk also mitigates this somewhat, as resolving these 
issues isn’t blocking any imminent release.  

Given the scope and ambition of the changes some teething problems will be 
inevitable, so please reach out in #cassandra-tcm on ASF Slack if you have any 
questions or problems. We do not expect this to add a substantial amount of 
work for other contributors, but if this happens to affect your patch we’ll 
gladly help out.

Thanks
Sam, Alex & Marcus


[1] 
https://github.com/apache/cassandra/commit/ae0842372ff6dd1437d026f82968a3749f555ff4
[2] 
https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe
 
[3] https://issues.apache.org/jira/browse/CASSANDRA-19055