Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-16 Thread Ekaterina Dimitrova
Thanks for opening an epic @Jacek.

It seems the dtest_offheap job is replaced by dtest_latest, which means we
will have the same number of jobs after the current ticket, so I am not
worried about Jenkins.

In CircleCI, though, the dtest_offheap job was not a mandatory pre-commit
run, while as far as I can see this ticket proposes dtest_latest as a
mandatory run in the pre-commit workflow.
I would like to suggest we commit the current proposal. I only think the
config should be marked experimental somewhere.

As a short-term solution to the increased resource consumption of the
pre-commit test runs, I would like to suggest we accept running only the J11
pre-commit workflow (which also covers tests run with J17) until we resume
the other discussion and apply other test configuration
changes/optimizations.

On Fri, 16 Feb 2024 at 9:08, Paulo Motta  wrote:

> Thanks for clarifying Branimir! I'm +1 on proceeding as proposed and I
> think this change will make it easier to gain confidence to update
> configurations.
>
> Interesting discussion and suggestions on this thread - I think we can
> follow-up on improving test/CI workflow in a different thread/proposal to
> avoid blocking this.
>
> On Thu, Feb 15, 2024 at 9:59 AM Branimir Lambov <
> branimir.lam...@datastax.com> wrote:
>
>> Paulo:
>>
>>> 1) Will cassandra.yaml remain the default test config? Is the plan
>>> moving forward to require green CI for both configurations on pre-commit,
>>> or pre-release?
>>
>> The plan is to ensure both configurations are green pre-commit. This
>> should not increase the CI cost as this replaces extra configurations we
>> were running before (e.g. test-tries).
>>
>> 2) What will this mean for the release artifact, is the idea to continue
>>> shipping with the current cassandra.yaml or eventually switch to the
>>> optimized configuration (ie. 6.X) while making the legacy default
>>> configuration available via an optional flag?
>>
>> The release simply includes an additional yaml file, which contains a
>> one-liner on how to use it.
>>
>> Jeff:
>>
>>> 1) If there’s an “old compatible default” and “latest recommended
>>> settings”, when does the value in “old compatible default” get updated?
>>> Never?
>>
>> This does not change anything about these decisions. The question is very
>> serious without this patch as well: Does V6 have to support pain-free
>> upgrade from V5 working in V4 compatible mode? If so, can we ever deprecate
>> or drop anything? If not, are we not breaking upgradeability promises?
>>
>> 2) If there are test failures with the new values, it seems REALLY
>>> IMPORTANT to make sure those test failures are discovered + fixed IN THE
>>> FUTURE TOO. If pushing new yaml into a different file makes us less likely
>>> to catch the failures in the future, it seems like we’re hurting ourselves.
>>> Branimir mentions this, but how do we ensure that we don’t let this pattern
>>> disguise future bugs?
>>
>> The main objective of this patch is to ensure that the second yaml is
>> tested too, pre-commit. We were not doing this for all features we tell
>> users are supported.
>>
>> Paulo:
>>
>>> - if cassandra_latest.yaml becomes the new default configuration for
>>> 6.0, then precommit only needs to be run against that version - prerelease
>>> needs to be run against all cassandra.yaml variants.
>>
>> Assuming we keep the pace of development, there will be new "latest"
>> features in 6.0 (e.g. Accord could be one). The idea is more to move some
>> of the settings from latest to default when they are deemed mature enough.
>>
>> Josh:
>>
>>> I propose to significantly reduce that stuff. Let's distinguish the
>>> packages of tests that need to be run with CDC enabled / disabled, with
>>> commitlog compression enabled / disabled, tests that verify sstable formats
>>> (mostly io and index I guess), and leave other parameters set as with the
>>> latest configuration - this is the easiest way I think.
>>> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
>>> other stuff. To me running no-vnodes makes no sense because no-vnodes is
>>> just a special case of vnodes=1. On the other hand offheap/onheap buffers
>>> could be tested in unit tests. In short, I'd run dtests only with the
>>> default and latest configuration.
>>
>> Some of these changes are already done in this ticket.
>>
>> Regards,
>> Branimir
>>
>>
>>
>> On Thu, Feb 15, 2024 at 3:08 PM Paulo Motta  wrote:
>>
>>> > It's also been questioned about why we don't just enable settings we
>>> recommend.  These are settings we recommend for new clusters.  *Our
>>> existing cassandra.yaml needs to be tailored for existing clusters being
>>> upgraded, where we are very conservative about changing defaults.*
>>>
>>> I think this unnecessarily penalizes new users with subpar defaults and
>>> existing users who wish to use optimized/recommended defaults and need to
>>> maintain additional logic to support that. This change offers an
>>> opportunity to revisit this.
>>>
>>> 

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-16 Thread David Capwell
I’m +1 once the tests are passing and +0 while they are failing.

Sent from my iPhone

On Feb 16, 2024, at 6:08 AM, Paulo Motta wrote:

> [quoted text trimmed; the full exchange appears in the surrounding messages]

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-16 Thread Paulo Motta
Thanks for clarifying Branimir! I'm +1 on proceeding as proposed and I
think this change will make it easier to gain confidence to update
configurations.

Interesting discussion and suggestions on this thread - I think we can
follow-up on improving test/CI workflow in a different thread/proposal to
avoid blocking this.

On Thu, Feb 15, 2024 at 9:59 AM Branimir Lambov <
branimir.lam...@datastax.com> wrote:

> Paulo:
>
>> 1) Will cassandra.yaml remain the default test config? Is the plan moving
>> forward to require green CI for both configurations on pre-commit, or
>> pre-release?
>
> The plan is to ensure both configurations are green pre-commit. This
> should not increase the CI cost as this replaces extra configurations we
> were running before (e.g. test-tries).
>
> 2) What will this mean for the release artifact, is the idea to continue
>> shipping with the current cassandra.yaml or eventually switch to the
>> optimized configuration (ie. 6.X) while making the legacy default
>> configuration available via an optional flag?
>
> The release simply includes an additional yaml file, which contains a
> one-liner on how to use it.
>
> Jeff:
>
>> 1) If there’s an “old compatible default” and “latest recommended
>> settings”, when does the value in “old compatible default” get updated?
>> Never?
>
> This does not change anything about these decisions. The question is very
> serious without this patch as well: Does V6 have to support pain-free
> upgrade from V5 working in V4 compatible mode? If so, can we ever deprecate
> or drop anything? If not, are we not breaking upgradeability promises?
>
> 2) If there are test failures with the new values, it seems REALLY
>> IMPORTANT to make sure those test failures are discovered + fixed IN THE
>> FUTURE TOO. If pushing new yaml into a different file makes us less likely
>> to catch the failures in the future, it seems like we’re hurting ourselves.
>> Branimir mentions this, but how do we ensure that we don’t let this pattern
>> disguise future bugs?
>
> The main objective of this patch is to ensure that the second yaml is
> tested too, pre-commit. We were not doing this for all features we tell
> users are supported.
>
> Paulo:
>
>> - if cassandra_latest.yaml becomes the new default configuration for 6.0,
>> then precommit only needs to be run against that version - prerelease needs
>> to be run against all cassandra.yaml variants.
>
> Assuming we keep the pace of development, there will be new "latest"
> features in 6.0 (e.g. Accord could be one). The idea is more to move some
> of the settings from latest to default when they are deemed mature enough.
>
> Josh:
>
>> I propose to significantly reduce that stuff. Let's distinguish the
>> packages of tests that need to be run with CDC enabled / disabled, with
>> commitlog compression enabled / disabled, tests that verify sstable formats
>> (mostly io and index I guess), and leave other parameters set as with the
>> latest configuration - this is the easiest way I think.
>> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
>> other stuff. To me running no-vnodes makes no sense because no-vnodes is
>> just a special case of vnodes=1. On the other hand offheap/onheap buffers
>> could be tested in unit tests. In short, I'd run dtests only with the
>> default and latest configuration.
>
> Some of these changes are already done in this ticket.
>
> Regards,
> Branimir
>
>
>
> On Thu, Feb 15, 2024 at 3:08 PM Paulo Motta  wrote:
>
>> > It's also been questioned about why we don't just enable settings we
>> recommend.  These are settings we recommend for new clusters.  *Our
>> existing cassandra.yaml needs to be tailored for existing clusters being
>> upgraded, where we are very conservative about changing defaults.*
>>
>> I think this unnecessarily penalizes new users with subpar defaults and
>> existing users who wish to use optimized/recommended defaults and need to
>> maintain additional logic to support that. This change offers an
>> opportunity to revisit this.
>>
>> Is not updating the default cassandra.yaml with new recommended
>> configuration just to protect existing clusters from accidentally
>> overriding cassandra.yaml with a new version during major upgrades? If so,
>> perhaps we could add a new explicit flag “enable_major_upgrade: false” to
>> “cassandra.yaml” that fails startup if an upgrade is detected and force
>> operators to review the configuration before a major upgrade?
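Paulo's proposed gate could look roughly like the following sketch. To be clear about assumptions: the `enable_major_upgrade` flag, the function name, and the version-detection logic here are all hypothetical; no such option exists in cassandra.yaml today.

```python
def check_major_upgrade(previous_version: str, current_version: str,
                        enable_major_upgrade: bool) -> None:
    """Refuse to start on a major-version jump unless the operator opted in.

    Hypothetical sketch of the proposed `enable_major_upgrade` flag;
    Cassandra has no such option today.
    """
    prev_major = int(previous_version.split(".")[0])
    curr_major = int(current_version.split(".")[0])
    if curr_major > prev_major and not enable_major_upgrade:
        raise RuntimeError(
            f"Major upgrade {previous_version} -> {current_version} detected; "
            "review the configuration and set enable_major_upgrade: true")

# A minor upgrade passes the gate without any opt-in:
check_major_upgrade("5.0.1", "5.0.2", enable_major_upgrade=False)
```

A 5.x-to-6.x jump with the flag left at false would raise, forcing the configuration review Paulo describes.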
>>
>> Related to Jeff’s question, I think we need a way to consolidate “latest
>> recommended settings” into “old compatible default” when cutting a new
>> major version, otherwise the files will diverge perpetually.
>>
>> I think cassandra_latest.yaml offers a way to “buffer” proposals for
>> default configuration changes which are consolidated into “cassandra.yaml”
>> in the subsequent major release, eventually converging configurations and
>> reducing the maintenance burden.
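Mechanically, the consolidation step described above could be as small as folding the "latest" overrides into the defaults when a new major is cut. A hedged sketch with made-up option names (not real cassandra.yaml keys):

```python
def consolidate(default_cfg: dict, latest_cfg: dict) -> tuple:
    """Fold the 'latest' overrides into the default config for a new major.

    Returns (new_default, new_latest): every override from the latest file
    becomes the new default, and the latest file starts the next cycle empty.
    """
    new_default = {**default_cfg, **latest_cfg}
    return new_default, {}

# Illustrative keys only, not a real cassandra.yaml excerpt:
default = {"setting_a": "conservative", "setting_b": False}
latest = {"setting_a": "recommended"}
new_default, new_latest = consolidate(default, latest)
```

After the merge, `new_default` carries the formerly-"latest" value for `setting_a`, and the empty `new_latest` is ready to buffer the next round of proposed default changes.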
>>
>> On Thu, 15 Feb 2024 at 04:24 Mick Semb We

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-16 Thread Jacek Lewandowski
We should conclude this discussion by answering Branimir's original
question. *I vote for merging that and exposing issues to the CI.*

For pre-commit optimization I've opened
https://issues.apache.org/jira/browse/CASSANDRA-19406 epic and we should
add exact tasks there to make this valuable discussion result in some
concrete actions. Then, we can discuss each task in a more organized way.

On Thu, 15 Feb 2024 at 21:29, Štefan Miklošovič wrote:

> I love David's idea of making this dual config stuff directly part of the
> tests. I'll just leave this here, where I quickly put together a super
> primitive runner:
>
>
> https://github.com/smiklosovic/cassandra/commit/693803772218b52c424491b826c704811d606a31
>
> We could just run by default with one config and annotate it with all
> configs if we think this is crucial to test in both scenarios.
>
> Anyway, happy to expand on this but I do not want to block any progress in
> the ticket, might come afterwards, just showing what is possible.
>
> On Thu, Feb 15, 2024 at 7:59 PM David Capwell  wrote:
>
>> This thread got large quick, yay!
>>
>> is there a reason all guardrails and reliability (aka repair retries)
>> configs are off by default?  They are off by default in the normal config
>> for backwards compatibility reasons, but if we are defining a config saying
>> what we recommend, we should enable these things by default IMO.
>>
>> This is one more question to be answered by this discussion. Are there
>> other options that should be enabled by the "latest" configuration? To what
>> values should they be set?
>> Is there something that is currently enabled that should not be?
>>
>>
>> Very likely, we should try to figure that out.  We should also answer how
>> conservative do we want to be by default?  There are many configs we need
>> to flesh out here, glad to help with the configs I authored (prob best for
>> JIRA rather than this thread)
>>
>>
>> Should we merge the configs breaking these tests?  No…. When we have
>> failing tests people do not spend the time to figure out if their logic
>> caused a regression and merge, making things more unstable… so when we
>> merge failing tests that leads to people merging even more failing tests...
>>
>> In this case this also means that people will not see at all failures
>> that they introduce in any of the advanced features, as they are not tested
>> at all. Also, since CASSANDRA-19167 and 19168 already have fixes, the
>> non-latest test suite will remain clean after merge. Note that these two
>> problems demonstrate that we have failures in the configuration we ship
>> with, because we are not actually testing it at all. IMHO this is a problem
>> that we should not delay fixing.
>>
>>
>> I am not arguing we should not get this into CI, but more we should fix
>> the known issues before getting into CI… it’s what we normally do, I don’t
>> see a reason to special case this work.
>>
>> I am 100% cool blocking 5.0 on these bugs found (even if they are test
>> failures), but don’t feel we should enable in CI until these issues are
>> resolved; we can add the yaml now, but not the CI pipelines.
>>
>>
>> 1) If there’s an “old compatible default” and “latest recommended
>> settings”, when does the value in “old compatible default” get updated?
>> Never?
>>
>>
>> How about replacing cassandra.yaml with cassandra_latest.yaml on trunk
>> when cutting cassandra-6.0 branch? Any new default changes on trunk go to
>> cassandra_latest.yaml.
>>
>>
>> I feel it’s dangerous to define this at the file level; we should do it at
>> the config level… I personally see us adding new features disabled by
>> default in cassandra.yaml and the recommended values in
>> cassandra_latest.yaml… If I add a config in 5.1.2, should it get enabled by
>> default in 6.0?  I don’t feel that’s wise.
>>
>> Maybe it makes sense to annotate the configs with the target version for
>> the default change?
>>
>> Let's distinguish the packages of tests that need to be run with CDC
>> enabled / disabled, with commitlog compression enabled / disabled, tests
>> that verify sstable formats (mostly io and index I guess), and leave other
>> parameters set as with the latest configuration - this is the easiest way I
>> think.
>>
>>
>> Yes please!  I really hate having a pipeline per config, we should
>> annotate this somehow in the tests that matter… junit can param the tests
>> for us so we cover the different configs the test supports… I have written
>> many tests that are costly and run on all these other pipelines but have 0
>> change in the config… just wasting resources rerunning…
>>
>> Pushing this to the test also is a better author/maintainer experience…
>> running the test in your IDE and seeing all the params and their results is
>> so much better than monkeying around with yaml files and ant…. My repair
>> simulation tests have a hack flag to try to switch the yaml to make it
>> easier to test against the other configs and I loathe it so much…
>>
>> To me

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Štefan Miklošovič
I love David's idea of making this dual config stuff directly part of the
tests. I'll just leave this here, where I quickly put together a super
primitive runner:

https://github.com/smiklosovic/cassandra/commit/693803772218b52c424491b826c704811d606a31

We could just run by default with one config and annotate it with all
configs if we think this is crucial to test in both scenarios.

Anyway, happy to expand on this but I do not want to block any progress in
the ticket, might come afterwards, just showing what is possible.
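The annotate-per-test idea maps naturally onto parameterized tests. As a rough illustration of the shape only, here is a Python sketch with `unittest.subTest` standing in for a JUnit parameterized runner, with invented config values rather than the real yaml contents:

```python
import unittest

# Illustrative stand-ins for the two configs; a real runner would load
# cassandra.yaml and cassandra_latest.yaml instead.
CONFIGS = {
    "default": {"memtable_allocation_type": "heap_buffers"},
    "latest": {"memtable_allocation_type": "offheap_objects"},
}

class MemtableConfigTest(unittest.TestCase):
    def test_allocation_type_is_supported(self):
        # One test body, executed once per annotated config via subTest,
        # instead of one CI pipeline per config.
        for name, cfg in CONFIGS.items():
            with self.subTest(config=name):
                self.assertIn(
                    cfg["memtable_allocation_type"],
                    {"heap_buffers", "offheap_buffers", "offheap_objects"},
                )
```

Run with `python -m unittest`: a failure reports which config broke, and tests that are config-insensitive run only once.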

On Thu, Feb 15, 2024 at 7:59 PM David Capwell  wrote:

> This thread got large quick, yay!
>
> is there a reason all guardrails and reliability (aka repair retries)
> configs are off by default?  They are off by default in the normal config
> for backwards compatibility reasons, but if we are defining a config saying
> what we recommend, we should enable these things by default IMO.
>
> This is one more question to be answered by this discussion. Are there
> other options that should be enabled by the "latest" configuration? To what
> values should they be set?
> Is there something that is currently enabled that should not be?
>
>
> Very likely, we should try to figure that out.  We should also answer how
> conservative do we want to be by default?  There are many configs we need
> to flesh out here, glad to help with the configs I authored (prob best for
> JIRA rather than this thread)
>
>
> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing tests...
>
> In this case this also means that people will not see at all failures that
> they introduce in any of the advanced features, as they are not tested at
> all. Also, since CASSANDRA-19167 and 19168 already have fixes, the
> non-latest test suite will remain clean after merge. Note that these two
> problems demonstrate that we have failures in the configuration we ship
> with, because we are not actually testing it at all. IMHO this is a problem
> that we should not delay fixing.
>
>
> I am not arguing we should not get this into CI, but more we should fix
> the known issues before getting into CI… it’s what we normally do, I don’t
> see a reason to special case this work.
>
> I am 100% cool blocking 5.0 on these bugs found (even if they are test
> failures), but don’t feel we should enable in CI until these issues are
> resolved; we can add the yaml now, but not the CI pipelines.
>
>
> 1) If there’s an “old compatible default” and “latest recommended
> settings”, when does the value in “old compatible default” get updated?
> Never?
>
>
> How about replacing cassandra.yaml with cassandra_latest.yaml on trunk
> when cutting cassandra-6.0 branch? Any new default changes on trunk go to
> cassandra_latest.yaml.
>
>
> I feel it’s dangerous to define this at the file level; we should do it at the
> config level… I personally see us adding new features disabled by default
> in cassandra.yaml and the recommended values in cassandra_latest.yaml… If I
> add a config in 5.1.2, should it get enabled by default in 6.0?  I don’t
> feel that’s wise.
>
> Maybe it makes sense to annotate the configs with the target version for
> the default change?
>
> Let's distinguish the packages of tests that need to be run with CDC
> enabled / disabled, with commitlog compression enabled / disabled, tests
> that verify sstable formats (mostly io and index I guess), and leave other
> parameters set as with the latest configuration - this is the easiest way I
> think.
>
>
> Yes please!  I really hate having a pipeline per config, we should
> annotate this somehow in the tests that matter… junit can param the tests
> for us so we cover the different configs the test supports… I have written
> many tests that are costly and run on all these other pipelines but have 0
> change in the config… just wasting resources rerunning…
>
> Pushing this to the test also is a better author/maintainer experience…
> running the test in your IDE and seeing all the params and their results is
> so much better than monkeying around with yaml files and ant…. My repair
> simulation tests have a hack flag to try to switch the yaml to make it
> easier to test against the other configs and I loathe it so much…
>
> To me running no-vnodes makes no sense because no-vnodes is just a special
> case of vnodes=1
>
>
> And to many in the community the only config for production =).  In this
> debate we have 3 camps: no-vnode, vnode <= 4 tokens, vnode > 4 tokens (I am
> simplifying….)… For those in the no-vnode camp their tests are focused on
> this case and get disabled when vnodes are enabled (so not running this
> config lowers coverage).
>
> I don’t think we are going to solve this debate in this thread, but if we
> can push this to the test to run as a param I think tha

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Ekaterina Dimitrova
>
> Perhaps this was needed during j17 stabilization but is no longer required?

No, I only switched from tests running J8+J11 to tests running J11+J17.
What we tested was something decided in the 4.0 era when JDK11 was added,
and I was not even part of the community yet :-)

> Any known java-related changes require precommit j11 + j17.

While it is correct, we need to run both when changing anything
Java-related. I think more cases are not always obvious during development.
Just a few examples off the top of my head:
- Every next Java version is closing more and more internals - this means
that someone can be tempted to use some JDK11 internals, which are still
open in 11 but closed in later Java versions, and we will be in trouble. I
would prefer to see that before committing the patch, not after. Some cases
can be considered early in development and save us time. They are not
necessarily caught in compile time.
- Not all dependencies always support all Java versions, and it is not
necessarily immediately obvious
- Not all issues are exposed as failures on every CI run - we have flakies
hard to reproduce with runtime exceptions, especially around things that
require object crawling - off the top of my head, jamm and the leak detector -
but they could also come from a dependency we haven't even thought about.

With all due respect to everyone and the extremely valuable discussion
here around what we run in CI, I fear we have diverged from what we do about
the ticket at hand - which is feasible today - to the long-term discussion
of nightly builds, etc. I believe the long-term discussion deserves its own
thread, ticket, and work.
@Berenguer, thank you for volunteering to open an additional thread and
working on a new suggestion. Does anyone have anything against moving all
long-term suggestions not immediately related to this work here into a new
discussion thread? Also, in the spirit of the repeatable CI project,
creating a table with pre-commit and post-commit suggested jobs to run will
be good. Then we can decide what we want and as a second step add/remove
jobs in Jenkins, Circle, or whatever other CI people use at the moment and
hopefully converge it soon through the repeatable CI project. Do you think
this makes sense?

Again, I don't see value in running build J11 and J17 runtime additionally
> to J11 runtime - just pick one unless we change something specific to JVM

All JDK-17 problems I've seen were exposed in both situations - run 17 with
build 11 or 17 build. So I am fine with Jacek's suggestion, too, but I
prefer us to run on every commit, whatever we ship with. In the case of 5.0
- build JDK11, run JDK11 tests, run JDK17 tests, and to help ourselves -
build with JDK17.

Branimir in this patch has already done some basic cleanup of test
> variations, so this is not a duplication of the pipeline. It's a
> significant improvement.
>  I'm ok with cassandra_latest being committed and added to the pipeline,
> *if* the authors genuinely believe there's significant time and effort
> saved in doing so.

I share this sentiment if people are okay with us now adding new pre-commit
Python and Java distributed test jobs with the new configuration, and this
is not going to raise the resource consumption a lot. (Python tests are the
most resource-heavy tests, though we do not look at upgrade tests.)

The plan is to ensure both configurations are green pre-commit. This should
> not increase the CI cost as this replaces extra configurations we were
> running before (e.g. test-tries).

Branimir, did you also replace any Python tests? I am not worried about
unit test consumption but about the Python tests primarily.  Those are
running on the bigger containers in CircleCI, which burn more credits.
Also, Stefan, valid point - does Jenkins currently have enough resources to
cover the load? Was this tested?


On Thu, 15 Feb 2024 at 13:59, David Capwell  wrote:

> This thread got large quick, yay!
>
> is there a reason all guardrails and reliability (aka repair retries)
> configs are off by default?  They are off by default in the normal config
> for backwards compatibility reasons, but if we are defining a config saying
> what we recommend, we should enable these things by default IMO.
>
> This is one more question to be answered by this discussion. Are there
> other options that should be enabled by the "latest" configuration? To what
> values should they be set?
> Is there something that is currently enabled that should not be?
>
>
> Very likely, we should try to figure that out.  We should also answer how
> conservative do we want to be by default?  There are many configs we need
> to flesh out here, glad to help with the configs I authored (prob best for
> JIRA rather than this thread)
>
>
> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge faili

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread David Capwell
This thread got large quick, yay!

>> is there a reason all guardrails and reliability (aka repair retries) 
>> configs are off by default?  They are off by default in the normal config 
>> for backwards compatibility reasons, but if we are defining a config saying 
>> what we recommend, we should enable these things by default IMO.
> 
> This is one more question to be answered by this discussion. Are there other 
> options that should be enabled by the "latest" configuration? To what values 
> should they be set?
> Is there something that is currently enabled that should not be?

Very likely, we should try to figure that out.  We should also answer how 
conservative do we want to be by default?  There are many configs we need to 
flesh out here, glad to help with the configs I authored (prob best for JIRA 
rather than this thread)

> 
>> Should we merge the configs breaking these tests?  No…. When we have failing 
>> tests people do not spend the time to figure out if their logic caused a 
>> regression and merge, making things more unstable… so when we merge failing 
>> tests that leads to people merging even more failing tests...
> In this case this also means that people will not see at all failures that 
> they introduce in any of the advanced features, as they are not tested at 
> all. Also, since CASSANDRA-19167 and 19168 already have fixes, the non-latest 
> test suite will remain clean after merge. Note that these two problems 
> demonstrate that we have failures in the configuration we ship with, because 
> we are not actually testing it at all. IMHO this is a problem that we should 
> not delay fixing.


I am not arguing we should not get this into CI, but rather that we should fix 
the known issues before getting it into CI… it's what we normally do, and I 
don't see a reason to special-case this work.

I am 100% cool with blocking 5.0 on these bugs (even if they are test 
failures), but I don't feel we should enable this in CI until those issues are 
resolved; we can add the yaml now, but not the CI pipelines.


> 1) If there’s an “old compatible default” and “latest recommended settings”, 
> when does the value in “old compatible default” get updated? Never? 

> How about replacing cassandra.yaml with cassandra_latest.yaml on trunk when 
> cutting cassandra-6.0 branch? Any new default changes on trunk go to 
> cassandra_latest.yaml.

I feel it's dangerous to define this at the file level and we should do it at 
the config level… I personally see us adding new features disabled by default 
in cassandra.yaml and the recommended values in cassandra_latest.yaml… If I 
add a config in 5.1.2, should it get enabled by default in 6.0?  I don't feel 
that's wise.

Maybe it makes sense to annotate the configs with the target version for the 
default change?
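A hypothetical shape for such an annotation (no mechanism like this exists today; the key and comment format are invented):

```yaml
# Hypothetical: ship the compatibility default now, and record the release in
# which the recommended value should become the default.
new_feature_enabled: false   # recommended: true, promote-default-in: 6.0
```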

> Let's distinguish the packages of tests that need to be run with CDC enabled 
> / disabled, with commitlog compression enabled / disabled, tests that verify 
> sstable formats (mostly io and index I guess), and leave other parameters set 
> as with the latest configuration - this is the easiest way I think. 

Yes please!  I really hate having a pipeline per config; we should annotate 
this somehow in the tests that matter… JUnit can parameterize the tests for us 
so we cover the different configs a test supports… I have written many tests 
that are costly and run on all these other pipelines but behave identically 
regardless of config… just wasting resources rerunning…

Pushing this to the test is also a better author/maintainer experience… running 
the test in your IDE and seeing all the params and their results is so much 
better than monkeying around with yaml files and ant… My repair simulation 
tests have a hack flag to switch the yaml to make it easier to test against 
the other configs, and I loathe it so much…
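A minimal sketch of that parameterization idea, with plain Python standing in for JUnit's @ParameterizedTest (or pytest's parametrize); the test and config names are invented for illustration:

```python
# Each test declares the configs it is actually sensitive to, and the runner
# expands it into one case per config -- instead of one CI pipeline per yaml.
def expand(tests_to_configs):
    """Expand each test over only the configs it supports."""
    return [f"{test}[{cfg}]"
            for test, configs in tests_to_configs.items()
            for cfg in configs]

cases = expand({
    "commitlog_test": ["default", "latest"],  # config-sensitive: run under both yamls
    "btree_test": ["default"],                # config-insensitive: run once
})
print(cases)
# → ['commitlog_test[default]', 'commitlog_test[latest]', 'btree_test[default]']
```

The same shape in JUnit would be a parameterized test over the supported config names, so config-insensitive tests stop being rerun across pipelines.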

> To me running no-vnodes makes no sense because no-vnodes is just a special 
> case of vnodes=1

And to many in the community it's the only config for production =).  In this 
debate we have 3 camps: no-vnode, vnode <= 4 tokens, vnode > 4 tokens (I am 
simplifying…). For those in the no-vnode camp, their tests are focused on this 
case and get disabled when vnodes are enabled (so not running this config 
lowers coverage).

I don’t think we are going to solve this debate in this thread, but if we can 
push this to the test to run as a param I think that's best… avoid having 2 
pipelines and push this to the tests that actually support both configs...

> On Feb 15, 2024, at 10:20 AM, Jon Haddad  wrote:
> 
> For the sake of argument, if we picked one, would you (anyone, not just 
> Stefan) be OK with the JVM that's selected being the one you don't use?  I'm 
> willing to bet most folks will have a preference for the JVM they run in 
> production.  I think both should be run on some frequent basis but I can 
> definitely see the reasoning behind not wanting it to block folks on work, it 
> sounds like a lot of wasted days waiting on CI especially during a bigger 
> multi-cycle

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Štefan Miklošovič
This goes back to what I was talking about previously: we might even run
tests for both J11 and J17, e.g. in Circle, _but only on a selected set of
tests_ where there is some kind of "tension" between the code and the Java
version, whatever that means. Like Chronicle queues or BTrees etc.; I merely
remember that being somehow problematic in the past ...

Maybe Ekaterina could bring her insight into what were the biggest and most
important parts of Java 17 support during her work towards that?

However, it is questionable how we would actually separate what is Java
17-worthy to test, as pretty much everything is connected to everything else.

What you suggest in terms of running it all periodically resonates
positively with me.

On Thu, Feb 15, 2024 at 7:22 PM Jon Haddad  wrote:

> For the sake of argument, if we picked one, would you (anyone, not just
> Stefan) be OK with the JVM that's selected being the one you don't use?
> I'm willing to bet most folks will have a preference for the JVM they run
> in production.  I think both should be run on some frequent basis but I can
> definitely see the reasoning behind not wanting it to block folks on work,
> it sounds like a lot of wasted days waiting on CI especially during a
> bigger multi-cycle review.
>
> I suppose that it wouldn't necessarily need to be consistent - for example
> some folks might use 17 and others 11.  If this was the route the project
> goes down, it seems like it would be reasonable for someone to run
> whichever JVM version they felt like.  Hopefully at least a few regular
> committers would run 17, and that might be enough.
>
> Maybe instead of running the full suite on post - commit, it could be run
> periodically, like once a night or if it's longer than 24h run once a
> week.  If both JVMs get hit b/c different teams opt to use different
> versions, maybe it ends up being enough coverage.
>
> I'm curious if anyone can think of an issue that affected one JVM and not
> others, because that would probably help determine the usefulness of 2 test
> suites.
>
>
> On Thu, Feb 15, 2024 at 10:01 AM Štefan Miklošovič <
> stefan.mikloso...@gmail.com> wrote:
>
>> Only requiring building on supported JDKs and running all tests only on a
>> pre-defined version is definitely an improvement in terms of build time.
>> Building it is cheap, one worker and 5 minutes.
>>
>> As I already said, just want to reiterate that, instead of _running with
>> all Java's_ we might run with one Java version, we would just change it for
>> runs of two yamls (default and latest).
>>
>> However, this would put more stress on Jenkins based on what you
>> described in Post-commit point. Just saying it aloud.
>>
>> On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie 
>> wrote:
>>
>>> Would it make sense to only block commits on the test strategy you've
>>> listed, and shift the entire massive test suite to post-commit?
>>>
>>>
>>> Lots and lots of other emails
>>>
>>>
>>> ;)
>>>
>>> There's an interesting broad question of: What config do we consider
>>> "recommended" going forward, the "conservative" (i.e. old) or the
>>> "performant" (i.e. new)? And what JDK do we consider "recommended" going
>>> forward, the oldest we support or the newest?
>>>
>>> Since those recommendations apply for new clusters, people need to
>>> qualify their setups, and we have a high bar of quality on testing
>>> pre-merge, my gut tells me "performant + newest JDK". This would impact
>>> what we'd test pre-commit IMO.
>>>
>>> Having been doing a lot of CI stuff lately, some observations:
>>>
>>>- Our True North needs to be releasing a database that's free of
>>>defects that violate our core properties we commit to our users. No data
>>>loss, no data resurrection, transient or otherwise, due to defects in our
>>>code (meteors, tsunamis, etc notwithstanding).
>>>- The relationship of time spent on CI and stability of final full
>>>*post-commit* runs is asymptotic. It's not even 90/10; we're
>>>probably somewhere like 98% value gained from 10% of work, and the other 
>>> 2%
>>>"stability" (i.e. green test suites, not "our database works") is a
>>>long-tail slog. Especially in the current ASF CI heterogenous env w/its
>>>current orchestration.
>>>- Thus: Pre-commit and post-commit should be different. The
>>>following points all apply to pre-commit:
>>>- The goal of pre-commit tests should be some number of 9's of no
>>>test failures post-commit (i.e. for every 20 green pre-commit we 
>>> introduce
>>>1 flake post-commit). Not full perfection; it's not worth the compute and
>>>complexity.
>>>- We should *build *all branches on all supported JDK's (8 + 11 for
>>>older, 11 + 17 for newer, etc).
>>>- We should *run *all test suites with the *recommended *
>>>*configuration* against the *highest versioned JDK a branch
>>>supports. *And we should formally recommend our users run on that
>>>JDK.
>>>- We should *at least* run all jvm-b

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Jon Haddad
For the sake of argument, if we picked one, would you (anyone, not just
Stefan) be OK with the JVM that's selected being the one you don't use?
I'm willing to bet most folks will have a preference for the JVM they run
in production.  I think both should be run on some frequent basis but I can
definitely see the reasoning behind not wanting it to block folks on work,
it sounds like a lot of wasted days waiting on CI especially during a
bigger multi-cycle review.

I suppose that it wouldn't necessarily need to be consistent - for example
some folks might use 17 and others 11.  If this was the route the project
goes down, it seems like it would be reasonable for someone to run
whichever JVM version they felt like.  Hopefully at least a few regular
committers would run 17, and that might be enough.

Maybe instead of running the full suite on post - commit, it could be run
periodically, like once a night or if it's longer than 24h run once a
week.  If both JVMs get hit b/c different teams opt to use different
versions, maybe it ends up being enough coverage.

I'm curious if anyone can think of an issue that affected one JVM and not
others, because that would probably help determine the usefulness of 2 test
suites.


On Thu, Feb 15, 2024 at 10:01 AM Štefan Miklošovič <
stefan.mikloso...@gmail.com> wrote:

> Only requiring building on supported JDKs and running all tests only on a
> pre-defined version is definitely an improvement in terms of build time.
> Building it is cheap, one worker and 5 minutes.
>
> As I already said, just want to reiterate that, instead of _running with
> all Java's_ we might run with one Java version, we would just change it for
> runs of two yamls (default and latest).
>
> However, this would put more stress on Jenkins based on what you described
> in Post-commit point. Just saying it aloud.
>
> On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie 
> wrote:
>
>> Would it make sense to only block commits on the test strategy you've
>> listed, and shift the entire massive test suite to post-commit?
>>
>>
>> Lots and lots of other emails
>>
>>
>> ;)
>>
>> There's an interesting broad question of: What config do we consider
>> "recommended" going forward, the "conservative" (i.e. old) or the
>> "performant" (i.e. new)? And what JDK do we consider "recommended" going
>> forward, the oldest we support or the newest?
>>
>> Since those recommendations apply for new clusters, people need to
>> qualify their setups, and we have a high bar of quality on testing
>> pre-merge, my gut tells me "performant + newest JDK". This would impact
>> what we'd test pre-commit IMO.
>>
>> Having been doing a lot of CI stuff lately, some observations:
>>
>>- Our True North needs to be releasing a database that's free of
>>defects that violate our core properties we commit to our users. No data
>>loss, no data resurrection, transient or otherwise, due to defects in our
>>code (meteors, tsunamis, etc notwithstanding).
>>- The relationship of time spent on CI and stability of final full
>>*post-commit* runs is asymptotic. It's not even 90/10; we're probably
>>somewhere like 98% value gained from 10% of work, and the other 2%
>>"stability" (i.e. green test suites, not "our database works") is a
>>long-tail slog. Especially in the current ASF CI heterogenous env w/its
>>current orchestration.
>>- Thus: Pre-commit and post-commit should be different. The following
>>points all apply to pre-commit:
>>- The goal of pre-commit tests should be some number of 9's of no
>>test failures post-commit (i.e. for every 20 green pre-commit we introduce
>>1 flake post-commit). Not full perfection; it's not worth the compute and
>>complexity.
>>- We should *build *all branches on all supported JDK's (8 + 11 for
>>older, 11 + 17 for newer, etc).
>>- We should *run *all test suites with the *recommended *
>>*configuration* against the *highest versioned JDK a branch
>>supports. *And we should formally recommend our users run on that JDK.
>>- We should *at least* run all jvm-based configurations on the
>>highest supported JDK version with the "not recommended but still
>>supported" configuration.
>>- I'm open to being persuaded that we should at least run jvm-unit
>>tests on the older JDK w/the conservative config pre-commit, but not much
>>beyond that.
>>
>> That would leave us with the following distilled:
>>
>> *Pre-commit:*
>>
>>- Build on all supported jdks
>>- All test suites on highest supported jdk using recommended config
>>- Repeat testing on new or changed tests on highest supported JDK
>>w/recommended config
>>- JDK-based test suites on highest supported jdk using other config
>>
>> *Post-commit:*
>>
>>- Run everything. All suites, all supported JDK's, both config files.
>>
>> With Butler + the *jenkins-jira* integration script
>> 

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Brandon Williams
I don't think there is as simple a way to identify those since there
are many ways you can specify a single token.

Kind Regards,
Brandon



On Thu, Feb 15, 2024 at 11:45 AM Jacek Lewandowski
 wrote:
>
> Brandon, that should be doable with the current filters I think - that is, 
> select only those tests which do not support vnodes. Do you know about such 
> in-jvm dtests as well?
>
> - - -- --- -  -
> Jacek Lewandowski
>
>
> czw., 15 lut 2024 o 18:21 Brandon Williams  napisał(a):
>>
>> On Thu, Feb 15, 2024 at 1:10 AM Jacek Lewandowski
>>  wrote:
>> > For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about 
>> > other stuff. To me running no-vnodes makes no sense because no-vnodes is 
>> > just a special case of vnodes=1. On the other hand offheap/onheap buffers 
>> > could be tested in unit tests. In short, I'd run dtests only with the 
>> > default and latest configuration.
>>
>> I largely agree that no-vnodes isn't useful, but there are some
>> non-vnode operations like moving a token that don't work with vnodes
>> and still need to be tested.  I think we could probably get quick
>> savings by breaking out the @no_vnodes tests to a separate suite run
>> so we aren't completely doubling our effort for little gain with every
>> commit.


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Štefan Miklošovič
Only requiring building on supported JDKs and running all tests only on a
pre-defined version is definitely an improvement in terms of build time.
Building it is cheap, one worker and 5 minutes.

As I already said, just want to reiterate that, instead of _running with
all Java's_ we might run with one Java version, we would just change it for
runs of two yamls (default and latest).

However, this would put more stress on Jenkins based on what you described
in Post-commit point. Just saying it aloud.

On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie  wrote:

> Would it make sense to only block commits on the test strategy you've
> listed, and shift the entire massive test suite to post-commit?
>
>
> Lots and lots of other emails
>
>
> ;)
>
> There's an interesting broad question of: What config do we consider
> "recommended" going forward, the "conservative" (i.e. old) or the
> "performant" (i.e. new)? And what JDK do we consider "recommended" going
> forward, the oldest we support or the newest?
>
> Since those recommendations apply for new clusters, people need to qualify
> their setups, and we have a high bar of quality on testing pre-merge, my
> gut tells me "performant + newest JDK". This would impact what we'd test
> pre-commit IMO.
>
> Having been doing a lot of CI stuff lately, some observations:
>
>- Our True North needs to be releasing a database that's free of
>defects that violate our core properties we commit to our users. No data
>loss, no data resurrection, transient or otherwise, due to defects in our
>code (meteors, tsunamis, etc notwithstanding).
>- The relationship of time spent on CI and stability of final full
>*post-commit* runs is asymptotic. It's not even 90/10; we're probably
>somewhere like 98% value gained from 10% of work, and the other 2%
>"stability" (i.e. green test suites, not "our database works") is a
>long-tail slog. Especially in the current ASF CI heterogenous env w/its
>current orchestration.
>- Thus: Pre-commit and post-commit should be different. The following
>points all apply to pre-commit:
>- The goal of pre-commit tests should be some number of 9's of no test
>failures post-commit (i.e. for every 20 green pre-commit we introduce 1
>flake post-commit). Not full perfection; it's not worth the compute and
>complexity.
>- We should *build *all branches on all supported JDK's (8 + 11 for
>older, 11 + 17 for newer, etc).
>- We should *run *all test suites with the *recommended *
>*configuration* against the *highest versioned JDK a branch supports. *And
>we should formally recommend our users run on that JDK.
>- We should *at least* run all jvm-based configurations on the highest
>supported JDK version with the "not recommended but still supported"
>configuration.
>- I'm open to being persuaded that we should at least run jvm-unit
>tests on the older JDK w/the conservative config pre-commit, but not much
>beyond that.
>
> That would leave us with the following distilled:
>
> *Pre-commit:*
>
>- Build on all supported jdks
>- All test suites on highest supported jdk using recommended config
>- Repeat testing on new or changed tests on highest supported JDK
>w/recommended config
>- JDK-based test suites on highest supported jdk using other config
>
> *Post-commit:*
>
>- Run everything. All suites, all supported JDK's, both config files.
>
> With Butler + the *jenkins-jira* integration script (need to dust that off
> but it should remain good to go), we should have a pretty
> clear view as to when any consistent regressions are introduced and why.
> We'd remain exposed to JDK-specific flake introductions and flakes in
> unchanged tests, but there's no getting around the 2nd one and I expect the
> former to be rare enough to not warrant the compute to prevent it.
>
> On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
>
> Would it make sense to only block commits on the test strategy you've
> listed, and shift the entire massive test suite to post-commit?  If there
> really is only a small % of times the entire suite is useful this seems
> like it could unblock the dev cycle but still have the benefit of the full
> test suite.
>
>
>
> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi 
> wrote:
>
>
> On reducing circle ci usage during dev while iterating, not with the
> intention to replace the pre-commit CI (yet), we could do away with testing
> only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo.
> That would greatly reduce usage. I hacked it quickly here for illustration
> purposes:
> https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1
> The good thing is that we have the tooling to dial in whatever we decide
> atm.
>
> Changing pre-commit is a different discu

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Jacek Lewandowski
Brandon, that should be doable with the current filters I think - that is,
select only those tests which do not support vnodes. Do you know about such
in-jvm dtests as well?

- - -- --- -  -
Jacek Lewandowski


czw., 15 lut 2024 o 18:21 Brandon Williams  napisał(a):

> On Thu, Feb 15, 2024 at 1:10 AM Jacek Lewandowski
>  wrote:
> > For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
> other stuff. To me running no-vnodes makes no sense because no-vnodes is
> just a special case of vnodes=1. On the other hand offheap/onheap buffers
> could be tested in unit tests. In short, I'd run dtests only with the
> default and latest configuration.
>
> I largely agree that no-vnodes isn't useful, but there are some
> non-vnode operations like moving a token that don't work with vnodes
> and still need to be tested.  I think we could probably get quick
> savings by breaking out the @no_vnodes tests to a separate suite run
> so we aren't completely doubling our effort for little gain with every
> commit.
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Brandon Williams
On Thu, Feb 15, 2024 at 1:10 AM Jacek Lewandowski
 wrote:
> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about other 
> stuff. To me running no-vnodes makes no sense because no-vnodes is just a 
> special case of vnodes=1. On the other hand offheap/onheap buffers could be 
> tested in unit tests. In short, I'd run dtests only with the default and 
> latest configuration.

I largely agree that no-vnodes isn't useful, but there are some
non-vnode operations like moving a token that don't work with vnodes
and still need to be tested.  I think we could probably get quick
savings by breaking out the @no_vnodes tests to a separate suite run
so we aren't completely doubling our effort for little gain with every
commit.
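A rough sketch of that split (plain Python; the marker name mirrors the @no_vnodes tag above, while the test names and structure are invented):

```python
# Break the @no_vnodes tests into their own suite so the main (vnodes-enabled)
# run isn't doubled for little gain on every commit.
def partition(tests, marker="no_vnodes"):
    """Return (vnode-compatible bulk, non-vnode-only extras)."""
    bulk = {name for name, markers in tests.items() if marker not in markers}
    extras = {name for name, markers in tests.items() if marker in markers}
    return bulk, extras

tests = {
    "test_read_repair": set(),         # runs fine with vnodes
    "test_move_token": {"no_vnodes"},  # moving a token needs single-token nodes
}
bulk, extras = partition(tests)
print(sorted(bulk))    # every-commit suite, vnodes enabled
print(sorted(extras))  # smaller separate suite, vnodes disabled
# → ['test_read_repair'] then ['test_move_token']
```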


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Jacek Lewandowski
Great summary Josh,

>
>- JDK-based test suites on highest supported jdk using other config
>
> Do you mean a smoke test suite by that ^ ?

- - -- --- -  -
Jacek Lewandowski


czw., 15 lut 2024 o 18:12 Josh McKenzie  napisał(a):

> Would it make sense to only block commits on the test strategy you've
> listed, and shift the entire massive test suite to post-commit?
>
>
> Lots and lots of other emails
>
>
> ;)
>
> There's an interesting broad question of: What config do we consider
> "recommended" going forward, the "conservative" (i.e. old) or the
> "performant" (i.e. new)? And what JDK do we consider "recommended" going
> forward, the oldest we support or the newest?
>
> Since those recommendations apply for new clusters, people need to qualify
> their setups, and we have a high bar of quality on testing pre-merge, my
> gut tells me "performant + newest JDK". This would impact what we'd test
> pre-commit IMO.
>
> Having been doing a lot of CI stuff lately, some observations:
>
>- Our True North needs to be releasing a database that's free of
>defects that violate our core properties we commit to our users. No data
>loss, no data resurrection, transient or otherwise, due to defects in our
>code (meteors, tsunamis, etc notwithstanding).
>- The relationship of time spent on CI and stability of final full
>*post-commit* runs is asymptotic. It's not even 90/10; we're probably
>somewhere like 98% value gained from 10% of work, and the other 2%
>"stability" (i.e. green test suites, not "our database works") is a
>long-tail slog. Especially in the current ASF CI heterogenous env w/its
>current orchestration.
>- Thus: Pre-commit and post-commit should be different. The following
>points all apply to pre-commit:
>- The goal of pre-commit tests should be some number of 9's of no test
>failures post-commit (i.e. for every 20 green pre-commit we introduce 1
>flake post-commit). Not full perfection; it's not worth the compute and
>complexity.
>- We should *build *all branches on all supported JDK's (8 + 11 for
>older, 11 + 17 for newer, etc).
>- We should *run *all test suites with the *recommended *
>*configuration* against the *highest versioned JDK a branch supports. *And
>we should formally recommend our users run on that JDK.
>- We should *at least* run all jvm-based configurations on the highest
>supported JDK version with the "not recommended but still supported"
>configuration.
>- I'm open to being persuaded that we should at least run jvm-unit
>tests on the older JDK w/the conservative config pre-commit, but not much
>beyond that.
>
> That would leave us with the following distilled:
>
> *Pre-commit:*
>
>- Build on all supported jdks
>- All test suites on highest supported jdk using recommended config
>- Repeat testing on new or changed tests on highest supported JDK
>w/recommended config
>- JDK-based test suites on highest supported jdk using other config
>
> *Post-commit:*
>
>- Run everything. All suites, all supported JDK's, both config files.
>
> With Butler + the *jenkins-jira* integration script (need to dust that off
> but it should remain good to go), we should have a pretty
> clear view as to when any consistent regressions are introduced and why.
> We'd remain exposed to JDK-specific flake introductions and flakes in
> unchanged tests, but there's no getting around the 2nd one and I expect the
> former to be rare enough to not warrant the compute to prevent it.
>
> On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
>
> Would it make sense to only block commits on the test strategy you've
> listed, and shift the entire massive test suite to post-commit?  If there
> really is only a small % of times the entire suite is useful this seems
> like it could unblock the dev cycle but still have the benefit of the full
> test suite.
>
>
>
> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi 
> wrote:
>
>
> On reducing circle ci usage during dev while iterating, not with the
> intention to replace the pre-commit CI (yet), we could do away with testing
> only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo.
> That would greatly reduce usage. I hacked it quickly here for illustration
> purposes:
> https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1
> The good thing is that we have the tooling to dial in whatever we decide
> atm.
>
> Changing pre-commit is a different discussion, to which I agree btw. But
> the above could save time and $ big time during dev and be done and merged
> in a matter of days imo.
>
> I can open a DISCUSS thread if we feel it's worth it.
> On 15/2/24 10:24, Mick Semb Wever wrote:
>
>
>
> Mick and Ekaterina (and everyone really) - any thoughts o

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Josh McKenzie
> Would it make sense to only block commits on the test strategy you've listed, 
> and shift the entire massive test suite to post-commit? 

> Lots and lots of other emails

;)

There's an interesting broad question of: What config do we consider 
"recommended" going forward, the "conservative" (i.e. old) or the "performant" 
(i.e. new)? And what JDK do we consider "recommended" going forward, the oldest 
we support or the newest?

Since those recommendations apply for new clusters, people need to qualify 
their setups, and we have a high bar of quality on testing pre-merge, my gut 
tells me "performant + newest JDK". This would impact what we'd test pre-commit 
IMO.

Having been doing a lot of CI stuff lately, some observations:
 • Our True North needs to be releasing a database that's free of defects that 
violate our core properties we commit to our users. No data loss, no data 
resurrection, transient or otherwise, due to defects in our code (meteors, 
tsunamis, etc notwithstanding).
 • The relationship of time spent on CI and stability of final full 
*post-commit* runs is asymptotic. It's not even 90/10; we're probably somewhere 
like 98% value gained from 10% of work, and the other 2% "stability" (i.e. 
green test suites, not "our database works") is a long-tail slog. Especially in 
the current ASF CI heterogeneous env w/its current orchestration.
 • Thus: Pre-commit and post-commit should be different. The following points 
all apply to pre-commit:
 • The goal of pre-commit tests should be some number of 9's of no test 
failures post-commit (i.e. for every 20 green pre-commit we introduce 1 flake 
post-commit). Not full perfection; it's not worth the compute and complexity.
 • We should *build* all branches on all supported JDK's (8 + 11 for older, 
11 + 17 for newer, etc).
 • We should *run* all test suites with the *recommended configuration* 
against the *highest versioned JDK a branch supports*. And we should formally 
recommend our users run on that JDK.
 • We should *at least* run all jvm-based configurations on the highest 
supported JDK version with the "not recommended but still supported" 
configuration.
 • I'm open to being persuaded that we should at least run jvm-unit tests on 
the older JDK w/the conservative config pre-commit, but not much beyond that.
That would leave us with the following distilled:

*Pre-commit:*
 • Build on all supported jdks
 • All test suites on highest supported jdk using recommended config
 • Repeat testing on new or changed tests on highest supported JDK 
w/recommended config
 • JDK-based test suites on highest supported jdk using other config
*Post-commit:*
 • Run everything. All suites, all supported JDK's, both config files.
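The distilled lists above could be sketched as a hypothetical CI matrix (job and key names are illustrative, not the project's actual CircleCI/Jenkins definitions):

```yaml
# Illustrative pre/post-commit matrix; names are invented.
pre_commit:
  build:
    jdk: [11, 17]                          # build on all supported JDKs
  full_suites:
    jdk: [17]                              # highest supported JDK only
    config: [cassandra_latest.yaml]        # recommended config
  jvm_suites:
    jdk: [17]
    config: [cassandra.yaml]               # supported-but-not-recommended config
post_commit:
  everything:
    jdk: [11, 17]
    config: [cassandra.yaml, cassandra_latest.yaml]
```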
With Butler + the *jenkins-jira* integration script (need to dust that off but 
it should remain good to go), we should have a pretty
clear view as to when any consistent regressions are introduced and why. We'd 
remain exposed to JDK-specific flake introductions and flakes in unchanged 
tests, but there's no getting around the 2nd one and I expect the former to be 
rare enough to not warrant the compute to prevent it.

On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
> Would it make sense to only block commits on the test strategy you've listed, 
> and shift the entire massive test suite to post-commit?  If there really is 
> only a small % of times the entire suite is useful this seems like it could 
> unblock the dev cycle but still have the benefit of the full test suite.  
> 
> 
> 
> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi  
> wrote:
>> __
>> On reducing circle ci usage during dev while iterating, not with the 
>> intention to replace the pre-commit CI (yet), we could do away with testing 
>> only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo. 
>> That would greatly reduce usage. I hacked it quickly here for illustration 
>> purposes: 
>> https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1
>>  The good thing is that we have the tooling to dial in whatever we decide 
>> atm.
>> 
>> Changing pre-commit is a different discussion, to which I agree btw. But the 
>> above could save time and $ big time during dev and be done and merged in a 
>> matter of days imo.
>> 
>> I can open a DISCUSS thread if we feel it's worth it.
>> 
>> On 15/2/24 10:24, Mick Semb Wever wrote:
>>>  
 Mick and Ekaterina (and everyone really) - any thoughts on what test 
 coverage, if any, we should commit to for this new configuration? 
 Acknowledging that we already have *a lot* of CI that we run.
>>> 
>>> 
>>> 
>>> Branimir in this patch has already done some basic cleanup of test 
>>> variations, so this is not a duplication of the pipeline.  It's a 
>>> significant improvement.
>>> 
>>> I'm ok with cassandra_latest being committed and

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Jon Haddad
Would it make sense to only block commits on the test strategy you've
listed, and shift the entire massive test suite to post-commit?  If there
really is only a small % of times the entire suite is useful this seems
like it could unblock the dev cycle but still have the benefit of the full
test suite.



On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi 
wrote:

> On reducing circle ci usage during dev while iterating, not with the
> intention to replace the pre-commit CI (yet), we could get by with testing
> only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo.
> That would greatly reduce usage. I hacked it quickly here for illustration
> purposes:
> https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1
> The good thing is that we have the tooling to dial in whatever we decide
> atm.
>
> Changing pre-commit is a different discussion, to which I agree btw. But
> the above could save time and $ big time during dev and be done and merged
> in a matter of days imo.
>
> I can open a DISCUSS thread if we feel it's worth it.
> On 15/2/24 10:24, Mick Semb Wever wrote:
>
>
>
>> Mick and Ekaterina (and everyone really) - any thoughts on what test
>> coverage, if any, we should commit to for this new configuration?
>> Acknowledging that we already have *a lot* of CI that we run.
>>
>
>
>
> Branimir in this patch has already done some basic cleanup of test
> variations, so this is not a duplication of the pipeline.  It's a
> significant improvement.
>
> I'm ok with cassandra_latest being committed and added to the pipeline,
> *if* the authors genuinely believe there's significant time and effort
> saved in doing so.
>
> How many broken tests are we talking about ?
> Are they consistently broken or flaky ?
> Are they ticketed up and 5.0-rc blockers ?
>
> Having to deal with flakies and broken tests is an unfortunate reality to
> having a pipeline of 170k tests.
>
> Despite real frustrations I don't believe the broken windows analogy is
> appropriate here – it's more of a leave the campground cleaner…   That
> being said, knowingly introducing a few broken tests is not that either,
> but still having to deal with a handful of consistently breaking tests
> for a short period of time is not the same cognitive burden as flakies.
> There are currently other broken tests in 5.0: VectorUpdateDeleteTest,
> upgrade_through_versions_test; are these compounding to the frustrations ?
>
> It's also been questioned about why we don't just enable settings we
> recommend.  These are settings we recommend for new clusters.  Our existing
> cassandra.yaml needs to be tailored for existing clusters being upgraded,
> where we are very conservative about changing defaults.
>
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Branimir Lambov
Paulo:

> 1) Will cassandra.yaml remain the default test config? Is the plan moving
> forward to require green CI for both configurations on pre-commit, or
> pre-release?

The plan is to ensure both configurations are green pre-commit. This should
not increase the CI cost as this replaces extra configurations we were
running before (e.g. test-tries).

2) What will this mean for the release artifact, is the idea to continue
> shipping with the current cassandra.yaml or eventually switch to the
> optimized configuration (ie. 6.X) while making the legacy default
> configuration available via an optional flag?

The release simply includes an additional yaml file, which contains a
one-liner on how to use it.
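For illustration, that one-liner would presumably point Cassandra at the alternative yaml via the standard config-override property. A minimal sketch (the file name `cassandra_latest.yaml` and the install path are assumptions; `-Dcassandra.config` is the usual mechanism for overriding the config location):

```shell
# Sketch: start Cassandra against the alternative yaml shipped in the release.
# Path and file name are placeholders; adjust to the actual install layout.
CONFIG_URL="file:///etc/cassandra/cassandra_latest.yaml"
JVM_EXTRA_OPTS="-Dcassandra.config=${CONFIG_URL}"
# A real invocation would then be:
#   JVM_EXTRA_OPTS="$JVM_EXTRA_OPTS" bin/cassandra
echo "$JVM_EXTRA_OPTS"
```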

Jeff:

> 1) If there’s an “old compatible default” and “latest recommended
> settings”, when does the value in “old compatible default” get updated?
> Never?

This does not change anything about these decisions. The question is very
serious without this patch as well: Does V6 have to support pain-free
upgrade from V5 working in V4 compatible mode? If so, can we ever deprecate
or drop anything? If not, are we not breaking upgradeability promises?

2) If there are test failures with the new values, it seems REALLY
> IMPORTANT to make sure those test failures are discovered + fixed IN THE
> FUTURE TOO. If pushing new yaml into a different file makes us less likely
> to catch the failures in the future, it seems like we’re hurting ourselves.
> Branimir mentions this, but how do we ensure that we don’t let this pattern
> disguise future bugs?

The main objective of this patch is to ensure that the second yaml is
tested too, pre-commit. We were not doing this for all features we tell
users are supported.

Paulo:

> - if cassandra_latest.yaml becomes the new default configuration for 6.0,
> then precommit only needs to be run against that version - prerelease needs
> to be run against all cassandra.yaml variants.

Assuming we keep the pace of development, there will be new "latest"
features in 6.0 (e.g. Accord could be one). The idea is more to move some
of the settings from latest to default when they are deemed mature enough.

Josh:

> I propose to significantly reduce that stuff. Let's distinguish the
> packages of tests that need to be run with CDC enabled / disabled, with
> commitlog compression enabled / disabled, tests that verify sstable formats
> (mostly io and index I guess), and leave other parameters set as with the
> latest configuration - this is the easiest way I think.
> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
> other stuff. To me running no-vnodes makes no sense because no-vnodes is
> just a special case of vnodes=1. On the other hand offheap/onheap buffers
> could be tested in unit tests. In short, I'd run dtests only with the
> default and latest configuration.

Some of these changes are already done in this ticket.

Regards,
Branimir



On Thu, Feb 15, 2024 at 3:08 PM Paulo Motta  wrote:

> > It's also been questioned about why we don't just enable settings we
> recommend.  These are settings we recommend for new clusters.  *Our
> existing cassandra.yaml needs to be tailored for existing clusters being
> upgraded, where we are very conservative about changing defaults.*
>
> I think this unnecessarily penalizes new users with subpar defaults and
> existing users who wish to use optimized/recommended defaults and need to
> maintain additional logic to support that. This change offers an
> opportunity to revisit this.
>
> Is not updating the default cassandra.yaml with new recommended
> configuration just to protect existing clusters from accidentally
> overriding cassandra.yaml with a new version during major upgrades? If so,
> perhaps we could add a new explicit flag “enable_major_upgrade: false” to
> “cassandra.yaml” that fails startup if an upgrade is detected and forces
> operators to review the configuration before a major upgrade?
>
> Related to Jeff’s question, I think we need a way to consolidate “latest
> recommended settings” into “old compatible default” when cutting a new
> major version, otherwise the files will diverge perpetually.
>
> I think cassandra_latest.yaml offers a way to “buffer” proposals for
> default configuration changes which are consolidated into “cassandra.yaml”
> in the subsequent major release, eventually converging configurations and
> reducing the maintenance burden.
>
> On Thu, 15 Feb 2024 at 04:24 Mick Semb Wever  wrote:
>
>>
>>
>>> Mick and Ekaterina (and everyone really) - any thoughts on what test
>>> coverage, if any, we should commit to for this new configuration?
>>> Acknowledging that we already have *a lot* of CI that we run.
>>>
>>
>>
>>
>> Branimir in this patch has already done some basic cleanup of test
>> variations, so this is not a duplication of the pipeline.  It's a
>> significant improvement.
>>
>> I'm ok with cassandra_latest being committed and added to the pipeline,
>> *if* the authors genuinely believe there's significant time and effort saved in doing so.

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Paulo Motta
> It's also been questioned about why we don't just enable settings we
recommend.  These are settings we recommend for new clusters.  *Our
existing cassandra.yaml needs to be tailored for existing clusters being
upgraded, where we are very conservative about changing defaults.*

I think this unnecessarily penalizes new users with subpar defaults and
existing users who wish to use optimized/recommended defaults and need to
maintain additional logic to support that. This change offers an
opportunity to revisit this.

Is not updating the default cassandra.yaml with new recommended
configuration just to protect existing clusters from accidentally
overriding cassandra.yaml with a new version during major upgrades? If so,
perhaps we could add a new explicit flag “enable_major_upgrade: false” to
“cassandra.yaml” that fails startup if an upgrade is detected and forces
operators to review the configuration before a major upgrade?

Related to Jeff’s question, I think we need a way to consolidate “latest
recommended settings” into “old compatible default” when cutting a new
major version, otherwise the files will diverge perpetually.

I think cassandra_latest.yaml offers a way to “buffer” proposals for
default configuration changes which are consolidated into “cassandra.yaml”
in the subsequent major release, eventually converging configurations and
reducing the maintenance burden.

On Thu, 15 Feb 2024 at 04:24 Mick Semb Wever  wrote:

>
>
>> Mick and Ekaterina (and everyone really) - any thoughts on what test
>> coverage, if any, we should commit to for this new configuration?
>> Acknowledging that we already have *a lot* of CI that we run.
>>
>
>
>
> Branimir in this patch has already done some basic cleanup of test
> variations, so this is not a duplication of the pipeline.  It's a
> significant improvement.
>
> I'm ok with cassandra_latest being committed and added to the pipeline,
> *if* the authors genuinely believe there's significant time and effort
> saved in doing so.
>
> How many broken tests are we talking about ?
> Are they consistently broken or flaky ?
> Are they ticketed up and 5.0-rc blockers ?
>
> Having to deal with flakies and broken tests is an unfortunate reality to
> having a pipeline of 170k tests.
>
> Despite real frustrations I don't believe the broken windows analogy is
> appropriate here – it's more of a leave the campground cleaner…   That
> being said, knowingly introducing a few broken tests is not that either,
> but still having to deal with a handful of consistently breaking tests
> for a short period of time is not the same cognitive burden as flakies.
> There are currently other broken tests in 5.0: VectorUpdateDeleteTest,
> upgrade_through_versions_test; are these compounding to the frustrations ?
>
> It's also been questioned about why we don't just enable settings we
> recommend.  These are settings we recommend for new clusters.  Our existing
> cassandra.yaml needs to be tailored for existing clusters being upgraded,
> where we are very conservative about changing defaults.
>
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Berenguer Blasi
On reducing circle ci usage during dev while iterating, not with the 
intention to replace the pre-commit CI (yet), we could get by with 
testing only dtests, jvm-dtests, units and cqlsh for a _single_ 
configuration imo. That would greatly reduce usage. I hacked it quickly 
here for illustration purposes: 
https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1 
The good thing is that we have the tooling to dial in whatever we decide 
atm.


Changing pre-commit is a different discussion, to which I agree btw. But 
the above could save time and $ big time during dev and be done and 
merged in a matter of days imo.


I can open a DISCUSS thread if we feel it's worth it.

On 15/2/24 10:24, Mick Semb Wever wrote:


Mick and Ekaterina (and everyone really) - any thoughts on what
test coverage, if any, we should commit to for this new
configuration? Acknowledging that we already have /a lot/ of CI
that we run.




Branimir in this patch has already done some basic cleanup of test 
variations, so this is not a duplication of the pipeline.  It's a 
significant improvement.


I'm ok with cassandra_latest being committed and added to the 
pipeline, *if* the authors genuinely believe there's significant time 
and effort saved in doing so.


How many broken tests are we talking about ?
Are they consistently broken or flaky ?
Are they ticketed up and 5.0-rc blockers ?

Having to deal with flakies and broken tests is an unfortunate reality 
to having a pipeline of 170k tests.


Despite real frustrations I don't believe the broken windows analogy 
is appropriate here – it's more of a leave the campground cleaner… That 
being said, knowingly introducing a few broken tests is not that 
either, but still having to deal with a handful of consistently 
breaking tests for a short period of time is not the same cognitive 
burden as flakies. There are currently other broken tests in 5.0: 
VectorUpdateDeleteTest, upgrade_through_versions_test; are these 
compounding to the frustrations ?


It's also been questioned about why we don't just enable settings we 
recommend.  These are settings we recommend for new clusters.  Our 
existing cassandra.yaml needs to be tailored for existing clusters 
being upgraded, where we are very conservative about changing defaults.


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Berenguer Blasi
On the merging failing tests discussion I _do_ spend the time looking if 
my patch did cause them or not and I certainly enforce that in the 
reviews I do. The current failures are a manageable number to check 
against Butler/Jenkins/Circle/Jira so I was under the impression 
everybody else was also doing it.


Thanks for bringing up the CI discussion. I have been advocating 
internally to cut down circle CI usage for 1y. I am happy to see the 
concern is shared. We also run the same dtest 4 times at least: vnodes, 
no vnodes,... cqlsh a number of times... unit tests the same... We're 
well beyond the 4.0 release now; back then I would see failures in 
junit-compression not found in the other variations. That was 
meaningful in those days. But I can't remember when I last found 
a failure _specific_ to a particular test flavor: cdc, compression,... I 
think it would be better ROI to let those super-rare (nowadays) failures be 
caught by nightly runs.


On 15/2/24 8:53, Jacek Lewandowski wrote:
I fully understand you. Although I have that luxury to use more 
containers, I simply feel that rerunning the same code with different 
configurations which do not impact that code is just a waste of 
resources and money.


- - -- --- -  -
Jacek Lewandowski


czw., 15 lut 2024 o 08:41 Štefan Miklošovič 
 napisał(a):


By the way, I am not sure if it is all completely transparent and
understood by everybody but let me guide you through a typical
patch which is meant to be applied from 4.0 to trunk (4 branches)
to see what it looks like.

I do not have the luxury of running CircleCI on 100 containers, I
have just 25. So what takes around 2.5h for 100 containers takes
around 6-7 for 25. That is a typical java11_pre-commit_tests for
trunk. Then I have to provide builds for java17_pre-commit_tests
too, that takes around 3-4 hours because it just tests less, let's
round it up to 10 hours for trunk.

Then I need to do this for 5.0 as well, basically double the time
because as I am writing this the difference is not too big between
these two branches. So 20 hours.

Then I need to build 4.1 and 4.0 too, 4.0 is very similar to 4.1
when it comes to the number of tests, nevertheless, there are
workflows for Java 8 and Java 11 for each, so let's say this takes
10 hours again. So together I'm at 35 hours.

To schedule all the builds, trigger them, monitor their progress
etc is work in itself. I am scripting this like crazy to not touch
the UI in Circle at all and I made my custom scripts which call
Circle API and it triggers the builds from the console to speed
this up because as soon as a developer is meant to be clicking
around all day, needing to track the progress, it gets old
pretty quickly.

Thank god this is just a patch from 4.0, when it comes to 3.0 and
3.11 just add more hours to that.

So all in all, a typical 4.0 - trunk patch is tested for two days
at least, that's when all is nice and I do not need to rework it
and rerun it again ... Does this all sound flexible and speedy
enough for people?

If we dropped the formal necessity to build various jvms it would
significantly speed up the development.


On Thu, Feb 15, 2024 at 8:10 AM Jacek Lewandowski
 wrote:

Excellent point, I was saying for some time that IMHO we
can reduce to running in CI at least pre-commit:
1) Build J11 2) build J17
3) run tests with build 11 + runtime 11
4) run tests with build 11 and runtime 17.


Ekaterina, I was thinking more about:
1) build J11
2) build J17
3) run tests with build J11 + runtime J11
4) run smoke tests with build J17 and runtime J17

Again, I don't see value in running the J11 build under both the
J11 and J17 runtimes - just pick one unless we change
something specific to the JVM.

If we need to decide whether to test the latest or default, I
think we should pick the latest because this is actually
Cassandra 5.0 defined as a set of new features that will shine
on the website.

Also - we have configurations which test some features, but
they are more like dimensions:
- commit log compression
- sstable compression
- CDC
- Trie memtables
- Trie SSTable format
- Extended deletion time
...

Currently, what we call the default configuration is
tested with:
- no compression, no CDC, no extended deletion time
- *commit log compression + sstable compression*, no cdc, no
extended deletion time
- no compression, *CDC enabled*, no extended deletion time
- no compression, no CDC, *enabled extended deletion time*

This applies only to unit tests of course

Then, are we going to test all of those

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Mick Semb Wever
> Mick and Ekaterina (and everyone really) - any thoughts on what test
> coverage, if any, we should commit to for this new configuration?
> Acknowledging that we already have *a lot* of CI that we run.
>



Branimir in this patch has already done some basic cleanup of test
variations, so this is not a duplication of the pipeline.  It's a
significant improvement.

I'm ok with cassandra_latest being committed and added to the pipeline,
*if* the authors genuinely believe there's significant time and effort
saved in doing so.

How many broken tests are we talking about ?
Are they consistently broken or flaky ?
Are they ticketed up and 5.0-rc blockers ?

Having to deal with flakies and broken tests is an unfortunate reality to
having a pipeline of 170k tests.

Despite real frustrations I don't believe the broken windows analogy is
appropriate here – it's more of a leave the campground cleaner…   That
being said, knowingly introducing a few broken tests is not that either,
but still having to deal with a handful of consistently breaking tests for
a short period of time is not the same cognitive burden as flakies.
There are currently other broken tests in 5.0: VectorUpdateDeleteTest,
upgrade_through_versions_test; are these compounding to the frustrations ?

It's also been questioned about why we don't just enable settings we
recommend.  These are settings we recommend for new clusters.  Our existing
cassandra.yaml needs to be tailored for existing clusters being upgraded,
where we are very conservative about changing defaults.


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jacek Lewandowski
I fully understand you. Although I have that luxury to use more containers,
I simply feel that rerunning the same code with different configurations
which do not impact that code is just a waste of resources and money.

- - -- --- -  -
Jacek Lewandowski


czw., 15 lut 2024 o 08:41 Štefan Miklošovič 
napisał(a):

> By the way, I am not sure if it is all completely transparent and
> understood by everybody but let me guide you through a typical patch which
> is meant to be applied from 4.0 to trunk (4 branches) to see what it looks
> like.
>
> I do not have the luxury of running CircleCI on 100 containers, I have
> just 25. So what takes around 2.5h for 100 containers takes around 6-7 for
> 25. That is a typical java11_pre-commit_tests for trunk. Then I have to
> provide builds for java17_pre-commit_tests too, that takes around 3-4 hours
> because it just tests less, let's round it up to 10 hours for trunk.
>
> Then I need to do this for 5.0 as well, basically double the time because
> as I am writing this the difference is not too big between these two
> branches. So 20 hours.
>
> Then I need to build 4.1 and 4.0 too, 4.0 is very similar to 4.1 when it
> comes to the number of tests, nevertheless, there are workflows for Java 8
> and Java 11 for each, so let's say this takes 10 hours again. So together I'm
> at 35 hours.
>
> To schedule all the builds, trigger them, monitor their progress etc is
> work in itself. I am scripting this like crazy to not touch the UI in
> Circle at all and I made my custom scripts which call Circle API and it
> triggers the builds from the console to speed this up because as soon as a
> developer is meant to be clicking around all day, needing to track the
> progress, it gets old pretty quickly.
>
> Thank god this is just a patch from 4.0, when it comes to 3.0 and 3.11
> just add more hours to that.
>
> So all in all, a typical 4.0 - trunk patch is tested for two days at
> least, that's when all is nice and I do not need to rework it and rerun it
> again ... Does this all sound flexible and speedy enough for people?
>
> If we dropped the formal necessity to build various jvms it would
> significantly speed up the development.
>
>
> On Thu, Feb 15, 2024 at 8:10 AM Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> Excellent point, I was saying for some time that IMHO we can reduce
>>> to running in CI at least pre-commit:
>>> 1) Build J11 2) build J17
>>> 3) run tests with build 11 + runtime 11
>>> 4) run tests with build 11 and runtime 17.
>>
>>
>> Ekaterina, I was thinking more about:
>> 1) build J11
>> 2) build J17
>> 3) run tests with build J11 + runtime J11
>> 4) run smoke tests with build J17 and runtime J17
>>
>> Again, I don't see value in running the J11 build under both the J11 and
>> J17 runtimes - just pick one unless we change something
>> specific to the JVM.
>>
>> If we need to decide whether to test the latest or default, I think we
>> should pick the latest because this is actually Cassandra 5.0 defined as a
>> set of new features that will shine on the website.
>>
>> Also - we have configurations which test some features, but they are more
>> like dimensions:
>> - commit log compression
>> - sstable compression
>> - CDC
>> - Trie memtables
>> - Trie SSTable format
>> - Extended deletion time
>> ...
>>
>> Currently, what we call the default configuration is tested with:
>> - no compression, no CDC, no extended deletion time
>> - *commit log compression + sstable compression*, no cdc, no extended
>> deletion time
>> - no compression, *CDC enabled*, no extended deletion time
>> - no compression, no CDC, *enabled extended deletion time*
>>
>> This applies only to unit tests of course
>>
>> Then, are we going to test all of those scenarios with the "latest"
>> configuration? I'm asking because the latest configuration is mostly about
>> tries and UCS and has nothing to do with compression or CDC. Then why should
>> the default configuration be tested more thoroughly than the latest, which
>> enables essential Cassandra 5.0 features?
>>
>> I propose to significantly reduce that stuff. Let's distinguish the
>> packages of tests that need to be run with CDC enabled / disabled, with
>> commitlog compression enabled / disabled, tests that verify sstable formats
>> (mostly io and index I guess), and leave other parameters set as with the
>> latest configuration - this is the easiest way I think.
>>
>> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
>> other stuff. To me running no-vnodes makes no sense because no-vnodes is
>> just a special case of vnodes=1. On the other hand offheap/onheap buffers
>> could be tested in unit tests. In short, I'd run dtests only with the
>> default and latest configuration.
>>
>> Sorry for being too wordy,
>>
>>
>> czw., 15 lut 2024 o 07:39 Štefan Miklošovič 
>> napisał(a):
>>
>>> Something along what Paulo is proposing makes sense to me. To sum it up,
>>> knowing what workflows we have now:
>>>

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Štefan Miklošovič
By the way, I am not sure if it is all completely transparent and
understood by everybody but let me guide you through a typical patch which
is meant to be applied from 4.0 to trunk (4 branches) to see what it looks
like.

I do not have the luxury of running CircleCI on 100 containers, I have just
25. So what takes around 2.5h for 100 containers takes around 6-7 for 25.
That is a typical java11_pre-commit_tests for trunk. Then I have to provide
builds for java17_pre-commit_tests too, that takes around 3-4 hours because
it just tests less, let's round it up to 10 hours for trunk.

Then I need to do this for 5.0 as well, basically double the time because
as I am writing this the difference is not too big between these two
branches. So 20 hours.

Then I need to build 4.1 and 4.0 too, 4.0 is very similar to 4.1 when it
comes to the number of tests, nevertheless, there are workflows for Java 8
and Java 11 for each, so let's say this takes 10 hours again. So together I'm
at 35 hours.

To schedule all the builds, trigger them, monitor their progress etc is
work in itself. I am scripting this like crazy to not touch the UI in
Circle at all and I made my custom scripts which call Circle API and it
triggers the builds from the console to speed this up because as soon as a
developer is meant to be clicking around all day, needing to track the
progress, it gets old pretty quickly.
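A trigger script along these lines might be sketched as follows. This is a minimal sketch, assuming CircleCI's v2 "trigger pipeline" endpoint; the project slug and branch name are placeholders, and CIRCLE_TOKEN would need to hold a personal API token:

```shell
# Sketch: trigger a CircleCI pipeline from the console instead of the UI.
PROJECT_SLUG="gh/yourfork/cassandra"   # hypothetical fork
BRANCH="CASSANDRA-18753-trunk"         # hypothetical working branch
URL="https://circleci.com/api/v2/project/${PROJECT_SLUG}/pipeline"
BODY="{\"branch\": \"${BRANCH}\"}"
# The actual call (not executed here) would be:
#   curl -s -X POST "$URL" -H "Circle-Token: $CIRCLE_TOKEN" \
#        -H "Content-Type: application/json" -d "$BODY"
echo "$URL"
```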

Thank god this is just a patch from 4.0, when it comes to 3.0 and 3.11 just
add more hours to that.

So all in all, a typical 4.0 - trunk patch is tested for two days at least,
that's when all is nice and I do not need to rework it and rerun it again
... Does this all sound flexible and speedy enough for people?

If we dropped the formal necessity to build various jvms it would
significantly speed up the development.


On Thu, Feb 15, 2024 at 8:10 AM Jacek Lewandowski <
lewandowski.ja...@gmail.com> wrote:

> Excellent point, I was saying for some time that IMHO we can reduce
>> to running in CI at least pre-commit:
>> 1) Build J11 2) build J17
>> 3) run tests with build 11 + runtime 11
>> 4) run tests with build 11 and runtime 17.
>
>
> Ekaterina, I was thinking more about:
> 1) build J11
> 2) build J17
> 3) run tests with build J11 + runtime J11
> 4) run smoke tests with build J17 and runtime J17
>
> Again, I don't see value in running the J11 build under both the J11 and
> J17 runtimes - just pick one unless we change something
> specific to the JVM.
>
> If we need to decide whether to test the latest or default, I think we
> should pick the latest because this is actually Cassandra 5.0 defined as a
> set of new features that will shine on the website.
>
> Also - we have configurations which test some features, but they are more
> like dimensions:
> - commit log compression
> - sstable compression
> - CDC
> - Trie memtables
> - Trie SSTable format
> - Extended deletion time
> ...
>
> Currently, what we call the default configuration is tested with:
> - no compression, no CDC, no extended deletion time
> - *commit log compression + sstable compression*, no cdc, no extended
> deletion time
> - no compression, *CDC enabled*, no extended deletion time
> - no compression, no CDC, *enabled extended deletion time*
>
> This applies only to unit tests of course
>
> Then, are we going to test all of those scenarios with the "latest"
> configuration? I'm asking because the latest configuration is mostly about
> tries and UCS and has nothing to do with compression or CDC. Then why should
> the default configuration be tested more thoroughly than the latest, which
> enables essential Cassandra 5.0 features?
>
> I propose to significantly reduce that stuff. Let's distinguish the
> packages of tests that need to be run with CDC enabled / disabled, with
> commitlog compression enabled / disabled, tests that verify sstable formats
> (mostly io and index I guess), and leave other parameters set as with the
> latest configuration - this is the easiest way I think.
>
> For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
> other stuff. To me running no-vnodes makes no sense because no-vnodes is
> just a special case of vnodes=1. On the other hand offheap/onheap buffers
> could be tested in unit tests. In short, I'd run dtests only with the
> default and latest configuration.
>
> Sorry for being too wordy,
>
>
> czw., 15 lut 2024 o 07:39 Štefan Miklošovič 
> napisał(a):
>
>> Something along what Paulo is proposing makes sense to me. To sum it up,
>> knowing what workflows we have now:
>>
>> java17_pre-commit_tests
>> java11_pre-commit_tests
>> java17_separate_tests
>> java11_separate_tests
>>
>> We would have a couple more, together something like:
>>
>> java17_pre-commit_tests
>> java17_pre-commit_tests-latest-yaml
>> java11_pre-commit_tests
>> java11_pre-commit_tests-latest-yaml
>> java17_separate_tests
>> java17_separate_tests-default-yaml
>> java11_separate_tests
>> java11_separate_tests-latest-yaml
>>
>> To go over Paulo's plan, his steps 1-3 for 5.0 would result in requiring
>> just one workflow, java11_pre-commit_tests, when no configuration is touched.

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jacek Lewandowski
>
> Excellent point, I was saying for some time that IMHO we can reduce
> to running in CI at least pre-commit:
> 1) Build J11 2) build J17
> 3) run tests with build 11 + runtime 11
> 4) run tests with build 11 and runtime 17.


Ekaterina, I was thinking more about:
1) build J11
2) build J17
3) run tests with build J11 + runtime J11
4) run smoke tests with build J17 and runtime J17

Again, I don't see value in running the J11 build under both the J11 and J17
runtimes - just pick one unless we change something specific to the JVM.

If we need to decide whether to test the latest or default, I think we
should pick the latest because this is actually Cassandra 5.0 defined as a
set of new features that will shine on the website.

Also - we have configurations which test some features, but they are more like
dimensions:
- commit log compression
- sstable compression
- CDC
- Trie memtables
- Trie SSTable format
- Extended deletion time
...

Currently, what we call the default configuration is tested with:
- no compression, no CDC, no extended deletion time
- *commit log compression + sstable compression*, no cdc, no extended
deletion time
- no compression, *CDC enabled*, no extended deletion time
- no compression, no CDC, *enabled extended deletion time*

This applies only to unit tests of course
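The combinatorial cost being argued against can be sketched with quick arithmetic. The counts below are taken from the lists in this message, not from the actual CI definitions, so treat this as illustration only:

```shell
# Back-of-the-envelope sketch of the test-matrix combinatorics.
unit_variants=4   # plain; commitlog+sstable compression; CDC; extended deletion time
configs=2         # default yaml and latest yaml
echo "unit jobs if every variant ran for every config: $((unit_variants * configs))"

dtest_dims=$((2 * 2))   # vnodes/no-vnodes x offheap/onheap
echo "dtest variants per config today: $dtest_dims"
echo "dtest variants under the proposal: 1 per config (default and latest only)"
```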

Then, are we going to test all of those scenarios with the "latest"
configuration? I'm asking because the latest configuration is mostly about
tries and UCS and has nothing to do with compression or CDC. Then why should
the default configuration be tested more thoroughly than the latest, which
enables essential Cassandra 5.0 features?

I propose to significantly reduce that stuff. Let's distinguish the
packages of tests that need to be run with CDC enabled / disabled, with
commitlog compression enabled / disabled, tests that verify sstable formats
(mostly io and index I guess), and leave other parameters set as with the
latest configuration - this is the easiest way I think.

For dtests we have vnodes/no-vnodes, offheap/onheap, and nothing about
other stuff. To me running no-vnodes makes no sense because no-vnodes is
just a special case of vnodes=1. On the other hand offheap/onheap buffers
could be tested in unit tests. In short, I'd run dtests only with the
default and latest configuration.
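To make the idea concrete, here is a minimal sketch of such a reduced matrix (the package and dimension names below are hypothetical, not the project's actual test targets): feature-specific dimensions are applied only to the test packages they affect, and everything else runs once with the latest configuration.

```python
# Sketch of the proposed reduced test matrix; package and dimension
# names are illustrative, not actual Cassandra test targets.

FEATURE_DIMENSIONS = {
    "commitlog": ["commitlog_compression_on", "commitlog_compression_off"],
    "io":        ["sstable_compression_on", "sstable_compression_off"],
    "cdc":       ["cdc_on", "cdc_off"],
}

ALL_PACKAGES = ["commitlog", "io", "cdc", "cql", "index", "utils"]

def build_matrix(packages):
    """Run each package once per relevant feature dimension, or just
    once with the latest-configuration defaults if none applies."""
    return [(pkg, variant)
            for pkg in packages
            for variant in FEATURE_DIMENSIONS.get(pkg, ["latest_defaults"])]

matrix = build_matrix(ALL_PACKAGES)
print(len(matrix))  # 3 packages x 2 variants + 3 packages x 1 variant = 9 runs
```

Compare this with running every unit test under every one of the four variants listed above: 6 packages x 4 configurations = 24 runs instead of 9.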

Sorry for being too wordy,


On Thu, 15 Feb 2024 at 07:39, Štefan Miklošovič wrote:

> Something along what Paulo is proposing makes sense to me. To sum it up,
> knowing what workflows we have now:
>
> java17_pre-commit_tests
> java11_pre-commit_tests
> java17_separate_tests
> java11_separate_tests
>
> We would have a couple more; together:
>
> java17_pre-commit_tests
> java17_pre-commit_tests-latest-yaml
> java11_pre-commit_tests
> java11_pre-commit_tests-latest-yaml
> java17_separate_tests
> java17_separate_tests-default-yaml
> java11_separate_tests
> java11_separate_tests-latest-yaml
>
> To go over Paulo's plan, his steps 1-3 for 5.0 would result in requiring
> just one workflow
>
> java11_pre-commit_tests
>
> when no configuration is touched and two workflows
>
> java11_pre-commit_tests
> java11_pre-commit_tests-latest-yaml
>
> when there is some configuration change.
>
> Now, the term "some configuration change" is quite tricky, and it is not
> always easy to evaluate whether both the default and latest yaml workflows
> need to be executed. A change might not touch the configuration at all,
> yet still need to be verified against both scenarios; or a change might
> make sense in isolation for the default config but not work with
> -latest.yaml. I don't know if this is just a theoretical problem, but my
> gut feeling is that we would be safer if we simply required both the
> default and latest yaml workflows together.
>
> Even if we do, we basically replace "two JVMs" builds with "two yamls"
> builds, and I consider "two yamls" builds to be more valuable in general
> than "two JVMs" builds. It would take basically the same amount of time;
> we would just reorient our build matrix from different JVMs to different
> yamls.
>
> For releases we would, of course, still need to run it across JVMs too.
>
> On Thu, Feb 15, 2024 at 7:05 AM Paulo Motta  wrote:
>
>> > Perhaps it is also a good opportunity to distinguish subsets of tests
>> which make sense to run with a configuration matrix.
>>
>> Agree. I think we should define a “standard/golden” configuration for
>> each branch and minimally require precommit tests for that configuration.
>> Assignees and reviewers can determine if additional test variants are
>> required based on the patch scope.
>>
>> Nightly and prerelease tests can be run to catch any issues outside the
>> standard configuration based on the supported configuration matrix.
>>
>> On Wed, 14 Feb 2024 at 15:32 Jacek Lewandowski <
>> lewandowski.ja...@gmail.com> wrote:
>>
>>> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Štefan Miklošovič
Something along what Paulo is proposing makes sense to me. To sum it up,
knowing what workflows we have now:

java17_pre-commit_tests
java11_pre-commit_tests
java17_separate_tests
java11_separate_tests

We would have a couple more; together:

java17_pre-commit_tests
java17_pre-commit_tests-latest-yaml
java11_pre-commit_tests
java11_pre-commit_tests-latest-yaml
java17_separate_tests
java17_separate_tests-default-yaml
java11_separate_tests
java11_separate_tests-latest-yaml

To go over Paulo's plan, his steps 1-3 for 5.0 would result in requiring
just one workflow

java11_pre-commit_tests

when no configuration is touched and two workflows

java11_pre-commit_tests
java11_pre-commit_tests-latest-yaml

when there is some configuration change.

Now, the term "some configuration change" is quite tricky, and it is not
always easy to evaluate whether both the default and latest yaml workflows
need to be executed. A change might not touch the configuration at all, yet
still need to be verified against both scenarios; or a change might make
sense in isolation for the default config but not work with -latest.yaml. I
don't know if this is just a theoretical problem, but my gut feeling is
that we would be safer if we simply required both the default and latest
yaml workflows together.

Even if we do, we basically replace "two JVMs" builds with "two yamls"
builds, and I consider "two yamls" builds to be more valuable in general
than "two JVMs" builds. It would take basically the same amount of time; we
would just reorient our build matrix from different JVMs to different
yamls.

For releases we would, of course, still need to run it across JVMs too.
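To put the cost argument in code - a toy illustration (with made-up workflow labels, not the actual CircleCI job names) that swapping the matrix axis from JVMs to yaml variants keeps the number of pre-commit pipelines the same:

```python
from itertools import product

def pipelines(jvms, yamls):
    # Each (jvm, yaml) pair corresponds to one full test pipeline.
    return [f"{jvm}_pre-commit_tests-{yaml}" for jvm, yaml in product(jvms, yamls)]

current  = pipelines(["java11", "java17"], ["default"])   # two JVMs, one yaml
proposed = pipelines(["java11"], ["default", "latest"])   # one JVM, two yamls

print(len(current), len(proposed))  # 2 2 - same CI cost, different coverage
```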

On Thu, Feb 15, 2024 at 7:05 AM Paulo Motta  wrote:

> > Perhaps it is also a good opportunity to distinguish subsets of tests
> which make sense to run with a configuration matrix.
>
> Agree. I think we should define a “standard/golden” configuration for each
> branch and minimally require precommit tests for that configuration.
> Assignees and reviewers can determine if additional test variants are
> required based on the patch scope.
>
> Nightly and prerelease tests can be run to catch any issues outside the
> standard configuration based on the supported configuration matrix.
>
> On Wed, 14 Feb 2024 at 15:32 Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>>
>>> When we have failing tests people do not spend the time to figure out if
>>> their logic caused a regression and merge, making things more unstable… so
>>> when we merge failing tests that leads to people merging even more failing
>>> tests...
>>>
>>> What's the counter position to this Jacek / Berenguer?
>>>
>>
>> For how long are we going to deceive ourselves? Are we shipping those
>> features or not? Perhaps it is also a good opportunity to distinguish
>> subsets of tests which make sense to run with a configuration matrix.
>>
>> If we don't add those tests to the pre-commit pipeline, "people do not
>> spend the time to figure out if their logic caused a regression and merge,
>> making things more unstable…"
>> I think it is much more valuable to test those various configurations
>> rather than test against j11 and j17 separately. I can see really little
>> value in doing that.
>>
>>
>>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Paulo Motta
> Perhaps it is also a good opportunity to distinguish subsets of tests
which make sense to run with a configuration matrix.

Agree. I think we should define a “standard/golden” configuration for each
branch and minimally require precommit tests for that configuration.
Assignees and reviewers can determine if additional test variants are
required based on the patch scope.

Nightly and prerelease tests can be run to catch any issues outside the
standard configuration based on the supported configuration matrix.

On Wed, 14 Feb 2024 at 15:32 Jacek Lewandowski 
wrote:

> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>
>> When we have failing tests people do not spend the time to figure out if
>> their logic caused a regression and merge, making things more unstable… so
>> when we merge failing tests that leads to people merging even more failing
>> tests...
>>
>> What's the counter position to this Jacek / Berenguer?
>>
>
> For how long are we going to deceive ourselves? Are we shipping those
> features or not? Perhaps it is also a good opportunity to distinguish
> subsets of tests which make sense to run with a configuration matrix.
>
> If we don't add those tests to the pre-commit pipeline, "people do not
> spend the time to figure out if their logic caused a regression and merge,
> making things more unstable…"
> I think it is much more valuable to test those various configurations
> rather than test against j11 and j17 separately. I can see really little
> value in doing that.
>
>
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Paulo Motta
> If there’s an “old compatible default” and “latest recommended settings”,
when does the value in “old compatible default” get updated? Never?

How about replacing cassandra.yaml with cassandra_latest.yaml on trunk when
cutting cassandra-6.0 branch? Any new default changes on trunk go to
cassandra_latest.yaml.

Basically major branch creation syncs cassandra_latest.yaml with
cassandra.yaml on trunk, and default changes on trunk are added to
cassandra_latest.yaml which will be eventually synced to cassandra.yaml
when the next major is cut.
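A hedged sketch of that branch-cut step as release automation might implement it (a hypothetical helper, not an actual project script; the num_tokens values are illustrative only):

```python
import shutil
import tempfile
from pathlib import Path

def cut_major_branch(conf_dir: Path) -> None:
    """On cutting a new major branch, promote cassandra_latest.yaml to be
    the new cassandra.yaml; default changes on trunk then accumulate in
    cassandra_latest.yaml until the next major is cut."""
    shutil.copyfile(conf_dir / "cassandra_latest.yaml",
                    conf_dir / "cassandra.yaml")

# Demo against a throwaway conf directory:
conf = Path(tempfile.mkdtemp())
(conf / "cassandra.yaml").write_text("num_tokens: 256\n")        # legacy default
(conf / "cassandra_latest.yaml").write_text("num_tokens: 16\n")  # newer default
cut_major_branch(conf)
print((conf / "cassandra.yaml").read_text())  # num_tokens: 16
```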

On Wed, 14 Feb 2024 at 13:42 Jeff Jirsa  wrote:

> 1) If there’s an “old compatible default” and “latest recommended
> settings”, when does the value in “old compatible default” get updated?
> Never?
> 2) If there are test failures with the new values, it seems REALLY
> IMPORTANT to make sure those test failures are discovered + fixed IN THE
> FUTURE TOO. If pushing new yaml into a different file makes us less likely
> to catch the failures in the future, it seems like we’re hurting ourselves.
> Branimir mentions this, but how do we ensure that we don’t let this pattern
> disguise future bugs?
>
>
>
>
>
> On Feb 13, 2024, at 8:41 AM, Branimir Lambov  wrote:
>
> Hi All,
>
> CASSANDRA-18753 introduces a second set of defaults (in a separate
> "cassandra_latest.yaml") that enable new features of Cassandra. The
> objective is two-fold: to be able to test the database in this
> configuration, and to point potential users that are evaluating the
> technology to an optimized set of defaults that give a clearer picture of
> the expected performance of the database for a new user. The objective is
> to get this configuration into 5.0 to have the extra bit of confidence that
> we are not releasing (and recommending) options that have not gone through
> thorough CI.
>
> The implementation has already gone through review, but I'd like to get
> people's opinion on two things:
> - There are currently a number of test failures when the new options are
> selected, some of which appear to be genuine problems. Is the community
> okay with committing the patch before all of these are addressed? This
> should prevent the introduction of new failures and make sure we don't
> release before clearing the existing ones.
> - I'd like to get an opinion on what's suitable wording and documentation
> for the new defaults set. Currently, the patch proposes adding the
> following text to the yaml (see
> https://github.com/apache/cassandra/pull/2896/files):
> # NOTE:
> #   This file is provided in two versions:
> # - cassandra.yaml: Contains configuration defaults for a "compatible"
> #   configuration that operates using settings that are backwards-compatible
> #   and interoperable with machines running older versions of Cassandra.
> #   This version is provided to facilitate pain-free upgrades for existing
> #   users of Cassandra running in production who want to gradually and
> #   carefully introduce new features.
> # - cassandra_latest.yaml: Contains configuration defaults that enable
> #   the latest features of Cassandra, including improved functionality as
> #   well as higher performance. This version is provided for new users of
> #   Cassandra who want to get the most out of their cluster, and for users
> #   evaluating the technology.
> #   To use this version, simply copy this file over cassandra.yaml, or specify
> #   it using the -Dcassandra.config system property, e.g. by running
> # cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
> # /NOTE
> Does this sound sensible? Should we add a pointer to this defaults set
> elsewhere in the documentation?
>
> Regards,
> Branimir
>
>
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Paulo Motta
I share Jacek’s and Stefan’s sentiment about the low value of requiring
precommit j11+j17 tests for all changes.

Perhaps this was needed during j17 stabilization but is no longer required?
Please correct if I’m missing some context.

To have a practical proposal to address this, how about:

1) Define “standard” java version for branch (11 or 17).
2) Define “standard” cassandra.yaml variant for branch (legacy
cassandra.yaml or shiny cassandra_latest.yaml).
3) Require green CI on precommit on standard java version + standard
cassandra.yaml variant.
4) Any known java-related changes require precommit j11 + j17.
5) Any known configuration changes require precommit tests on all
cassandra.yaml variants.
6) All supported java versions + cassandra.yaml variants need to be checked
before a release is proposed, to catch any issue missed during 4) or 5).

For example:
- If j17 is set as “default” java version of the branch cassandra-5.0, then
j11 tests are no longer required for patches that don’t touch java-related
stuff
- if cassandra_latest.yaml becomes the new default configuration for 6.0,
then precommit only needs to be run against that version - prerelease needs
to be run against all cassandra.yaml variants.
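Steps 3-5 could be sketched as a routing function like the following (the path heuristics for "java-related" and "configuration" changes are placeholders; real classification rules would need agreement):

```python
def required_precommit_workflows(changed_files):
    """Widen the pre-commit matrix only when the patch touches
    JVM-related or configuration-related files (steps 4 and 5);
    otherwise run the standard JDK + standard yaml only (step 3)."""
    # Placeholder heuristics, not actual project conventions.
    java_related = any("jvm" in f or f.endswith(".options") for f in changed_files)
    config_related = any(f.endswith(".yaml") or "Config.java" in f
                         for f in changed_files)

    jdks = ("java11", "java17") if java_related else ("java11",)
    yamls = (("cassandra.yaml", "cassandra_latest.yaml")
             if config_related else ("cassandra.yaml",))
    return [f"{jdk}_pre-commit_tests[{yaml}]" for jdk in jdks for yaml in yamls]

# A refactoring-only patch needs just the standard workflow:
print(required_precommit_workflows(["src/java/org/apache/cassandra/db/Foo.java"]))
# ['java11_pre-commit_tests[cassandra.yaml]']
```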

Wdyt?

On Wed, 14 Feb 2024 at 18:25 Štefan Miklošovič 
wrote:

> Jon,
>
> I was mostly referring to Circle CI where we have two pre-commit
> workflows. (just click on anything here
> https://app.circleci.com/pipelines/github/instaclustr/cassandra)
>
> java17_pre-commit_tests
>
> This workflow is compiling & testing everything with Java 17
>
> java11_pre-commit_tests
>
> This workflow is compiling with Java 11 and it contains jobs which are
> also run with Java 11 and another set of jobs which run with Java 17.
>
> The process I have followed so far is that when I want to merge something,
> I am required to formally provide builds for both workflows. Maybe I am
> doing more work than necessary here, but my understanding is that this is
> required.
>
> I think Jacek was also talking about this, and that it is questionable
> what value it brings.
>
>
>
> On Thu, Feb 15, 2024 at 12:13 AM Jon Haddad  wrote:
>
>> Stefan, can you elaborate on what you are proposing?  It's not clear (at
>> least to me) what level of testing you're advocating for.  Dropping testing
>> both on dev branches, every commit, just on release?  In addition, can you
>> elaborate on what is a hassle about it?  It's been a long time since I
>> committed anything but I don't remember 2 JVMs (8 & 11) being a problem.
>>
>> Jon
>>
>>
>>
>> On Wed, Feb 14, 2024 at 2:35 PM Štefan Miklošovič <
>> stefan.mikloso...@gmail.com> wrote:
>>
>>> I agree with Jacek, I don't quite understand why we are running the
>>> pipeline for j17 and j11 every time. I think this should be opt-in. The
>>> majority of the time, we are just refactoring and coding stuff for
>>> Cassandra, where testing it on both JVMs is pointless, and we _know_
>>> that it will be fine on 11 and 17 too because we do not do anything
>>> special. If we find some subsystems where testing on both JVMs is
>>> crucial, we might do that; I just do not remember the last time that
>>> testing on both j17 and j11 uncovered some bug. It seems more like a
>>> hassle.
>>>
>>> We might then test the whole pipeline with a different config in
>>> basically the same time as we currently do.
>>>
>>> On Wed, Feb 14, 2024 at 9:32 PM Jacek Lewandowski <
>>> lewandowski.ja...@gmail.com> wrote:
>>>
 On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:

> When we have failing tests people do not spend the time to figure out
> if their logic caused a regression and merge, making things more unstable…
> so when we merge failing tests that leads to people merging even more
> failing tests...
>
> What's the counter position to this Jacek / Berenguer?
>

 For how long are we going to deceive ourselves? Are we shipping those
 features or not? Perhaps it is also a good opportunity to distinguish
 subsets of tests which make sense to run with a configuration matrix.

 If we don't add those tests to the pre-commit pipeline, "people do not
 spend the time to figure out if their logic caused a regression and merge,
 making things more unstable…"
 I think it is much more valuable to test those various configurations
 rather than test against j11 and j17 separately. I can see really little
 value in doing that.





Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Štefan Miklošovič
Jon,

I was mostly referring to Circle CI where we have two pre-commit workflows.
(just click on anything here
https://app.circleci.com/pipelines/github/instaclustr/cassandra)

java17_pre-commit_tests

This workflow is compiling & testing everything with Java 17

java11_pre-commit_tests

This workflow is compiling with Java 11 and it contains jobs which are also
run with Java 11 and another set of jobs which run with Java 17.

The process I have followed so far is that when I want to merge something,
I am required to formally provide builds for both workflows. Maybe I am
doing more work than necessary here, but my understanding is that this is
required.

I think Jacek was also talking about this, and that it is questionable
what value it brings.



On Thu, Feb 15, 2024 at 12:13 AM Jon Haddad  wrote:

> Stefan, can you elaborate on what you are proposing?  It's not clear (at
> least to me) what level of testing you're advocating for.  Dropping testing
> both on dev branches, every commit, just on release?  In addition, can you
> elaborate on what is a hassle about it?  It's been a long time since I
> committed anything but I don't remember 2 JVMs (8 & 11) being a problem.
>
> Jon
>
>
>
> On Wed, Feb 14, 2024 at 2:35 PM Štefan Miklošovič <
> stefan.mikloso...@gmail.com> wrote:
>
>> I agree with Jacek, I don't quite understand why we are running the
>> pipeline for j17 and j11 every time. I think this should be opt-in. The
>> majority of the time, we are just refactoring and coding stuff for
>> Cassandra, where testing it on both JVMs is pointless, and we _know_ that
>> it will be fine on 11 and 17 too because we do not do anything special.
>> If we find some subsystems where testing on both JVMs is crucial, we
>> might do that; I just do not remember the last time that testing on both
>> j17 and j11 uncovered some bug. It seems more like a hassle.
>>
>> We might then test the whole pipeline with a different config in
>> basically the same time as we currently do.
>>
>> On Wed, Feb 14, 2024 at 9:32 PM Jacek Lewandowski <
>> lewandowski.ja...@gmail.com> wrote:
>>
>>> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>>>
 When we have failing tests people do not spend the time to figure out
 if their logic caused a regression and merge, making things more unstable…
 so when we merge failing tests that leads to people merging even more
 failing tests...

 What's the counter position to this Jacek / Berenguer?

>>>
>>> For how long are we going to deceive ourselves? Are we shipping those
>>> features or not? Perhaps it is also a good opportunity to distinguish
>>> subsets of tests which make sense to run with a configuration matrix.
>>>
>>> If we don't add those tests to the pre-commit pipeline, "people do not
>>> spend the time to figure out if their logic caused a regression and merge,
>>> making things more unstable…"
>>> I think it is much more valuable to test those various configurations
>>> rather than test against j11 and j17 separately. I can see really little
>>> value in doing that.
>>>
>>>
>>>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Ekaterina Dimitrova
>
> I'm ok with breaking trunk CI temporarily as long as failures are tracked
> and triaged/addressed before the next release.


From the ticket, I understand it is meant for 5.0-rc.

I share this sentiment for the release we decide to ship with:

> The failures should block release or we should not advertise we have those
> features at all, and the configuration should be named "experimental"
> rather than "latest".


> Is the community okay with committing the patch before all of these are
> addressed?

If we aim to fix everything before the next release 5.0-rc, we can commit
CASSANDRA-18753 after the fixes are applied. If we are not going to do all
the fixes anytime soon - I prefer to commit and have the failures and the
tickets open. Otherwise, I can guarantee that I, personally, will forget
some of those failures and miss them in time... and I am suspicious I won’t
be the only one :-)

> This version is provided for new users of Cassandra who want to get the
> most out of their cluster and for users evaluating the technology.

From reading this thread, we do not recommend using it straight into
production but to experiment, gain trust, and then use it in production.
Did I get it correctly? We need to confirm what it is and be sure it is
clearly stated in the docs.

Announcing this new yaml file under NEWS.txt features sounds reasonable to
me. Or can we add a new separate section on top of NEWS.txt 5.0, dedicated
only to the announcement of this new configuration file?

Mick and Ekaterina (and everyone really) - any thoughts on what test
> coverage we should commit to for this new configuration? Acknowledging that
> we already have *a lot* of CI that we run.

I do not have an immediate answer. I see there is some proposed CI
configuration in the ticket. As far as I can tell from a quick look, the
suggestion is to replace unit-trie with unit-latest (which also exercises
tries), and the additional new jobs will be Python and Java DTests (no new
upgrade tests).
Off the top of my head - we probably need a cost-benefit analysis, risk
analysis, and a discussion of the tradeoffs - burnt resources vs manpower,
early detection vs late discovery, or even prod issues; experimental vs
production-ready, etc.

Now, this question can have different answers depending on whether this is
an experimental config or we recommend it for production use.

I would expect new features to be enabled in this configuration and all
tests to be run pre-commit with the default and the new YAML files. Is this
a correct assumption? Probably done with a note on the ML.

The question is, do we have enough resources in Jenkins to facilitate all
this testing post-commit?

> I think it is much more valuable to test those various configurations
> rather than test against j11 and j17 separately. I can see really little
> value in doing that.

Excellent point, I was saying for some time that IMHO we can reduce
to running in CI at least pre-commit:
1) Build J11
2) Build J17
3) run tests with build 11 + runtime 11
4) run tests with build 11 and runtime 17.

Technically, that is what we also ship in 5.0 - except for 2), the JDK17
build, but we should not remove that from CI.
Does it make sense to reduce to what I mentioned in 1), 2), 3), 4), and instead add
the suggested jobs with the new configuration from CASSANDRA-18753 in
pre-commit? Please correct me if I am wrong, but I understand that running
with JDK17 tests on the 17 build is experimental in CI, so we can gain
confidence until the release when we will drop 11. No? If that is correct,
I do not see why we run those tests on every pre-commit and not only what
we ship.

Best regards,
Ekaterina

On Wed, 14 Feb 2024 at 17:35, Štefan Miklošovič 
wrote:

> I agree with Jacek, I don't quite understand why we are running the
> pipeline for j17 and j11 every time. I think this should be opt-in. The
> majority of the time, we are just refactoring and coding stuff for
> Cassandra, where testing it on both JVMs is pointless, and we _know_ that
> it will be fine on 11 and 17 too because we do not do anything special. If
> we find some subsystems where testing on both JVMs is crucial, we might do
> that; I just do not remember the last time that testing on both j17 and
> j11 uncovered some bug. It seems more like a hassle.
>
> We might then test the whole pipeline with a different config in basically
> the same time as we currently do.
>
> On Wed, Feb 14, 2024 at 9:32 PM Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>>
>>> When we have failing tests people do not spend the time to figure out if
>>> their logic caused a regression and merge, making things more unstable… so
>>> when we merge failing tests that leads to people merging even more failing
>>> tests...
>>>
>>> What's the counter position to this Jacek / Berenguer?
>>>
>>
>> For how long are we going to deceive ourselves? Are we shipping those
>> features or not? Perhaps it is 

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jon Haddad
Stefan, can you elaborate on what you are proposing?  It's not clear (at
least to me) what level of testing you're advocating for.  Dropping testing
both on dev branches, every commit, just on release?  In addition, can you
elaborate on what is a hassle about it?  It's been a long time since I
committed anything but I don't remember 2 JVMs (8 & 11) being a problem.

Jon



On Wed, Feb 14, 2024 at 2:35 PM Štefan Miklošovič <
stefan.mikloso...@gmail.com> wrote:

> I agree with Jacek, I don't quite understand why we are running the
> pipeline for j17 and j11 every time. I think this should be opt-in. The
> majority of the time, we are just refactoring and coding stuff for
> Cassandra, where testing it on both JVMs is pointless, and we _know_ that
> it will be fine on 11 and 17 too because we do not do anything special. If
> we find some subsystems where testing on both JVMs is crucial, we might do
> that; I just do not remember the last time that testing on both j17 and
> j11 uncovered some bug. It seems more like a hassle.
>
> We might then test the whole pipeline with a different config in basically
> the same time as we currently do.
>
> On Wed, Feb 14, 2024 at 9:32 PM Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>>
>>> When we have failing tests people do not spend the time to figure out if
>>> their logic caused a regression and merge, making things more unstable… so
>>> when we merge failing tests that leads to people merging even more failing
>>> tests...
>>>
>>> What's the counter position to this Jacek / Berenguer?
>>>
>>
>> For how long are we going to deceive ourselves? Are we shipping those
>> features or not? Perhaps it is also a good opportunity to distinguish
>> subsets of tests which make sense to run with a configuration matrix.
>>
>> If we don't add those tests to the pre-commit pipeline, "people do not
>> spend the time to figure out if their logic caused a regression and merge,
>> making things more unstable…"
>> I think it is much more valuable to test those various configurations
>> rather than test against j11 and j17 separately. I can see really little
>> value in doing that.
>>
>>
>>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Štefan Miklošovič
I agree with Jacek, I don't quite understand why we are running the
pipeline for j17 and j11 every time. I think this should be opt-in. The
majority of the time, we are just refactoring and coding stuff for
Cassandra, where testing it on both JVMs is pointless, and we _know_ that
it will be fine on 11 and 17 too because we do not do anything special. If
we find some subsystems where testing on both JVMs is crucial, we might do
that; I just do not remember the last time that testing on both j17 and
j11 uncovered some bug. It seems more like a hassle.

We might then test the whole pipeline with a different config in basically
the same time as we currently do.

On Wed, Feb 14, 2024 at 9:32 PM Jacek Lewandowski <
lewandowski.ja...@gmail.com> wrote:

> On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:
>
>> When we have failing tests people do not spend the time to figure out if
>> their logic caused a regression and merge, making things more unstable… so
>> when we merge failing tests that leads to people merging even more failing
>> tests...
>>
>> What's the counter position to this Jacek / Berenguer?
>>
>
> For how long are we going to deceive ourselves? Are we shipping those
> features or not? Perhaps it is also a good opportunity to distinguish
> subsets of tests which make sense to run with a configuration matrix.
>
> If we don't add those tests to the pre-commit pipeline, "people do not
> spend the time to figure out if their logic caused a regression and merge,
> making things more unstable…"
> I think it is much more valuable to test those various configurations
> rather than test against j11 and j17 separately. I can see really little
> value in doing that.
>
>
>


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jacek Lewandowski
On Wed, 14 Feb 2024 at 17:30, Josh McKenzie wrote:

> When we have failing tests people do not spend the time to figure out if
> their logic caused a regression and merge, making things more unstable… so
> when we merge failing tests that leads to people merging even more failing
> tests...
>
> What's the counter position to this Jacek / Berenguer?
>

For how long are we going to deceive ourselves? Are we shipping those
features or not? Perhaps it is also a good opportunity to distinguish
subsets of tests which make sense to run with a configuration matrix.

If we don't add those tests to the pre-commit pipeline, "people do not
spend the time to figure out if their logic caused a regression and merge,
making things more unstable…"
I think it is much more valuable to test those various configurations
rather than test against j11 and j17 separately. I can see really little
value in doing that.


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jeff Jirsa
1) If there’s an “old compatible default” and “latest recommended settings”, 
when does the value in “old compatible default” get updated? Never? 
2) If there are test failures with the new values, it seems REALLY IMPORTANT to 
make sure those test failures are discovered + fixed IN THE FUTURE TOO. If 
pushing new yaml into a different file makes us less likely to catch the 
failures in the future, it seems like we’re hurting ourselves. Branimir 
mentions this, but how do we ensure that we don’t let this pattern disguise 
future bugs? 





> On Feb 13, 2024, at 8:41 AM, Branimir Lambov  wrote:
> 
> Hi All,
> 
> CASSANDRA-18753 introduces a second set of defaults (in a separate 
> "cassandra_latest.yaml") that enable new features of Cassandra. The objective 
> is two-fold: to be able to test the database in this configuration, and to 
> point potential users that are evaluating the technology to an optimized set 
> of defaults that give a clearer picture of the expected performance of the 
> database for a new user. The objective is to get this configuration into 5.0 
> to have the extra bit of confidence that we are not releasing (and 
> recommending) options that have not gone through thorough CI.
> 
> The implementation has already gone through review, but I'd like to get 
> people's opinion on two things:
> - There are currently a number of test failures when the new options are 
> selected, some of which appear to be genuine problems. Is the community okay 
> with committing the patch before all of these are addressed? This should 
> prevent the introduction of new failures and make sure we don't release 
> before clearing the existing ones.
> - I'd like to get an opinion on what's suitable wording and documentation for 
> the new defaults set. Currently, the patch proposes adding the following text 
> to the yaml (see https://github.com/apache/cassandra/pull/2896/files):
> # NOTE:
> #   This file is provided in two versions:
> #     - cassandra.yaml: Contains configuration defaults for a "compatible"
> #       configuration that operates using settings that are backwards-compatible
> #       and interoperable with machines running older versions of Cassandra.
> #       This version is provided to facilitate pain-free upgrades for existing
> #       users of Cassandra running in production who want to gradually and
> #       carefully introduce new features.
> #     - cassandra_latest.yaml: Contains configuration defaults that enable
> #       the latest features of Cassandra, including improved functionality as
> #       well as higher performance. This version is provided for new users of
> #       Cassandra who want to get the most out of their cluster, and for users
> #       evaluating the technology.
> #       To use this version, simply copy this file over cassandra.yaml, or specify
> #       it using the -Dcassandra.config system property, e.g. by running
> #         cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
> # /NOTE
> Does this sound sensible? Should we add a pointer to this defaults set 
> elsewhere in the documentation?
> 
> Regards,
> Branimir
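The NOTE's final step can be sketched as a short shell session. This is an illustrative sketch only: the install path `/opt/cassandra` is an assumption (adjust to your layout), and the URL form is copied verbatim from the NOTE.

```shell
# Assumed install location - not taken from the thread; adjust as needed.
CASSANDRA_HOME="/opt/cassandra"

# Option 1: adopt the latest defaults by copying the file over cassandra.yaml
# (keep a backup of the compatible defaults first):
#   cp "$CASSANDRA_HOME/conf/cassandra.yaml" "$CASSANDRA_HOME/conf/cassandra.yaml.bak"
#   cp "$CASSANDRA_HOME/conf/cassandra_latest.yaml" "$CASSANDRA_HOME/conf/cassandra.yaml"

# Option 2: leave cassandra.yaml untouched and point the JVM at the alternate
# file via the cassandra.config system property, as the NOTE describes:
CONFIG_URL="file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml"
echo "cassandra -Dcassandra.config=$CONFIG_URL"
```

Option 2 is the less invasive choice for evaluation, since the compatible defaults stay in place for any tooling that reads cassandra.yaml directly.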



Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Paulo Motta
Cool stuff! This will make it easier to advance configuration defaults
without affecting stable configuration.

Wording looks good to me. +1 to include a NEWS.txt note. I'm ok with
breaking trunk CI temporarily as long as failures are tracked and
triaged/addressed before the next release.

I haven't had the chance to look into CASSANDRA-18753 yet so apologies if
this was already discussed but I have the following questions about
handling 2 configuration files moving forward:
1) Will cassandra.yaml remain the default test config? Is the plan moving
forward to require green CI for both configurations on pre-commit, or
pre-release?
2) What will this mean for the release artifact, is the idea to continue
shipping with the current cassandra.yaml or eventually switch to the
optimized configuration (ie. 6.X) while making the legacy default
configuration available via an optional flag?



Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Josh McKenzie
> When we have failing tests people do not spend the time to figure out if 
> their logic caused a regression and merge, making things more unstable… so 
> when we merge failing tests that leads to people merging even more failing 
> tests...
What's the counter position to this Jacek / Berenguer?

Mick and Ekaterina (and everyone really) - any thoughts on what test coverage, 
if any, we should commit to for this new configuration? Acknowledging that we 
already have *a lot* of CI that we run.


On Wed, Feb 14, 2024, at 5:11 AM, Berenguer Blasi wrote:
> +1 to not doing, imo, the ostrich lol
> 
> On 14/2/24 10:58, Jacek Lewandowski wrote:
>> We should not block merging configuration changes as long as it is a valid 
>> configuration - which I understand to mean it is correct, passes all config 
>> validations, matches the documented rules, etc. I assume the provided latest 
>> config meets those requirements.
>> 
>> The failures should block release or we should not advertise we have those 
>> features at all, and the configuration should be named "experimental" rather 
>> than "latest".
>> 
>> The config changes are not responsible for broken features and we should not 
>> bury our heads in the sand pretending that everything is ok.
>> 
>> Thanks,
>> 


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Berenguer Blasi

+1 to not doing, imo, the ostrich lol

On 14/2/24 10:58, Jacek Lewandowski wrote:
We should not block merging configuration changes as long as it is a valid 
configuration - which I understand to mean it is correct, passes all config 
validations, matches the documented rules, etc. I assume the provided 
latest config meets those requirements.


The failures should block release or we should not advertise we have 
those features at all, and the configuration should be named 
"experimental" rather than "latest".


The config changes are not responsible for broken features and we 
should not bury our heads in the sand pretending that everything is ok.


Thanks,



Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Jacek Lewandowski
We should not block merging configuration changes as long as it is a valid
configuration - which I understand to mean it is correct, passes all config
validations, matches the documented rules, etc. I assume the provided latest
config meets those requirements.

The failures should block release or we should not advertise we have those
features at all, and the configuration should be named "experimental"
rather than "latest".

The config changes are not responsible for broken features and we should
not bury our heads in the sand pretending that everything is ok.

Thanks,

śr., 14 lut 2024, 10:47 użytkownik Štefan Miklošovič <
stefan.mikloso...@gmail.com> napisał:

> Wording looks good to me. I would also put that into NEWS.txt but I am not
> sure what section. New features, Upgrading nor Deprecation does not seem to
> be a good category.


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Štefan Miklošovič
Wording looks good to me. I would also put that into NEWS.txt but I am not
sure what section. New features, Upgrading nor Deprecation does not seem to
be a good category.



Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Branimir Lambov
> is there a reason all guardrails and reliability (aka repair retries)
> configs are off by default?  They are off by default in the normal config
> for backwards compatibility reasons, but if we are defining a config saying
> what we recommend, we should enable these things by default IMO.

This is one more question to be answered by this discussion. Are there
other options that should be enabled by the "latest" configuration? To what
values should they be set?
Is there something that is currently enabled that should not be?

> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing tests...

In this case this also means that people will not see at all failures that
they introduce in any of the advanced features, as they are not tested at
all. Also, since CASSANDRA-19167 and 19168 already have fixes, the
non-latest test suite will remain clean after merge. Note that these two
problems demonstrate that we have failures in the configuration we ship
with, because we are not actually testing it at all. IMHO this is a problem
that we should not delay fixing.

Regards,
Branimir

On Wed, Feb 14, 2024 at 1:07 AM David Capwell  wrote:

> so can cause repairs to deadlock forever
>
>
> Small correction, I finished fixing the tests in CASSANDRA-19042 and we
> don't deadlock; we time out and fail the repair if any of those messages
> are dropped.
>
> On Feb 13, 2024, at 11:04 AM, David Capwell  wrote:
>
> and to point potential users that are evaluating the technology to an
> optimized set of defaults
>
>
> Left this comment in the GH… is there a reason all guardrails and
> reliability (aka repair retries) configs are off by default?  They are
> off by default in the normal config for backwards compatibility reasons,
> but if we are defining a config saying what we recommend, we should enable
> these things by default IMO.
>
> There are currently a number of test failures when the new options are
> selected, some of which appear to be genuine problems. Is the community
> okay with committing the patch before all of these are addressed?
>
>
> I was tagged on CASSANDRA-19042, the paxos repair message handling does
> not have the repair reliability improvements that 5.0 has, so it can cause
> repairs to deadlock forever (same as current 4.x repairs).  Bringing these
> up to par with the rest of repair would be very much welcome (they are also
> lacking visibility, so you need to fall back to heap dumps to see what’s going
> on; same as 4.0.x but not 4.1.x), but I doubt I have cycles to do that….
> This refactor is not 100% trivial as it has fun subtle concurrency issues
> to address (message retries and deduping), and making sure this logic
> works with the existing repair simulation tests does require refactoring
> how the paxos cleanup state is tracked, which could have subtle consequences.
>
> I do think this should be fixed, but should it block 5.0?  Not sure… will
> leave to others….
>
> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing tests...

Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-13 Thread David Capwell
> so can cause repairs to deadlock forever

Small correction, I finished fixing the tests in CASSANDRA-19042 and we don’t
deadlock; we time out and fail the repair if any of those messages are dropped.




Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-13 Thread David Capwell
> and to point potential users that are evaluating the technology to an 
> optimized set of defaults

Left this comment in the GH… is there a reason all guardrails and reliability 
(aka repair retries) configs are off by default?  They are off by default in 
the normal config for backwards compatibility reasons, but if we are defining a 
config saying what we recommend, we should enable these things by default IMO.

> There are currently a number of test failures when the new options are 
> selected, some of which appear to be genuine problems. Is the community okay 
> with committing the patch before all of these are addressed?

I was tagged on CASSANDRA-19042, the paxos repair message handling does not have
the repair reliability improvements that 5.0 has, so it can cause repairs to
deadlock forever (same as current 4.x repairs).  Bringing these up to par with
the rest of repair would be very much welcome (they are also lacking
visibility, so you need to fall back to heap dumps to see what’s going on; same as
4.0.x but not 4.1.x), but I doubt I have cycles to do that…. This refactor is
not 100% trivial as it has fun subtle concurrency issues to address (message
retries and deduping), and making sure this logic works with the existing
repair simulation tests does require refactoring how the paxos cleanup state is
tracked, which could have subtle consequences.

I do think this should be fixed, but should it block 5.0?  Not sure… will leave 
to others….

Should we merge the configs breaking these tests?  No…. When we have failing 
tests people do not spend the time to figure out if their logic caused a 
regression and merge, making things more unstable… so when we merge failing 
tests that leads to people merging even more failing tests...

> On Feb 13, 2024, at 8:41 AM, Branimir Lambov  wrote:
> 
> Hi All,
> 
> CASSANDRA-18753 introduces a second set of defaults (in a separate 
> "cassandra_latest.yaml") that enable new features of Cassandra. The objective 
> is two-fold: to be able to test the database in this configuration, and to 
> point potential users that are evaluating the technology to an optimized set 
> of defaults that give a clearer picture of the expected performance of the 
> database for a new user. The objective is to get this configuration into 5.0 
> to have the extra bit of confidence that we are not releasing (and 
> recommending) options that have not gone through thorough CI.
> 
> The implementation has already gone through review, but I'd like to get 
> people's opinion on two things:
> - There are currently a number of test failures when the new options are 
> selected, some of which appear to be genuine problems. Is the community okay 
> with committing the patch before all of these are addressed? This should 
> prevent the introduction of new failures and make sure we don't release 
> before clearing the existing ones.
> - I'd like to get an opinion on what's suitable wording and documentation for 
> the new defaults set. Currently, the patch proposes adding the following text 
> to the yaml (see https://github.com/apache/cassandra/pull/2896/files):
> # NOTE:
> #   This file is provided in two versions:
> # - cassandra.yaml: Contains configuration defaults for a "compatible"
> #   configuration that operates using settings that are backwards-compatible
> #   and interoperable with machines running older versions of Cassandra.
> #   This version is provided to facilitate pain-free upgrades for existing
> #   users of Cassandra running in production who want to gradually and
> #   carefully introduce new features.
> # - cassandra_latest.yaml: Contains configuration defaults that enable
> #   the latest features of Cassandra, including improved functionality as
> #   well as higher performance. This version is provided for new users of
> #   Cassandra who want to get the most out of their cluster, and for users
> #   evaluating the technology.
> #   To use this version, simply copy this file over cassandra.yaml, or specify
> #   it using the -Dcassandra.config system property, e.g. by running
> # cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
> # /NOTE
> Does this sound sensible? Should we add a pointer to this defaults set 
> elsewhere in the documentation?
> 
> Regards,
> Branimir



[Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-13 Thread Branimir Lambov
Hi All,

CASSANDRA-18753 introduces a second set of defaults (in a separate
"cassandra_latest.yaml") that enable new features of Cassandra. The
objective is two-fold: to be able to test the database in this
configuration, and to point potential users that are evaluating the
technology to an optimized set of defaults that give a clearer picture of
the expected performance of the database for a new user. The objective is
to get this configuration into 5.0 to have the extra bit of confidence that
we are not releasing (and recommending) options that have not gone through
thorough CI.

The implementation has already gone through review, but I'd like to get
people's opinion on two things:
- There are currently a number of test failures when the new options are
selected, some of which appear to be genuine problems. Is the community
okay with committing the patch before all of these are addressed? This
should prevent the introduction of new failures and make sure we don't
release before clearing the existing ones.
- I'd like to get an opinion on what's suitable wording and documentation
for the new defaults set. Currently, the patch proposes adding the
following text to the yaml (see
https://github.com/apache/cassandra/pull/2896/files):
# NOTE:
#   This file is provided in two versions:
# - cassandra.yaml: Contains configuration defaults for a "compatible"
#   configuration that operates using settings that are backwards-compatible
#   and interoperable with machines running older versions of Cassandra.
#   This version is provided to facilitate pain-free upgrades for existing
#   users of Cassandra running in production who want to gradually and
#   carefully introduce new features.
# - cassandra_latest.yaml: Contains configuration defaults that enable
#   the latest features of Cassandra, including improved functionality as
#   well as higher performance. This version is provided for new users of
#   Cassandra who want to get the most out of their cluster, and for users
#   evaluating the technology.
#   To use this version, simply copy this file over cassandra.yaml, or specify
#   it using the -Dcassandra.config system property, e.g. by running
# cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
# /NOTE
Does this sound sensible? Should we add a pointer to this defaults set
elsewhere in the documentation?
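
For what it's worth, the selection mechanism the NOTE describes can be 
sketched as a simple system-property lookup with a fallback to the shipped 
default. The class name and default path below are illustrative assumptions, 
not Cassandra's actual config-loading code:

```java
// Illustrative sketch: resolving which yaml a node will load, assuming the
// standard -Dcassandra.config system property described in the NOTE above.
public class ConfigLocator {
    static final String DEFAULT = "conf/cassandra.yaml";

    // Falls back to the shipped cassandra.yaml when no override is given.
    public static String resolve() {
        return System.getProperty("cassandra.config", DEFAULT);
    }
}
```

So starting the node with 
-Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml selects 
the latest defaults without copying any files, while omitting the property 
keeps the compatible defaults.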

Regards,
Branimir