We should conclude this discussion by answering Branimir's original question. I vote for merging that and exposing the issues in CI.
For pre-commit optimization I've opened the CASSANDRA-19406 epic (https://issues.apache.org/jira/browse/CASSANDRA-19406), and we should add specific tasks there so that this valuable discussion results in some concrete actions. Then we can discuss each task in a more organized way.

On Thu, 15 Feb 2024 at 21:29, Štefan Miklošovič <stefan.mikloso...@gmail.com> wrote:

> I love David's idea of making this dual config stuff directly part of the tests. I'll just leave this here, where I quickly put a super primitive runner together:
>
> https://github.com/smiklosovic/cassandra/commit/693803772218b52c424491b826c704811d606a31
>
> We could just run with one config by default and annotate a test with all configs if we think it is crucial to test it in both scenarios.
>
> Anyway, happy to expand on this, but I do not want to block any progress in the ticket; it might come afterwards, just showing what is possible.
>
> On Thu, Feb 15, 2024 at 7:59 PM David Capwell <dcapw...@apple.com> wrote:
>
>> This thread got large quick, yay!
>>
>> is there a reason all guardrails and reliability (aka repair retries) configs are off by default? They are off by default in the normal config for backwards compatibility reasons, but if we are defining a config saying what we recommend, we should enable these things by default IMO.
>>
>> This is one more question to be answered by this discussion. Are there other options that should be enabled by the "latest" configuration? To what values should they be set? Is there something that is currently enabled that should not be?
>>
>> Very likely, we should try to figure that out. We should also answer how conservative we want to be by default. There are many configs we need to flesh out here; glad to help with the configs I authored (prob best for JIRA rather than this thread).
>>
>> Should we merge the configs breaking these tests? No…. When we have failing tests people do not spend the time to figure out if their logic caused a regression and merge, making things more unstable… so when we merge failing tests that leads to people merging even more failing tests...
>>
>> In this case this also means that people will not see at all the failures that they introduce in any of the advanced features, as they are not tested at all. Also, since CASSANDRA-19167 and 19168 already have fixes, the non-latest test suite will remain clean after merge. Note that these two problems demonstrate that we have failures in the configuration we ship with, because we are not actually testing it at all. IMHO this is a problem that we should not delay fixing.
>>
>> I am not arguing we should not get this into CI, but rather that we should fix the known issues before getting it into CI… it's what we normally do, I don't see a reason to special-case this work.
>>
>> I am 100% cool with blocking 5.0 on these bugs found (even if they are test failures), but I don't feel we should enable this in CI until those issues are resolved; we can add the yaml now, but not the CI pipelines.
>>
>> 1) If there's an "old compatible default" and "latest recommended settings", when does the value in "old compatible default" get updated? Never?
>>
>> How about replacing cassandra.yaml with cassandra_latest.yaml on trunk when cutting the cassandra-6.0 branch? Any new default changes on trunk go to cassandra_latest.yaml.
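A minimal sketch of the "annotate it with all configs" idea quoted above, assuming plain JUnit 4 parameterization rather than the runner in the linked commit; the class name is made up and the config handling is simplified to a single property set inside the test:

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

import java.util.Arrays;
import java.util.Collection;

@RunWith(Parameterized.class)
public class DualConfigExampleTest
{
    // Tests that only care about one scenario would list a single yaml here;
    // tests that must cover both scenarios list both files.
    @Parameters(name = "config={0}")
    public static Collection<Object[]> configs()
    {
        return Arrays.asList(new Object[][]{ { "cassandra.yaml" }, { "cassandra_latest.yaml" } });
    }

    private final String configFile;

    public DualConfigExampleTest(String configFile)
    {
        this.configFile = configFile;
    }

    @Test
    public void featureWorksUnderEitherConfig()
    {
        // Point the fixture at the chosen yaml before exercising the feature;
        // a real harness would have to apply this before the node starts up.
        System.setProperty("cassandra.config", configFile);
        // ... feature assertions would go here ...
    }
}

Run this way, the IDE shows one result per config, which is the author/maintainer experience argued for further down the thread.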
>> I feel it's dangerous to define this at the file level; we should do it at the config level… I personally see us adding new features disabled by default in cassandra.yaml and the recommended values in cassandra_latest.yaml… If I add a config in 5.1.2 should it get enabled by default in 6.0? I don't feel that's wise.
>>
>> Maybe it makes sense to annotate the configs with the target version for the default change?
>>
>> Let's distinguish the packages of tests that need to be run with CDC enabled / disabled, with commitlog compression enabled / disabled, tests that verify sstable formats (mostly io and index I guess), and leave other parameters set as with the latest configuration - this is the easiest way I think.
>>
>> Yes please! I really hate having a pipeline per config; we should annotate this somehow in the tests that matter… junit can param the tests for us so we cover the different configs the test supports… I have written many tests that are costly and run on all these other pipelines but have 0 change in the config… just wasting resources rerunning…
>>
>> Pushing this to the test also is a better author/maintainer experience… running the test in your IDE and seeing all the params and their results is so much better than monkeying around with yaml files and ant…. My repair simulation tests have a hack flag to try to switch the yaml to make it easier to test against the other configs and I loathe it so much…
>>
>> To me running no-vnodes makes no sense because no-vnodes is just a special case of vnodes=1
>>
>> And to many in the community the only config for production =). In this debate we have 3 camps: no-vnode, vnode <= 4 tokens, vnode > 4 tokens (I am simplifying….)… For those in the no-vnode camp their tests are focused on this case and get disabled when vnodes are enabled (so not running this config lowers coverage).
>>
>> I don't think we are going to solve this debate in this thread, but if we can push this to the test to run as a param I think that's best… avoid having 2 pipelines and push this to the tests that actually support both configs...
>>
>> On Feb 15, 2024, at 10:20 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>>
>> For the sake of argument, if we picked one, would you (anyone, not just Stefan) be OK with the JVM that's selected being the one you don't use? I'm willing to bet most folks will have a preference for the JVM they run in production. I think both should be run on some frequent basis, but I can definitely see the reasoning behind not wanting it to block folks on work; it sounds like a lot of wasted days waiting on CI, especially during a bigger multi-cycle review.
>>
>> I suppose that it wouldn't necessarily need to be consistent - for example some folks might use 17 and others 11. If this was the route the project goes down, it seems like it would be reasonable for someone to run whichever JVM version they felt like. Hopefully at least a few regular committers would run 17, and that might be enough.
>>
>> Maybe instead of running the full suite post-commit, it could be run periodically, like once a night, or if it's longer than 24h, run once a week. If both JVMs get hit b/c different teams opt to use different versions, maybe it ends up being enough coverage.
>>
>> I'm curious if anyone can think of an issue that affected one JVM and not others, because that would probably help determine the usefulness of 2 test suites.
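A rough sketch of pushing the vnodes choice into the test itself rather than into a separate pipeline, assuming a made-up system property and a plain JUnit 4 Assume-based skip; the real wiring would belong in the shared test infrastructure:

import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class VnodeAwareExampleTest
{
    // Supplied by whatever drives the run (CI matrix, IDE run configuration);
    // the property name here is invented for this sketch.
    private static final int NUM_TOKENS = Integer.getInteger("test.num_tokens", 1);

    @Before
    public void skipWhenConfigNotSupported()
    {
        // Instead of a dedicated no-vnodes pipeline, a test that is only
        // meaningful for single-token nodes skips itself when vnodes are in play.
        Assume.assumeTrue("requires single-token nodes", NUM_TOKENS == 1);
    }

    @Test
    public void singleTokenOnlyBehaviour()
    {
        // ... assertions that only make sense for the no-vnodes case ...
    }
}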
>> On Thu, Feb 15, 2024 at 10:01 AM Štefan Miklošovič <stefan.mikloso...@gmail.com> wrote:
>>
>>> Only requiring building on supported JDKs and running all tests only on a pre-defined version is definitely an improvement in terms of build time. Building it is cheap, one worker and 5 minutes.
>>>
>>> As I already said, just want to reiterate that, instead of _running with all Javas_ we might run with one Java version; we would just change it for runs of the two yamls (default and latest).
>>>
>>> However, this would put more stress on Jenkins based on what you described in the Post-commit point. Just saying it aloud.
>>>
>>> On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>
>>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit?
>>>>
>>>> Lots and lots of other emails
>>>>
>>>> ;)
>>>>
>>>> There's an interesting broad question of: What config do we consider "recommended" going forward, the "conservative" (i.e. old) or the "performant" (i.e. new)? And what JDK do we consider "recommended" going forward, the oldest we support or the newest?
>>>>
>>>> Since those recommendations apply to new clusters, people need to qualify their setups, and we have a high bar of quality on pre-merge testing, my gut tells me "performant + newest JDK". This would impact what we'd test pre-commit IMO.
>>>>
>>>> Having been doing a lot of CI stuff lately, some observations:
>>>>
>>>> - Our True North needs to be releasing a database that's free of defects that violate the core properties we commit to our users. No data loss, no data resurrection, transient or otherwise, due to defects in our code (meteors, tsunamis, etc notwithstanding).
>>>> - The relationship between time spent on CI and the stability of final full *post-commit* runs is asymptotic. It's not even 90/10; we're probably somewhere like 98% of the value gained from 10% of the work, and the other 2% of "stability" (i.e. green test suites, not "our database works") is a long-tail slog. Especially in the current ASF CI heterogeneous env w/its current orchestration.
>>>> - Thus: Pre-commit and post-commit should be different. The following points all apply to pre-commit:
>>>>   - The goal of pre-commit tests should be some number of 9's of no test failures post-commit (i.e. for every 20 green pre-commit runs we introduce 1 flake post-commit). Not full perfection; it's not worth the compute and complexity.
>>>>   - We should *build* all branches on all supported JDKs (8 + 11 for older, 11 + 17 for newer, etc).
>>>>   - We should *run* all test suites with the *recommended configuration* against the *highest versioned JDK a branch supports*. And we should formally recommend our users run on that JDK.
>>>>   - We should *at least* run all jvm-based configurations on the highest supported JDK version with the "not recommended but still supported" configuration.
>>>>   - I'm open to being persuaded that we should at least run jvm-unit tests on the older JDK w/the conservative config pre-commit, but not much beyond that.
>>>> That would leave us with the following distilled:
>>>>
>>>> *Pre-commit:*
>>>>
>>>> - Build on all supported JDKs
>>>> - All test suites on highest supported JDK using recommended config
>>>> - Repeat testing on new or changed tests on highest supported JDK w/recommended config
>>>> - JDK-based test suites on highest supported JDK using other config
>>>>
>>>> *Post-commit:*
>>>>
>>>> - Run everything. All suites, all supported JDKs, both config files.
>>>>
>>>> With Butler + the *jenkins-jira* integration script <https://github.com/apache/cassandra-builds/blob/trunk/jenkins-jira-integration/jenkins_jira_integration.py> (need to dust that off but it should remain good to go), we should have a pretty clear view as to when any consistent regressions are introduced and why. We'd remain exposed to JDK-specific flake introductions and flakes in unchanged tests, but there's no getting around the 2nd one and I expect the former to be rare enough to not warrant the compute to prevent it.
>>>>
>>>> On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
>>>>
>>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit? If there really is only a small % of times the entire suite is useful, this seems like it could unblock the dev cycle but still have the benefit of the full test suite.
>>>>
>>>> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>>
>>>> On reducing CircleCI usage during dev while iterating, not with the intention of replacing the pre-commit CI (yet), we could get by with testing only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo. That would greatly reduce usage. I hacked it quickly here for illustration purposes: https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1 The good thing is that we have the tooling to dial in whatever we decide atm.
>>>>
>>>> Changing pre-commit is a different discussion, to which I agree btw. But the above could save time and $ big time during dev, and be done and merged in a matter of days imo.
>>>>
>>>> I can open a DISCUSS thread if we feel it's worth it.
>>>>
>>>> On 15/2/24 10:24, Mick Semb Wever wrote:
>>>>
>>>> Mick and Ekaterina (and everyone really) - any thoughts on what test coverage, if any, we should commit to for this new configuration? Acknowledging that we already have *a lot* of CI that we run.
>>>>
>>>> Branimir in this patch has already done some basic cleanup of test variations, so this is not a duplication of the pipeline. It's a significant improvement.
>>>>
>>>> I'm ok with cassandra_latest being committed and added to the pipeline, *if* the authors genuinely believe there's significant time and effort saved in doing so.
>>>>
>>>> How many broken tests are we talking about?
>>>> Are they consistently broken or flaky?
>>>> Are they ticketed up and 5.0-rc blockers?
>>>>
>>>> Having to deal with flakies and broken tests is an unfortunate reality of having a pipeline of 170k tests.
>>>> Despite real frustrations I don't believe the broken windows analogy is appropriate here; it's more of a "leave the campground cleaner" situation… That being said, knowingly introducing a few broken tests is not that either, but still, having to deal with a handful of consistently breaking tests for a short period of time is not the same cognitive burden as flakies. There are currently other broken tests in 5.0: VectorUpdateDeleteTest, upgrade_through_versions_test; are these compounding the frustrations?
>>>>
>>>> It's also been asked why we don't just enable the settings we recommend. These are settings we recommend for new clusters. Our existing cassandra.yaml needs to be tailored for existing clusters being upgraded, where we are very conservative about changing defaults.