We should conclude this discussion by answering Branimir's original question. I vote for merging that and exposing the issues in CI.
For pre-commit optimization I've opened the CASSANDRA-19406 epic (https://issues.apache.org/jira/browse/CASSANDRA-19406), and we should add specific tasks there so that this valuable discussion results in some concrete actions. Then we can discuss each task in a more organized way.

On Thu, 15 Feb 2024 at 21:29, Štefan Miklošovič <stefan.mikloso...@gmail.com> wrote:

> I love David's idea of making this dual config stuff directly part of the tests. I'll just leave this here, where I quickly put a super primitive runner together:
>
> https://github.com/smiklosovic/cassandra/commit/693803772218b52c424491b826c704811d606a31
>
> We could just run with one config by default and annotate a test with all configs if we think it is crucial to test it in both scenarios.
>
> Anyway, happy to expand on this, but I do not want to block any progress in the ticket; it might come afterwards, just showing what is possible.
>
> On Thu, Feb 15, 2024 at 7:59 PM David Capwell <dcapw...@apple.com> wrote:
>
>> This thread got large quick, yay!
>>
>> is there a reason all guardrails and reliability (aka repair retries) configs are off by default? They are off by default in the normal config for backwards compatibility reasons, but if we are defining a config saying what we recommend, we should enable these things by default IMO.
>>
>> This is one more question to be answered by this discussion. Are there other options that should be enabled by the "latest" configuration? To what values should they be set? Is there something that is currently enabled that should not be?
>>
>> Very likely, we should try to figure that out. We should also answer how conservative we want to be by default. There are many configs we need to flesh out here; glad to help with the configs I authored (prob best for JIRA rather than this thread).
>>
>> Should we merge the configs breaking these tests? No…. When we have failing tests people do not spend the time to figure out if their logic caused a regression and merge, making things more unstable… so when we merge failing tests that leads to people merging even more failing tests...
>>
>> In this case this also means that people will not see at all the failures that they introduce in any of the advanced features, as they are not tested at all. Also, since CASSANDRA-19167 and 19168 already have fixes, the non-latest test suite will remain clean after merge. Note that these two problems demonstrate that we have failures in the configuration we ship with, because we are not actually testing it at all. IMHO this is a problem that we should not delay fixing.
>>
>> I am not arguing we should not get this into CI, but rather that we should fix the known issues before getting it into CI… it's what we normally do, I don't see a reason to special-case this work.
>>
>> I am 100% cool with blocking 5.0 on these bugs found (even if they are test failures), but I don't feel we should enable this in CI until those issues are resolved; we can add the yaml now, but not the CI pipelines.
>>
>> 1) If there's an "old compatible default" and "latest recommended settings", when does the value in "old compatible default" get updated? Never?
>>
>> How about replacing cassandra.yaml with cassandra_latest.yaml on trunk when cutting the cassandra-6.0 branch? Any new default changes on trunk go to cassandra_latest.yaml.
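A minimal sketch of the "annotate it with all configs" idea quoted above, assuming plain JUnit 4 parameterization rather than the runner in the linked commit; the class name is made up and the config handling is simplified to a single property set inside the test:

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

import java.util.Arrays;
import java.util.Collection;

@RunWith(Parameterized.class)
public class DualConfigExampleTest
{
    // Tests that only care about one scenario would list a single yaml here;
    // tests that must cover both scenarios list both files.
    @Parameters(name = "config={0}")
    public static Collection<Object[]> configs()
    {
        return Arrays.asList(new Object[][]{ { "cassandra.yaml" }, { "cassandra_latest.yaml" } });
    }

    private final String configFile;

    public DualConfigExampleTest(String configFile)
    {
        this.configFile = configFile;
    }

    @Test
    public void featureWorksUnderEitherConfig()
    {
        // Point the fixture at the chosen yaml before exercising the feature;
        // a real harness would have to apply this before the node starts up.
        System.setProperty("cassandra.config", configFile);
        // ... feature assertions would go here ...
    }
}

Run this way, the IDE shows one result per config, which is the author/maintainer experience argued for further down the thread.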
>> I feel it's dangerous to define this at the file level; we should do it at the config level… I personally see us adding new features disabled by default in cassandra.yaml and the recommended values in cassandra_latest.yaml… If I add a config in 5.1.2 should it get enabled by default in 6.0? I don't feel that's wise.
>>
>> Maybe it makes sense to annotate the configs with the target version for the default change?
>>
>> Let's distinguish the packages of tests that need to be run with CDC enabled / disabled, with commitlog compression enabled / disabled, tests that verify sstable formats (mostly io and index I guess), and leave other parameters set as with the latest configuration - this is the easiest way I think.
>>
>> Yes please! I really hate having a pipeline per config; we should annotate this somehow in the tests that matter… junit can param the tests for us so we cover the different configs the test supports… I have written many tests that are costly and run on all these other pipelines but have 0 change in the config… just wasting resources rerunning…
>>
>> Pushing this to the test also is a better author/maintainer experience… running the test in your IDE and seeing all the params and their results is so much better than monkeying around with yaml files and ant…. My repair simulation tests have a hack flag to try to switch the yaml to make it easier to test against the other configs and I loathe it so much…
>>
>> To me running no-vnodes makes no sense because no-vnodes is just a special case of vnodes=1
>>
>> And to many in the community the only config for production =). In this debate we have 3 camps: no-vnode, vnode <= 4 tokens, vnode > 4 tokens (I am simplifying….)… For those in the no-vnode camp their tests are focused on this case and get disabled when vnodes are enabled (so not running this config lowers coverage).
>>
>> I don't think we are going to solve this debate in this thread, but if we can push this to the test to run as a param I think that's best… avoid having 2 pipelines and push this to the tests that actually support both configs...
>>
>> On Feb 15, 2024, at 10:20 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>>
>> For the sake of argument, if we picked one, would you (anyone, not just Stefan) be OK with the JVM that's selected being the one you don't use? I'm willing to bet most folks will have a preference for the JVM they run in production. I think both should be run on some frequent basis, but I can definitely see the reasoning behind not wanting it to block folks on work; it sounds like a lot of wasted days waiting on CI, especially during a bigger multi-cycle review.
>>
>> I suppose that it wouldn't necessarily need to be consistent - for example some folks might use 17 and others 11. If this was the route the project goes down, it seems like it would be reasonable for someone to run whichever JVM version they felt like. Hopefully at least a few regular committers would run 17, and that might be enough.
>>
>> Maybe instead of running the full suite post-commit, it could be run periodically, like once a night, or if it's longer than 24h, run once a week. If both JVMs get hit b/c different teams opt to use different versions, maybe it ends up being enough coverage.
>>
>> I'm curious if anyone can think of an issue that affected one JVM and not others, because that would probably help determine the usefulness of 2 test suites.
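A rough sketch of pushing the vnodes choice into the test itself rather than into a separate pipeline, assuming a made-up system property and a plain JUnit 4 Assume-based skip; the real wiring would belong in the shared test infrastructure:

import org.junit.Assume;
import org.junit.Before;
import org.junit.Test;

public class VnodeAwareExampleTest
{
    // Supplied by whatever drives the run (CI matrix, IDE run configuration);
    // the property name here is invented for this sketch.
    private static final int NUM_TOKENS = Integer.getInteger("test.num_tokens", 1);

    @Before
    public void skipWhenConfigNotSupported()
    {
        // Instead of a dedicated no-vnodes pipeline, a test that is only
        // meaningful for single-token nodes skips itself when vnodes are in play.
        Assume.assumeTrue("requires single-token nodes", NUM_TOKENS == 1);
    }

    @Test
    public void singleTokenOnlyBehaviour()
    {
        // ... assertions that only make sense for the no-vnodes case ...
    }
}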
>> On Thu, Feb 15, 2024 at 10:01 AM Štefan Miklošovič <stefan.mikloso...@gmail.com> wrote:
>>
>>> Only requiring building on supported JDKs and running all tests only on a pre-defined version is definitely an improvement in terms of build time. Building it is cheap, one worker and 5 minutes.
>>>
>>> As I already said, just want to reiterate that, instead of _running with all Javas_ we might run with one Java version; we would just change it for runs of the two yamls (default and latest).
>>>
>>> However, this would put more stress on Jenkins based on what you described in the Post-commit point. Just saying it aloud.
>>>
>>> On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>>
>>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit?
>>>>
>>>> Lots and lots of other emails
>>>>
>>>> ;)
>>>>
>>>> There's an interesting broad question of: What config do we consider "recommended" going forward, the "conservative" (i.e. old) or the "performant" (i.e. new)? And what JDK do we consider "recommended" going forward, the oldest we support or the newest?
>>>>
>>>> Since those recommendations apply to new clusters, people need to qualify their setups, and we have a high bar of quality on pre-merge testing, my gut tells me "performant + newest JDK". This would impact what we'd test pre-commit IMO.
>>>>
>>>> Having been doing a lot of CI stuff lately, some observations:
>>>>
>>>> - Our True North needs to be releasing a database that's free of defects that violate the core properties we commit to our users. No data loss, no data resurrection, transient or otherwise, due to defects in our code (meteors, tsunamis, etc notwithstanding).
>>>> - The relationship between time spent on CI and the stability of final full *post-commit* runs is asymptotic. It's not even 90/10; we're probably somewhere like 98% of the value gained from 10% of the work, and the other 2% of "stability" (i.e. green test suites, not "our database works") is a long-tail slog. Especially in the current ASF CI heterogeneous env w/its current orchestration.
>>>> - Thus: Pre-commit and post-commit should be different. The following points all apply to pre-commit:
>>>>   - The goal of pre-commit tests should be some number of 9's of no test failures post-commit (i.e. for every 20 green pre-commit runs we introduce 1 flake post-commit). Not full perfection; it's not worth the compute and complexity.
>>>>   - We should *build* all branches on all supported JDKs (8 + 11 for older, 11 + 17 for newer, etc).
>>>>   - We should *run* all test suites with the *recommended configuration* against the *highest versioned JDK a branch supports*. And we should formally recommend our users run on that JDK.
>>>>   - We should *at least* run all jvm-based configurations on the highest supported JDK version with the "not recommended but still supported" configuration.
>>>>   - I'm open to being persuaded that we should at least run jvm-unit tests on the older JDK w/the conservative config pre-commit, but not much beyond that.
>>>> That would leave us with the following distilled:
>>>>
>>>> *Pre-commit:*
>>>>
>>>> - Build on all supported JDKs
>>>> - All test suites on highest supported JDK using recommended config
>>>> - Repeat testing on new or changed tests on highest supported JDK w/recommended config
>>>> - JDK-based test suites on highest supported JDK using other config
>>>>
>>>> *Post-commit:*
>>>>
>>>> - Run everything. All suites, all supported JDKs, both config files.
>>>>
>>>> With Butler + the *jenkins-jira* integration script <https://github.com/apache/cassandra-builds/blob/trunk/jenkins-jira-integration/jenkins_jira_integration.py> (need to dust that off but it should remain good to go), we should have a pretty clear view as to when any consistent regressions are introduced and why. We'd remain exposed to JDK-specific flake introductions and flakes in unchanged tests, but there's no getting around the 2nd one and I expect the former to be rare enough to not warrant the compute to prevent it.
>>>>
>>>> On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
>>>>
>>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit? If there really is only a small % of times the entire suite is useful, this seems like it could unblock the dev cycle but still have the benefit of the full test suite.
>>>>
>>>> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>>
>>>> On reducing CircleCI usage during dev while iterating, not with the intention of replacing the pre-commit CI (yet), we could get by with testing only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo. That would greatly reduce usage. I hacked it quickly here for illustration purposes: https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1 The good thing is that we have the tooling to dial in whatever we decide atm.
>>>>
>>>> Changing pre-commit is a different discussion, to which I agree btw. But the above could save time and $ big time during dev, and be done and merged in a matter of days imo.
>>>>
>>>> I can open a DISCUSS thread if we feel it's worth it.
>>>>
>>>> On 15/2/24 10:24, Mick Semb Wever wrote:
>>>>
>>>> Mick and Ekaterina (and everyone really) - any thoughts on what test coverage, if any, we should commit to for this new configuration? Acknowledging that we already have *a lot* of CI that we run.
>>>>
>>>> Branimir in this patch has already done some basic cleanup of test variations, so this is not a duplication of the pipeline. It's a significant improvement.
>>>>
>>>> I'm ok with cassandra_latest being committed and added to the pipeline, *if* the authors genuinely believe there's significant time and effort saved in doing so.
>>>>
>>>> How many broken tests are we talking about?
>>>> Are they consistently broken or flaky?
>>>> Are they ticketed up and 5.0-rc blockers?
>>>>
>>>> Having to deal with flakies and broken tests is an unfortunate reality of having a pipeline of 170k tests.
>>>> Despite real frustrations I don't believe the broken windows analogy is appropriate here; it's more of a "leave the campground cleaner" situation… That being said, knowingly introducing a few broken tests is not that either, but still, having to deal with a handful of consistently breaking tests for a short period of time is not the same cognitive burden as flakies. There are currently other broken tests in 5.0: VectorUpdateDeleteTest, upgrade_through_versions_test; are these compounding the frustrations?
>>>>
>>>> It's also been asked why we don't just enable the settings we recommend. These are settings we recommend for new clusters. Our existing cassandra.yaml needs to be tailored for existing clusters being upgraded, where we are very conservative about changing defaults.