Thank you for all the opinions. That is useful for future work on CCM. When I implemented the changes that caused recent headaches, I wasn't aware that the CCM code was so patchworked, which resulted in many surprises. I apologize for that. Anyway, there is no reason to sit and complain instead of fixing what I already found.
There are some unit tests attached to CCM, and I wonder whether they are ever run before merging a patch because I'm unaware of any CI. Along with my recent patches, I've implemented quite comprehensive tests verifying whether the env updating function works as "expected", but it turned out the problems were outside that function. I'm making fun of the word "expected" because precisely defined contracts for CCM do not seem to exist. I'd like us to invest in unit tests and make them the contract guards. It is tough to verify whether a patch would cause problems in some environments; we usually have to run all workloads to confirm that, and this would not work in the future if we expect the community to get more involved. This recent incident was an excellent example of such a problem - I'm really thankful to Mick and Ariel for running all the Apache and Apple workloads to verify the patch, which consumed much of their time, and I cannot expect such involvement in the future. To the point, let's pay attention to the unit tests, make sure they are mocking Cassandra instead of running real programs so that they are fast, and add some CI—if they are fast unit tests, I bet we can rely on some free tier of GitHub actions. Adding the expectations we assumed when implementing the CI systems for Cassandra will make it less likely to break something accidentally. In the end, I feel we strayed off the topic a bit - my question was quite concrete - I'd like to remove the CASSANDRA_USE_JDK11 knob for CCM - it should set it appropriately for Cassandra 4 so that the CCM user does not have to bother. However, CCM should not use it to decide which Java version to use. I'm unaware of any release cycle of CCM versions anywhere. Perhaps we should do the following - tag a new version before the change and then add the proposed change. There are also other problems related to env setup - for example, when a user or the dtest framework wants to force a certain Java version, it is honored only for running a Cassandra node - it is not applied for running nodetool or any other command line tool. Therefore, there is a broader question about when the explicit Java version should be set - it feels like the correct approach would be to set it up when a node is created rather than when it is started so that the selection applies to running the server and all the commands. This would simplify things significantly - instead of resolving env and checking Java distributions each time we are about to run a node or a tool - resolve the required env changes once we create a node or when we update the installation directory, which we do when testing upgrades. Such simplification would also remove some clutter from the logs. Can you remember the whole environment logged frequently and twice when running a node? Can we make this discussion conclusive? Thanks! - - -- --- ----- -------- ------------- Jacek Lewandowski pt., 24 maj 2024 o 20:55 Josh McKenzie <jmcken...@apache.org> napisał(a): > The scripts that are in cassandra-builds seem like a starting point for > converging different CI systems so that they run the same set of tests in > as similar environments as possible > > Yeah, I took a superset of circle and ASF tests to try and run > :allthethings:. Part of how the checkstyle dependency check got in the way > too, since we weren't running that on ASF CI. :) > > Strong +1 to more convergence on what we're running in CI for sure. > > On Fri, May 24, 2024, at 11:59 AM, Ariel Weisberg wrote: > > Hi, > > There is definitely a mismatch between how the full range of dtests work > and the direction CCM is going in and we have some difficulty getting those > to match. I fully empathize with several of those CI systems not being > publicly visible/accessible, and the behavior of upgrade paths being > absolutely inscrutable relative to the environment variables that are set. > > I am happy to volunteer to test things in advance on Apple's CI. I'll also > try to get on top of responding faster :-) > > The window where reverting is useful is slightly past now that all the > issues I am aware of have been fixed, but in the future I think the burden > for revert might need to be lower. It's tough those because putting the > burden on ASF for non-ASF CI is not necessarily a given. > > There is a big gap between CI systems where how they invoke the dtests > determines the exact set of tests they run and how they invoke CCM (and > which CCM bugs they expose). I really don't like this approach including > relying on environment variables to dictate dtests execution behavior. I > hope to have some time to spend on this once my live migration work is in a > better place. > > Right now ASF CI is not running the upgrade paths that trigger JDK version > switching which is at the root of our recent problems. Once we close that > gap we should be in a much better place in terms of divergence. > > The scripts that are in cassandra-builds seem like a starting point for > converging different CI systems so that they run the same set of tests in > as similar environments as possible and harness specific quirks are pushed > into specific integration points where things like pointing to private > mirrors is supported. > > Additionally what I would like to see is that CI harnesses specify the > location of all JDKs, and then provide flags (not environment variables) to > the dtests that dictate what should be run. What is currently in Java path > or Java home shouldn't be relevant for any dtests IMO, I would like the > dtests (themselves or delegating to CCM) to juggle that themselves. > > Those flags should also be as declarative as possible and require > specifying C* versions and JDK versions so if you want to run the set of > tests we required to commit you don't need keep changing how the dtests are > invoked. > > Ariel > > On Thu, May 23, 2024, at 6:22 AM, Mick Semb Wever wrote: > > When starting Cassandra nodes, CCM uses the current env Java distribution > (defined by the JAVA_HOME env variable). This behavior is overridden in > three cases: > > - Java version is not supported by the selected Cassandra distribution - > in which case, CCM looks for supported Java distribution across JAVAx_HOME > env variables > > - Java version is specified explicitly (--jvm-version arg or jvm_version > param if used in Python) > > - CASSANDRA_USE_JDK11 is defined in env, in which case, for Cassandra 4.x > CCM forces to use only JDK11 > > > I want to ask you guys whether you are okay with removing the third > exception. If we remove it, Cassandra 4.x will not be treated in any > special way—CCM will use the current Java version, so if it is Java 11, it > will use Java 11 (and automatically set CASSANDRA_USE_JDK11), and if it is > Java 8, it will use Java 8 (and automatically unset CASSANDRA_USE_JDK11). > > > I think there is no need for CCM to use CASSANDRA_USE_JDK11 to make a > decision about which Java version to use as it adds more complexity, makes > it work differently for Cassandra 4.x than for other Cassandra versions, > and actually provides no value at all because if we work with Cassandra > having our env configured for Java 11, we have to have CASSANDRA_USE_JDK11 > and if not, we cannot have it. Therefore, CCM can be based solely on the > current Java version and not include the existence of CASSANDRA_USE_JDK11 > in the Java version selection process. > > WDYT? > > > > With the recent commits to ccm we have now broken three different CI > systems, in numerous different ways. All remain broken. > > At this point in time, the default behaviour should be to revert those > commits. Not to discuss whether we can further remove existing > functionality on the assumption we know all consumers, or that they are all > reading this thread and agreeing. > > In ccm, the jdk selection and switching does indeed deserve a clean up. > We have found a number of superfluous ways of achieving the same thing that > is leading to unnecessary code complexity. But we should not be hard > breaking things for downstream users and our CI. > > The initial commit to ccm that broke things was to fix ccm running a > binary 5.0-beta1 with a particular jdk. This patch and subsequent fixes > has included additional refactoring/cleaning changes that have broken a > number of things, like jdk-switching and upgrade_through_versions tests. > We keep trying to fix each breakage, but are also including additional > adjustments "to do the right thing" that only ends up breaking yet another > thing. This shouldn't be how we apply changes to a library that has many > (unknown) consumers, nor that we don't have full test coverage on. > > Given the broken CI systems and the troubles we have already caused > consumers, my recommendation is that these commits are reverted, and we > live with the binary 5.0-beta1 breakage for now, while we more patiently > work on a more complete and thorough fix. Furthermore to the specific > question in the post, I don't believe we should be removing working > functionality without first a deprecation cycle, given that ccm has many > unknown consumers. This depreciation period can be time-based, since ccm > doesn't have versions. > > > > > > > > >