Hi, +1 on CCM not using CASSANDRA_USE_JDK11 to pick a JDK version. I don’t think it’s a good interface for expressing what JDK CCM should use. I checked and I don’t see it being used in the dtests, so it shouldn’t break anything? I am not sure what it was added for.
Cleaning up CCM so it only selects a new JDK for each node when something changes that actually impacts what JDK should be used (change in C* version, explicit config change) also makes sense. Ariel On Sat, May 25, 2024, at 12:09 AM, Jacek Lewandowski wrote: > Thank you for all the opinions. That is useful for future work on CCM. > > > > When I implemented the changes that caused recent headaches, I wasn't aware > that the CCM code was so patchworked, which resulted in many surprises. I > apologize for that. Anyway, there is no reason to sit and complain instead of > fixing what I already found. > > > > There are some unit tests attached to CCM, and I wonder whether they are ever > run before merging a patch because I'm unaware of any CI. Along with my > recent patches, I've implemented quite comprehensive tests verifying whether > the env updating function works as "expected", but it turned out the problems > were outside that function. I'm making fun of the word "expected" because > precisely defined contracts for CCM do not seem to exist. I'd like us to > invest in unit tests and make them the contract guards. It is tough to verify > whether a patch would cause problems in some environments; we usually have to > run all workloads to confirm that, and this would not work in the future if > we expect the community to get more involved. This recent incident was an > excellent example of such a problem - I'm really thankful to Mick and Ariel > for running all the Apache and Apple workloads to verify the patch, which > consumed much of their time, and I cannot expect such involvement in the > future. > > > > To the point, let's pay attention to the unit tests, make sure they are > mocking Cassandra instead of running real programs so that they are fast, and > add some CI—if they are fast unit tests, I bet we can rely on some free tier > of GitHub actions. Adding the expectations we assumed when implementing the > CI systems for Cassandra will make it less likely to break something > accidentally. > > > > In the end, I feel we strayed off the topic a bit - my question was quite > concrete - I'd like to remove the CASSANDRA_USE_JDK11 knob for CCM - it > should set it appropriately for Cassandra 4 so that the CCM user does not > have to bother. However, CCM should not use it to decide which Java version > to use. I'm unaware of any release cycle of CCM versions anywhere. Perhaps we > should do the following - tag a new version before the change and then add > the proposed change. > > > > There are also other problems related to env setup - for example, when a user > or the dtest framework wants to force a certain Java version, it is honored > only for running a Cassandra node - it is not applied for running nodetool or > any other command line tool. Therefore, there is a broader question about > when the explicit Java version should be set - it feels like the correct > approach would be to set it up when a node is created rather than when it is > started so that the selection applies to running the server and all the > commands. This would simplify things significantly - instead of resolving env > and checking Java distributions each time we are about to run a node or a > tool - resolve the required env changes once we create a node or when we > update the installation directory, which we do when testing upgrades. Such > simplification would also remove some clutter from the logs. Can you remember > the whole environment logged frequently and twice when running a node? > > > > Can we make this discussion conclusive? > > > > Thanks! > > > - - -- --- ----- -------- ------------- > Jacek Lewandowski > > > pt., 24 maj 2024 o 20:55 Josh McKenzie <jmcken...@apache.org> napisał(a): >> __ >>> The scripts that are in cassandra-builds seem like a starting point for >>> converging different CI systems so that they run the same set of tests in >>> as similar environments as possible >> Yeah, I took a superset of circle and ASF tests to try and run >> :allthethings:. Part of how the checkstyle dependency check got in the way >> too, since we weren't running that on ASF CI. :) >> >> Strong +1 to more convergence on what we're running in CI for sure. >> >> On Fri, May 24, 2024, at 11:59 AM, Ariel Weisberg wrote: >>> Hi, >>> >>> There is definitely a mismatch between how the full range of dtests work >>> and the direction CCM is going in and we have some difficulty getting those >>> to match. I fully empathize with several of those CI systems not being >>> publicly visible/accessible, and the behavior of upgrade paths being >>> absolutely inscrutable relative to the environment variables that are set. >>> >>> I am happy to volunteer to test things in advance on Apple's CI. I'll also >>> try to get on top of responding faster :-) >>> >>> The window where reverting is useful is slightly past now that all the >>> issues I am aware of have been fixed, but in the future I think the burden >>> for revert might need to be lower. It's tough those because putting the >>> burden on ASF for non-ASF CI is not necessarily a given. >>> >>> There is a big gap between CI systems where how they invoke the dtests >>> determines the exact set of tests they run and how they invoke CCM (and >>> which CCM bugs they expose). I really don't like this approach including >>> relying on environment variables to dictate dtests execution behavior. I >>> hope to have some time to spend on this once my live migration work is in a >>> better place. >>> >>> Right now ASF CI is not running the upgrade paths that trigger JDK version >>> switching which is at the root of our recent problems. Once we close that >>> gap we should be in a much better place in terms of divergence. >>> >>> The scripts that are in cassandra-builds seem like a starting point for >>> converging different CI systems so that they run the same set of tests in >>> as similar environments as possible and harness specific quirks are pushed >>> into specific integration points where things like pointing to private >>> mirrors is supported. >>> >>> Additionally what I would like to see is that CI harnesses specify the >>> location of all JDKs, and then provide flags (not environment variables) to >>> the dtests that dictate what should be run. What is currently in Java path >>> or Java home shouldn't be relevant for any dtests IMO, I would like the >>> dtests (themselves or delegating to CCM) to juggle that themselves. >>> >>> Those flags should also be as declarative as possible and require >>> specifying C* versions and JDK versions so if you want to run the set of >>> tests we required to commit you don't need keep changing how the dtests are >>> invoked. >>> >>> Ariel >>> >>> On Thu, May 23, 2024, at 6:22 AM, Mick Semb Wever wrote: >>>>> When starting Cassandra nodes, CCM uses the current env Java distribution >>>>> (defined by the JAVA_HOME env variable). This behavior is overridden in >>>>> three cases: >>>>> >>>>> - Java version is not supported by the selected Cassandra distribution - >>>>> in which case, CCM looks for supported Java distribution across >>>>> JAVAx_HOME env variables >>>>> >>>>> - Java version is specified explicitly (--jvm-version arg or jvm_version >>>>> param if used in Python) >>>>> >>>>> - CASSANDRA_USE_JDK11 is defined in env, in which case, for Cassandra 4.x >>>>> CCM forces to use only JDK11 >>>>> >>>>> >>>>> >>>>> I want to ask you guys whether you are okay with removing the third >>>>> exception. If we remove it, Cassandra 4.x will not be treated in any >>>>> special way—CCM will use the current Java version, so if it is Java 11, >>>>> it will use Java 11 (and automatically set CASSANDRA_USE_JDK11), and if >>>>> it is Java 8, it will use Java 8 (and automatically unset >>>>> CASSANDRA_USE_JDK11). >>>>> >>>>> >>>>> >>>>> I think there is no need for CCM to use CASSANDRA_USE_JDK11 to make a >>>>> decision about which Java version to use as it adds more complexity, >>>>> makes it work differently for Cassandra 4.x than for other Cassandra >>>>> versions, and actually provides no value at all because if we work with >>>>> Cassandra having our env configured for Java 11, we have to have >>>>> CASSANDRA_USE_JDK11 and if not, we cannot have it. Therefore, CCM can be >>>>> based solely on the current Java version and not include the existence of >>>>> CASSANDRA_USE_JDK11 in the Java version selection process. >>>>> >>>>> >>>>> WDYT? >>>> >>>> >>>> With the recent commits to ccm we have now broken three different CI >>>> systems, in numerous different ways. All remain broken. >>>> >>>> At this point in time, the default behaviour should be to revert those >>>> commits. Not to discuss whether we can further remove existing >>>> functionality on the assumption we know all consumers, or that they are >>>> all reading this thread and agreeing. >>>> >>>> In ccm, the jdk selection and switching does indeed deserve a clean up. >>>> We have found a number of superfluous ways of achieving the same thing >>>> that is leading to unnecessary code complexity. But we should not be hard >>>> breaking things for downstream users and our CI. >>>> >>>> The initial commit to ccm that broke things was to fix ccm running a >>>> binary 5.0-beta1 with a particular jdk. This patch and subsequent fixes >>>> has included additional refactoring/cleaning changes that have broken a >>>> number of things, like jdk-switching and upgrade_through_versions tests. >>>> We keep trying to fix each breakage, but are also including additional >>>> adjustments "to do the right thing" that only ends up breaking yet another >>>> thing. This shouldn't be how we apply changes to a library that has many >>>> (unknown) consumers, nor that we don't have full test coverage on. >>>> >>>> Given the broken CI systems and the troubles we have already caused >>>> consumers, my recommendation is that these commits are reverted, and we >>>> live with the binary 5.0-beta1 breakage for now, while we more patiently >>>> work on a more complete and thorough fix. Furthermore to the specific >>>> question in the post, I don't believe we should be removing working >>>> functionality without first a deprecation cycle, given that ccm has many >>>> unknown consumers. This depreciation period can be time-based, since ccm >>>> doesn't have versions. >>>> >>>> >>>> >>>> >>>> >>>> >>> >>