Hi,

+1 on CCM not using CASSANDRA_USE_JDK11 to pick a JDK version. I don’t think 
it’s a good interface for expressing what JDK CCM should use. I checked and I 
don’t see it being used in the dtests, so it shouldn’t break anything? I am not 
sure what it was added for.

Cleaning up CCM so it only selects a new JDK for each node when something 
changes that actually impacts what JDK should be used (change in C* version, 
explicit config change) also makes sense.

Ariel

On Sat, May 25, 2024, at 12:09 AM, Jacek Lewandowski wrote:
> Thank you for all the opinions. That is useful for future work on CCM. 
> 
> 
> 
> When I implemented the changes that caused recent headaches, I wasn't aware 
> that the CCM code was so patchworked, which resulted in many surprises. I 
> apologize for that. Anyway, there is no reason to sit and complain instead of 
> fixing what I already found. 
> 
> 
> 
> There are some unit tests attached to CCM, and I wonder whether they are ever 
> run before merging a patch because I'm unaware of any CI. Along with my 
> recent patches, I've implemented quite comprehensive tests verifying whether 
> the env updating function works as "expected", but it turned out the problems 
> were outside that function. I'm making fun of the word "expected" because 
> precisely defined contracts for CCM do not seem to exist. I'd like us to 
> invest in unit tests and make them the contract guards. It is tough to verify 
> whether a patch would cause problems in some environments; we usually have to 
> run all workloads to confirm that, and this would not work in the future if 
> we expect the community to get more involved. This recent incident was an 
> excellent example of such a problem - I'm really thankful to Mick and Ariel 
> for running all the Apache and Apple workloads to verify the patch, which 
> consumed much of their time, and I cannot expect such involvement in the 
> future. 
> 
> 
> 
> To the point, let's pay attention to the unit tests, make sure they are 
> mocking Cassandra instead of running real programs so that they are fast, and 
> add some CI—if they are fast unit tests, I bet we can rely on some free tier 
> of GitHub actions. Adding the expectations we assumed when implementing the 
> CI systems for Cassandra will make it less likely to break something 
> accidentally.
> 
> 
> 
> In the end, I feel we strayed off the topic a bit - my question was quite 
> concrete - I'd like to remove the CASSANDRA_USE_JDK11 knob for CCM - it 
> should set it appropriately for Cassandra 4 so that the CCM user does not 
> have to bother. However, CCM should not use it to decide which Java version 
> to use. I'm unaware of any release cycle of CCM versions anywhere. Perhaps we 
> should do the following - tag a new version before the change and then add 
> the proposed change.
> 
> 
> 
> There are also other problems related to env setup - for example, when a user 
> or the dtest framework wants to force a certain Java version, it is honored 
> only for running a Cassandra node - it is not applied for running nodetool or 
> any other command line tool. Therefore, there is a broader question about 
> when the explicit Java version should be set - it feels like the correct 
> approach would be to set it up when a node is created rather than when it is 
> started so that the selection applies to running the server and all the 
> commands. This would simplify things significantly - instead of resolving env 
> and checking Java distributions each time we are about to run a node or a 
> tool - resolve the required env changes once we create a node or when we 
> update the installation directory, which we do when testing upgrades. Such 
> simplification would also remove some clutter from the logs. Can you remember 
> the whole environment logged frequently and twice when running a node?
> 
> 
> 
> Can we make this discussion conclusive?
> 
> 
> 
> Thanks!
> 
> 
> - - -- --- ----- -------- -------------
> Jacek Lewandowski
> 
> 
> pt., 24 maj 2024 o 20:55 Josh McKenzie <jmcken...@apache.org> napisał(a):
>> __
>>> The scripts that are in cassandra-builds seem like a starting point for 
>>> converging different CI systems so that they run the same set of tests in 
>>> as similar environments as possible
>> Yeah, I took a superset of circle and ASF tests to try and run 
>> :allthethings:. Part of how the checkstyle dependency check got in the way 
>> too, since we weren't running that on ASF CI. :)
>> 
>> Strong +1 to more convergence on what we're running in CI for sure.
>> 
>> On Fri, May 24, 2024, at 11:59 AM, Ariel Weisberg wrote:
>>> Hi,
>>> 
>>> There is definitely a mismatch between how the full range of dtests work 
>>> and the direction CCM is going in and we have some difficulty getting those 
>>> to match. I fully empathize with several of those CI systems not being 
>>> publicly visible/accessible, and the behavior of upgrade paths being 
>>> absolutely inscrutable relative to the environment variables that are set.
>>> 
>>> I am happy to volunteer to test things in advance on Apple's CI. I'll also 
>>> try to get on top of responding faster :-)
>>> 
>>> The window where reverting is useful is slightly past now that all the 
>>> issues I am aware of have been fixed, but in the future I think the burden 
>>> for revert might need to be lower. It's tough those because putting the 
>>> burden on ASF for non-ASF CI is not necessarily a given.
>>> 
>>> There is a big gap between CI systems where how they invoke the dtests 
>>> determines the exact set of tests they run and how they invoke CCM (and 
>>> which CCM bugs they expose). I really don't like this approach including 
>>> relying on environment variables to dictate dtests execution behavior. I 
>>> hope to have some time to spend on this once my live migration work is in a 
>>> better place.
>>> 
>>> Right now ASF CI is not running the upgrade paths that trigger JDK version 
>>> switching which is at the root of our recent problems. Once we close that 
>>> gap we should be in a much better place in terms of divergence.
>>> 
>>> The scripts that are in cassandra-builds seem like a starting point for 
>>> converging different CI systems so that they run the same set of tests in 
>>> as similar environments as possible and harness specific quirks are pushed 
>>> into specific integration points where things like pointing to private 
>>> mirrors is supported.
>>> 
>>> Additionally what I would like to see is that CI harnesses specify the 
>>> location of all JDKs, and then provide flags (not environment variables) to 
>>> the dtests that dictate what should be run. What is currently in Java path 
>>> or Java home shouldn't be relevant for any dtests IMO, I would like the 
>>> dtests (themselves or delegating to CCM) to juggle that themselves.
>>> 
>>> Those flags should also be as declarative as possible and require 
>>> specifying C* versions and JDK versions so if you want to run the set of 
>>> tests we required to commit you don't need keep changing how the dtests are 
>>> invoked. 
>>> 
>>> Ariel
>>> 
>>> On Thu, May 23, 2024, at 6:22 AM, Mick Semb Wever wrote:
>>>>> When starting Cassandra nodes, CCM uses the current env Java distribution 
>>>>> (defined by the JAVA_HOME env variable). This behavior is overridden in 
>>>>> three cases:
>>>>> 
>>>>> - Java version is not supported by the selected Cassandra distribution - 
>>>>> in which case, CCM looks for supported Java distribution across 
>>>>> JAVAx_HOME env variables
>>>>> 
>>>>> - Java version is specified explicitly (--jvm-version arg or jvm_version 
>>>>> param if used in Python)
>>>>> 
>>>>> - CASSANDRA_USE_JDK11 is defined in env, in which case, for Cassandra 4.x 
>>>>> CCM forces to use only JDK11
>>>>> 
>>>>> 
>>>>> 
>>>>> I want to ask you guys whether you are okay with removing the third 
>>>>> exception. If we remove it, Cassandra 4.x will not be treated in any 
>>>>> special way—CCM will use the current Java version, so if it is Java 11, 
>>>>> it will use Java 11 (and automatically set CASSANDRA_USE_JDK11), and if 
>>>>> it is Java 8, it will use Java 8 (and automatically unset 
>>>>> CASSANDRA_USE_JDK11). 
>>>>> 
>>>>> 
>>>>> 
>>>>> I think there is no need for CCM to use CASSANDRA_USE_JDK11 to make a 
>>>>> decision about which Java version to use as it adds more complexity, 
>>>>> makes it work differently for Cassandra 4.x than for other Cassandra 
>>>>> versions, and actually provides no value at all because if we work with 
>>>>> Cassandra having our env configured for Java 11, we have to have 
>>>>> CASSANDRA_USE_JDK11 and if not, we cannot have it. Therefore, CCM can be 
>>>>> based solely on the current Java version and not include the existence of 
>>>>> CASSANDRA_USE_JDK11 in the Java version selection process.
>>>>> 
>>>>> 
>>>>> WDYT? 
>>>> 
>>>>  
>>>> With the recent commits to ccm we have now broken three different CI 
>>>> systems, in numerous different ways.  All remain broken.
>>>> 
>>>> At this point in time, the default behaviour should be to revert those 
>>>> commits.  Not to discuss whether we can further remove existing 
>>>> functionality on the assumption we know all consumers, or that they are 
>>>> all reading this thread and agreeing.
>>>> 
>>>> In ccm, the jdk selection and switching does indeed deserve a clean up.  
>>>> We have found a number of superfluous ways of achieving the same thing 
>>>> that is leading to unnecessary code complexity.  But we should not be hard 
>>>> breaking things for downstream users and our CI.
>>>> 
>>>> The initial commit to ccm that broke things was to fix ccm running a 
>>>> binary 5.0-beta1 with a particular jdk.  This patch and subsequent fixes 
>>>> has included additional refactoring/cleaning changes that have broken a 
>>>> number of things, like jdk-switching and upgrade_through_versions tests.  
>>>> We keep trying to fix each breakage, but are also including additional 
>>>> adjustments "to do the right thing" that only ends up breaking yet another 
>>>> thing.   This shouldn't be how we apply changes to a library that has many 
>>>> (unknown) consumers, nor that we don't have full test coverage on.
>>>> 
>>>> Given the broken CI systems and the troubles we have already caused 
>>>> consumers, my recommendation is that these commits are reverted, and we 
>>>> live with the binary 5.0-beta1 breakage for now, while we more patiently 
>>>> work on a more complete and thorough fix.  Furthermore to the specific 
>>>> question in the post, I don't believe we should be removing working 
>>>> functionality without first a deprecation cycle, given that ccm has many 
>>>> unknown consumers.  This depreciation period can be time-based, since ccm 
>>>> doesn't have versions.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 

Reply via email to