[jira] [Commented] (CASSANDRA-17563) Fix CircleCI Midres config

David Capwell (Jira) Thu, 21 Apr 2022 10:32:06 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525883#comment-17525883
 ]


David Capwell commented on CASSANDRA-17563:
-------------------------------------------

bq.  has only an addition of a new job in_jvm which match as resource usage the 
old single in_jvm job and no other changes applied.

Currently this is a massive amount of manual work to confirm, and there isn't a 
good source of truth to compare against; this patch moves the source of truth 
into a map so we know what happens (if you don't update you get default for the 
level, else you add an override)

We talked about this a lot in slack, and my personal feeling is every step in a 
manual process is another chance for error, so the more steps done the higher 
risk to do it incorrectly; the current process as I can find is the following

1) create config-2_1.yml.MIDRES and config-2_1.yml.HIGHER using the current 
patches
2) update config-2_1.yml with your change
3) update config-2_1.yml.MIDRES with your change, and figure out how to apply 
the updated resources
4) update config-2_1.yml.HIGHER with your change and figure out how to apply 
the updated resources (method does not match step 3, so the "how" is different 
here)
5) generate diff for MIDRES and update patch
6) generate diff for HIGHER and update patch
7) test LOWER - success is defined as "what failed before is the only thing 
failing now"
8) test MIDRES - success is defined as "what failed before is the only thing 
failing now"
9) test HIGHER - success is defined as "what failed before is the only thing 
failing now"

did I miss anything?

bq. There are changes to the other generate.sh script I haven't looked at but 
any change there need to be tested

given there are 0 tests for the script, 2 different people "testing" could 
yield different results, so we would need to have some way to define success 
that is agreed upon.  

For example, I fixed what I saw was a bug, when you ask it to generate LOWER, 
MIDRES, or HIGHER it doesn't actually update those files and instead only 
updates config.yml; the help page says this but to me this is unexpected 
behavior

-a updates config.yml.LOWER, config.yml.MIDRES, and config.yml.HIGHER,
-h updates config.yml only!

I am totally cool with -h updating config.yml as well, but it feels like a bug 
that config.yml.HIGHER isn't updated... so my patch changing that... which one 
of us is the bug?

Now, if we want to define it as "they did the same thing regardless of personal 
feelings about correct behavior" then I do know for a fact my patch is 
different; I am 100% ok reverting that difference

bq. My concern is we don't know who was using what and how and it was working 
fine for quite some time

I feel like a politician... can you define the word "what"?

bq. Do we want to rewrite the whole approach one week before freeze when people 
highly utilize CI to push their latest work?

the core change is moving away from patch to modifying the yaml tree; there are 
other changes but those are personal preference and 100% fine to drop... 

To me I ask the following question "if you yaml diff the old and new files, are 
there a difference?" if the answer is no, then there isn't much of a risk other 
than the script not working on an unknown laptop (which impacts generate.sh 
only, not CI configs).

Now, if you want to de-risk that, we could use this script to generate the 
patches, but we don't solve the real problem of patches applying when they 
shouldn't (which is how I broke MIDRES).  If we want to do that to lower risk 
before 4.1 freeze I am cool with that, but do not think that is a valid long 
term solution

bq. now we will have a mix of python and shell scripts, are we sure the 
community will accept that? 

that is something anyone who touches this needs to answer, which is why I tried 
to pull in anyone who touched this logic to get their feedback.  I do know that 
many in the community basically do this already (can tell by looking at circle 
ci as my private scripts rename things and cleanup our DAG), so its just moving 
part of that private logic into OSS to help maintain these files.

bq. I really like and appreciate how you added diff but I am confused  from the 
output what I am seeing actually. I see the new name and resource change.

do you mean the output of the dump of what each job's resource is?

{code}
$ diff midres.resources midres.resources.new
9a10
> j11_jvm_dtests_vnode  medium  10
22a24
> j8_jvm_dtests_vnode   large   10
{code}

so this is diff output, so ">" means that the right-hand-side has the 
following, but there is no matching on the left hand side... aka "new job"

you see that MIDRES j8 and j11 do not have matching resources!  This is because 
they don't now (as defined before the vnode patch), so I am pointing out that 
j11 and j8 run with different resources on MIDRES and that I am not changing 
that behavior in this patch (more than glad to if you desire)

{code}
j11_jvm_dtests  medium  10
j11_jvm_dtests_vnode  medium  10
j8_jvm_dtests large 10
j8_jvm_dtests_vnode large 10
{code}

they match in HIGHER but not MIDRES, I do not know if there was a reason for 
this before so I left it to be true still.

> Fix CircleCI Midres config
> --------------------------
>
>                 Key: CASSANDRA-17563
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17563
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Priority: Normal
>             Fix For: 4.1
>
>
> During CircleCI addition of a new job to the config, the midres file got 
> messy. Two of the immediate issues (but we need to verify all jobs will use 
> the right executors and resources):
>  * the new job needs to use higher parallelism as the original in-jvm job
>  *  j8_dtests_with_vnodes should get from midres 50 large but currently 
> midres makes it run with 25 and medium which fails around 100 tests



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (CASSANDRA-17563) Fix CircleCI Midres config

Reply via email to