Thanks Brandon and Mick, this is exactly the feedback I was looking for. The last thing we want to do is reduce the throughput of the already strained CI pipelines.
It sounds like this is a bigger task than just cutting over to ARM. I just want to reassure you, Brandon, that we certainly won't change anything without discussion on this thread first, especially if we're going to be reducing the number of boxes available by ~21% for no immediate value.

I'll be in touch next week. Enjoy your weekends.

Thanks,
Jackson

________________________________
From: Mick Semb Wever <m...@apache.org>
Sent: Saturday, May 25, 2024 2:37:14 AM
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: Updating Instaclustr donated Jenkins Agents

Jackson, we are very thankful for all the donations from Instaclustr. Getting people (and resources) involved in ARM maintenance and testing is desperately needed. More detailed feedback below.

On Fri, 24 May 2024 at 16:08, Brandon Williams <dri...@gmail.com> wrote:

> On Thu, May 23, 2024 at 5:51 PM Fleming, Jackson via dev <dev@cassandra.apache.org> wrote:
> > Primarily this would be moving from x86 instances to Graviton ARM based ones,
> > as we’ve seen a pretty good uptake of ARM usage, and we’d like to help ensure
> > that there’s good testing coverage across both x86 and ARM architectures.
>
> I just want to note that this will reduce the x86 pool from 42 to 31, and then we will have a parallel pipeline of 9 ARM agents (15 if the other 6 come back.) Currently I think we have about an 8 hour post-commit run time with 42 machines (though I'm sure there is room for improvement.)

Today only artifact/packaging jobs are routinely run on ARM servers, due to their limited number. They are currently disabled waiting on INFRA-25819. There are test jobs for arm, but they are not run routinely, as they are not part of any branch's pipeline. Last run of any arm job was 1 yr 8 mo ago.
This was mostly to discover what the arm test failures were, on a one-off basis (iirc there's a small handful, like supporting the older snappy compression option).

To include testing on arm in pre- and post-commit, we would need to 1. fix all failures, and 2. have a lot more arm agents.

We currently have 42 x86 agents. If we took away 9 we'd see throughput reduce to ~75% (turn-around times become 1.3x longer). And then, if we included arm testing in the pipeline the bottleneck would be the new 15 arm agents, meaning the overall throughput reduces to ~35% (turn-around times become 2.8x longer).

Our biggest hurdle to begin with is really people's time, not hardware. When we get to the hardware problem, 15 agents will be quite limiting (and likely deemed not enough).

Note, the standalone jenkinsfile in 5.0+ was designed to make running arm CI jobs (also on your own k8s) much easier.
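
For reference, the throughput estimates quoted in this thread can be reproduced with a simple proportional model. This is only a rough sketch: it assumes throughput scales linearly with the number of agents in the bottleneck pool, which the thread implies but does not state, and the function names are made up for illustration. With those assumptions it gives roughly 79% / 1.27x for the reduced x86 pool (close to the ~75% / 1.3x above) and roughly 36% / 2.8x when 15 arm agents become the bottleneck.

```python
# Rough model: throughput is assumed to be proportional to the number of
# agents in the slowest (bottleneck) pool. Real CI scheduling is messier.

X86_AGENTS_NOW = 42      # current x86 pool, from the thread
AGENTS_MOVED_TO_ARM = 9  # agents proposed to move from x86 to ARM
ARM_AGENTS_TOTAL = 15    # 9 moved plus the other 6, if they come back


def relative_throughput(agents_after: int, agents_before: int) -> float:
    """Fraction of the original throughput left after the pool shrinks."""
    return agents_after / agents_before


def turnaround_factor(agents_after: int, agents_before: int) -> float:
    """How many times longer a full pipeline run takes after the change."""
    return agents_before / agents_after


if __name__ == "__main__":
    x86_after = X86_AGENTS_NOW - AGENTS_MOVED_TO_ARM  # 33 agents left

    # Case 1: x86-only pipeline with the smaller pool.
    print(f"x86 only: {relative_throughput(x86_after, X86_AGENTS_NOW):.0%} "
          f"throughput, {turnaround_factor(x86_after, X86_AGENTS_NOW):.2f}x turnaround")

    # Case 2: arm tests join the pipeline, so the arm pool is the bottleneck.
    print(f"with arm: {relative_throughput(ARM_AGENTS_TOTAL, X86_AGENTS_NOW):.0%} "
          f"throughput, {turnaround_factor(ARM_AGENTS_TOTAL, X86_AGENTS_NOW):.2f}x turnaround")
```

Multiplying the turnaround factor by the current post-commit run time gives a ballpark for the new run time under the same linear assumption.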