Paolo, I understand your proposal. I'm just concerned that the build sheriff role will keep rotating among the same people, since not everyone is available to volunteer or act as sheriff. I know that's not your intention, but it can happen. We can't expect every contributor to express interest in being build sheriff for a month, nor can we expect this to be something we can keep running sustainably.

Francisco, filing a new issue and fixing the problem on a separate PR is indeed the way to go, IMHO. What we usually do on `kie-tools` is:

1. Send a PR with a change.
2. Observe red PR checks, unrelated to the changes introduced.
3. Open an issue and send a separate PR targeting the same branch as the original PR, fixing the problem on the PR checks.
4. Review and merge this second PR, closing the new issue.
5. Retrigger PR checks on the original PR.
6. Observe a green build, review and merge it normally.

* Sometimes we skip opening an issue, if the effort to fix it is small enough, and we can use the PR description to provide enough context for reviewers and watchers of the repo.

The important thing, IMHO, is that the original PR doesn't get merged before the unrelated issue on PR checks is fixed. Otherwise we open a credit line that allows us to fall into tech debt :)

My view has always been that we need to collectively cherish our CI, PR checks and automations, seeing those as the canonical way to build our software. But it's been really hard to cherish something so distant from our day-to-day work, especially when we all can, to some extent, continue operating the same way we have for the last who knows how many years, somewhat ignoring the systems we currently have :/

On Thu, Aug 1, 2024 at 11:54 AM Francisco Javier Tirado Sarti <[email protected]> wrote:
>
> I forgot to mention that another topic that is difficult to fix but easy to
> discuss is whether the Jenkins machines executing the tests are properly
> dimensioned for the tests we are executing.
> For example, in the previous PR, the timeout to start up the Keycloak
> quarkus instance IT test was increased to 2 minutes, because the default of
> 1 minute does not seem to be enough for downloading and running the
> keycloak image.
> The CI is already taking ages.
> Either we increase our HW resources for testing, or we start reducing our
> test scope.
>
> On Thu, Aug 1, 2024 at 5:42 PM Francisco Javier Tirado Sarti <
> [email protected]> wrote:
>
> > By the way I opened
> > https://github.com/apache/incubator-kie-kogito-examples/pull/1991 for
> > fixing
> > https://ci-builds.apache.org/job/KIE/job/kogito/job/10.0.x/job/nightly/job/kogito-examples.build-and-deploy/17/
> > So it can be said that I acted as sheriff, but since I'm weak and cannot
> > hold the pressure, I pass the torch (or the star) to the next one ;)
> >
> >
> > On Thu, Aug 1, 2024 at 5:37 PM Francisco Javier Tirado Sarti <
> > [email protected]> wrote:
> >
> >> Hi Tiago,
> >> About point 2, when the issue blocking the merge is really unrelated,
> >> wouldn't it be a better approach to open a separate issue to fix the
> >> unrelated issue?
> >> I think we agree that it is better for tracking (so you do not see an
> >> unrelated change in a PR history) and will avoid the undesired situation of
> >> two developers trying to fix the same unrelated issue from two simultaneous
> >> PRs (one of the two eventually has to trigger the rebase and realize the
> >> broken test is already fixed, but still, there is less chance of them
> >> working on the same problem if there is an issue in the issue list)
> >>
> >>
> >> On Thu, Aug 1, 2024 at 4:58 PM Tiago Bento <[email protected]> wrote:
> >>
> >>> Thanks Paolo for starting this conversation. Let me bring a little bit
> >>> of my perspective to it.
> >>>
> >>> Although I agree that having people "dedicated" to the quality and
> >>> stability of our CI and other automations would be better than what we
> >>> have today, having our builds break so often that we need a system in
> >>> place to deal with them is a symptom of other problems, IMHO.
> >>>
> >>> The complexity of our CI systems and automations discourages
> >>> most people from getting involved.
> >>> Without the system itself changing and
> >>> being more approachable, having "build sheriffs" will only make the
> >>> separation between "development" and "CI" bigger, and we'll be reliant
> >>> on a small group of people who'll become solely responsible for either
> >>> fixing stuff other people broke, or chasing them to fix it. When
> >>> inevitably these experts can't or simply don't want to contribute to
> >>> this area of the community anymore, we're in big trouble.
> >>>
> >>> My opinion is that we could try and concentrate our efforts to reduce
> >>> the barrier to entry for maintaining the CI and automations we have,
> >>> while putting a system in place that will naturally have each one of
> >>> us know at least the basics of how the CI and automations work.
> >>>
> >>> From my experience maintaining `kie-tools`, a few things help in reaching
> >>> that point:
> >>> 1. Having local builds be as similar as possible to CI builds. No
> >>> fancy commands or profiles that only run on CI.
> >>> 2. Red PRs can't be merged. Ever. If your PR became red for "unrelated
> >>> reasons", you then become responsible for fixing the "unrelated issue",
> >>> helping everyone else not face the same problem.
> >>> 3. Having a CI system with the least amount of abstractions possible.
> >>> Less CI code == less cognitive load == smaller barrier to entry.
> >>>
> >>> Moving away from Jenkins for PR checks and concentrating on GitHub
> >>> Actions is, IMHO, already a great step in that direction.
> >>>
> >>> I hope I could bring something positive to the discussion.
> >>>
> >>> Thanks!
> >>>
> >>> Regards,
> >>>
> >>> Tiago Bento
> >>>
> >>> On Thu, Aug 1, 2024 at 10:08 AM Gabriele Cardosi
> >>> <[email protected]> wrote:
> >>> >
> >>> > Thanks for the clarification, Paolo!
> >>> >
> >>> > On Thu, Aug 1, 2024 at 3:46 PM Paolo Bizzarri <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Hi Gabriele,
> >>> > >
> >>> > > it is a mix of various stuff.
> >>> > >
> >>> > > For example, take the various issues that I reported in the analysis
> >>> > > done for 10.x branch. Most of them apply just the same for the main
> >>> > > branch.
> >>> > >
> >>> > > For example
> >>> > >
> >>> > > https://ci-builds.apache.org/job/KIE/job/kogito/job/main/job/tools/job/kogito-clean-old-nightly-images/
> >>> > >
> >>> > > Now this is probably a build that has to be just deleted - but still
> >>> > > it is always red, and we need someone who looks at it and decides that
> >>> > > yes, we need to get rid of it, creates a corresponding kie issue and
> >>> > > goes after it.
> >>> > >
> >>> > > Another example:
> >>> > >
> >>> > > https://ci-builds.apache.org/job/KIE/job/kogito/job/10.0.x/job/nightly/job/kogito-examples.build-and-deploy/17/
> >>> > >
> >>> > > This test has been failing almost every day in the last few days.
> >>> > > Either we need to make it a little more stable, or get rid of it.
> >>> > >
> >>> > > And so on.
> >>> > >
> >>> > > The goal of the sheriff is to keep the top level folder in good
> >>> > > health, and that means that all the underlying jobs are healthy.
> >>> > >
> >>> > > I hope this clarifies my proposal.
> >>> > >
> >>> > > Regards
> >>> > >
> >>> > > Paolo
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Thu, Aug 1, 2024 at 3:18 PM Gabriele Cardosi <[email protected]>
> >>> > > wrote:
> >>> > >
> >>> > > > Hi Paolo,
> >>> > > > could you explain exactly what you mean by "builds are often
> >>> > > > broken"? Could you give an example and, in that example, what
> >>> > > > the "sheriff" should do to manage it? (Sorry, I just need to
> >>> > > > understand what you are referring to)
> >>> > > >
> >>> > > > Thanks!
> >>> > > >
> >>> > > > On Thu, Aug 1, 2024 at 3:09 PM Paolo Bizzarri <[email protected]>
> >>> > > > wrote:
> >>> > > >
> >>> > > > > Hello kie mates,
> >>> > > > >
> >>> > > > > please find my proposal in the following.
> >>> > > > >
> >>> > > > > PROBLEM
> >>> > > > > - builds are often broken and they stay broken for a long time.
> >>> > > > > There seems to be no clear definition of who should take care of
> >>> > > > > this
> >>> > > > >
> >>> > > > > CONTEXT
> >>> > > > > - fixing builds is slow, annoying and typically is more a job of
> >>> > > > > chasing someone else than fixing it yourself. So it quickly
> >>> > > > > becomes wearing.
> >>> > > > >
> >>> > > > > PROPOSED SOLUTION
> >>> > > > > - identify a number of build sheriffs that look at the various
> >>> > > > > builds, open the relevant issues for tracking and chase other devs
> >>> > > > > and contributors to fix the issues themselves. The sheriffs are not
> >>> > > > > supposed to fix everything by themselves, but instead to keep the
> >>> > > > > attention of other developers on the status of the builds.
> >>> > > > > I suggest we have three sheriffs, that stay around for one month
> >>> > > > > and then pass the token to someone else: one for drools and
> >>> > > > > optaplanner, one for kogito, one for kie-tools.
> >>> > > > >
> >>> > > > > Let me know your ideas and feedback.
> >>> > > > >
> >>> > > > > Regards
> >>> > > > >
> >>> > > > > Paolo
> >>> > > > >
> >>> > > >
> >>> > >
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [email protected]
> >>> For additional commands, e-mail: [email protected]
> >>>
> >>>
