The slowness of the CI is definitely causing us a lot of pain. I wonder if
we should move to dedicated CI infrastructure for Kafka. Our integration
tests are quite heavy, and ASF's CI is not really tuned for them. With our
own infrastructure we could tune it for our needs, and it would also allow
external companies to sponsor more workers. I heard that we have a few cloud
providers in the community ;). I think that we should consider this. What do
you think? I have already discussed this with the INFRA team and could
continue the conversation if we believe that this is the way forward.

Best,
David

On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
<stanis...@confluent.io.invalid> wrote:

> Hey Николай,
>
> Apologies about this - I wasn't aware of this behavior. I have made all the
> gists public.
>
>
>
> On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:
>
> > Hey Stan,
> >
> > Thanks for opening the discussion. I haven't been looking at overall
> > build duration recently, so it's good that you are calling it out.
> >
> > I worry about us over-indexing on this one build, which itself appears
> > to be an outlier. I only see one other build [1] above 6h overall in
> > the last 90 days in this view [2], and I don't see any overlap of
> > failed tests between those two builds, which makes it less likely that
> > these particular failed tests are the cause of the long build times.
> >
> > Separately, I've been investigating build environment slowness and
> > trying to connect it with test failures [3]. I observed that the CI
> > build environment is 2-20 times slower than my developer machine (an M1 Mac).
> > When I simulate a similar slowdown locally, some tests become
> > significantly more flaky, often due to hard-coded timeouts.
> > I think that these particularly nasty builds could be explained by
> > long-tail slowdowns causing arbitrary tests to take an excessive time
> > to execute.
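> >
> > To make that concrete, here is a rough sketch (not taken from any real
> > Kafka test; the consumer and expectedPartitions names are hypothetical)
> > of the kind of hard-coded timeout that breaks under a 10x slowdown,
> > versus polling with a helper like TestUtils.waitForCondition:
> >
> >     // Hypothetical flaky pattern: assumes the rebalance always completes
> >     // within 5 seconds, which fails on a machine that is 10x slower.
> >     Thread.sleep(5000);
> >     assertEquals(expectedPartitions, consumer.assignment());
> >
> >     // More tolerant pattern: poll until the condition holds, up to a
> >     // generous cap, so a slow environment only makes the test slower,
> >     // not flaky. (Assumes org.apache.kafka.test.TestUtils and JUnit.)
> >     TestUtils.waitForCondition(
> >         () -> consumer.assignment().equals(expectedPartitions),
> >         30_000L,
> >         "Consumer did not get the expected assignment in time");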
> >
> > Rather than trying to find signals in these rare test failures, I
> > think we should find tests that have these sorts of failures more
> > regularly.
> > There are lots of builds in the 5-6h duration bracket, which is
> > certainly unacceptably long. We should look into these builds to find
> > improvements and optimizations.
> >
> > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> > [3] https://github.com/apache/kafka/pull/15008
> >
> > Thanks for looking into this!
> > Greg
> >
> > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org>
> > wrote:
> > >
> > > Hello, Stanislav.
> > >
> > > Could you please make the gist public?
> > > Private gists are not available to some GitHub users, even if the link is
> > > known.
> > >
> > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:
> > > >
> > > > Hey everybody,
> > > > I've heard various complaints that build times in trunk are taking too
> > > > long, some taking as much as 8 hours (the timeout) - and this is
> > > > slowing us down from being able to meet the code freeze deadline for 3.7.
> > > >
> > > > I took it upon myself to gather up some data in Gradle Enterprise to see if
> > > > there are any outlier tests that are causing this slowness. Turns out there
> > > > are a few, in this particular build -
> > > > https://ge.apache.org/s/un2hv7n6j374k/
> > > > - which took 10 hours and 29 minutes in total.
> > > >
> > > > I have compiled the tests that took a disproportionately large amount of
> > > > time (20m+), alongside their time, error message, and a link to their full
> > > > log output here -
> > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > > >
> > > > It includes failures from core, streams, storage, and clients.
> > > > Interestingly, some other tests that don't fail also take a long time in
> > > > what is apparently the test harness framework. See the gist for more
> > > > information.
> > > >
> > > > I am starting this thread with the intention of getting the discussion
> > > > started and brainstorming what we can do to get the build times back under
> > > > control.
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Stanislav
> > >
> >
>
>
> --
> Best,
> Stanislav
>