Thanks David! I think this should help a lot!

While we should include these improvements, I think it is also good to
remind folks that a lot of these issues come from merging on builds that
regress the CI.
I know I'm not perfect at this (and have merged on flaky and failing
tests), but let's all be super careful going forward. There were a few
times I retried the build 10+ times and thought it was other issues with
the CI but the failed builds were actually due to the changes I wrote/was
reviewing.

We all need to work together on this to ensure the builds stay healthy!
Thanks all for being concerned about our builds!

Justine

On Fri, Dec 22, 2023 at 6:02 AM David Jacot <david.ja...@gmail.com> wrote:

> I just merged both PRs.
>
> Cheers,
> David
>
> On Fri, Dec 22, 2023 at 2:38 PM David Jacot <david.ja...@gmail.com> wrote:
>
> > Hey folks,
> >
> > I believe that my two PRs will fix most of the issues. I have also tweaked
> > the configuration of Jenkins to fix the issues relating to cloning the
> > repo. There may be other issues but the overall situation should be much
> > better when I merge those two.
> >
> > I will update this thread when I merge them.
> >
> > Cheers,
> > David
> >
> > On Fri, Dec 22, 2023 at 2:22 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:
> >
> >> Hey folks
> >>
> >> I think David (dajac) has some fixes lined-up to improve CI such as
> >> https://github.com/apache/kafka/pull/15063 and
> >> https://github.com/apache/kafka/pull/15062.
> >>
> >> I have some bandwidth for the next two days to work on fixing the CI. Let
> >> me start by taking a look at the list that Sophie shared here.
> >>
> >> --
> >> Divij Vaidya
> >>
> >>
> >>
> >> On Fri, Dec 22, 2023 at 2:05 PM Luke Chen <show...@gmail.com> wrote:
> >>
> >> > Hi Sophie and Philip and all,
> >> >
> >> > I share the same pain as you.
> >> > I've been waiting for a CI build result on a PR for days. Unfortunately, I
> >> > can only get one result per day because each run takes 8 hours, and
> >> > the results are failures. :(
> >> >
> >> > I've looked into the 8-hour build timeout issue and would like to propose
> >> > setting a global test timeout of 10 minutes using the JUnit 5 default timeouts feature
> >> > <https://junit.org/junit5/docs/current/user-guide/#writing-tests-declarative-timeouts-default-timeouts>.
> >> > This way, we can fail those long-running tests quickly without impacting
> >> > other tests.
> >> > PR: https://github.com/apache/kafka/pull/15065
> >> > I've tested in my local environment and it works as expected.
> >> >
> >> > Any feedback is welcome.
> >> >
> >> > Thanks.
> >> > Luke
> >> >
> >> > On Fri, Dec 22, 2023 at 8:08 AM Philip Nee <philip...@gmail.com> wrote:
> >> >
> >> > > Hey Sophie - I've got 2 in-flight PRs, each with more than 15 retries...
> >> > > Namely: https://github.com/apache/kafka/pull/15023 and
> >> > > https://github.com/apache/kafka/pull/15035
> >> > >
> >> > > Justine filed a flaky test report here though:
> >> > > https://issues.apache.org/jira/browse/KAFKA-16045
> >> > >
> >> > > P
> >> > >
> >> > > On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev> wrote:
> >> > >
> >> > > > On a related note, has anyone else had trouble getting even a single run
> >> > > > with no build failures lately? I've had multiple pure-docs PRs blocked for
> >> > > > days or even weeks because of miscellaneous infra, test, and timeout
> >> > > > failures. I know we just had a discussion about whether it's acceptable to
> >> > > > ever merge with a failing build, and the consensus (which I agree with) was
> >> > > > NO -- but seriously, this is getting ridiculous. The build might be the
> >> > > > worst I've ever seen it, and it just makes it really difficult to maintain
> >> > > > good will with external contributors.
> >> > > >
> >> > > > Take for example this small docs PR:
> >> > > > https://github.com/apache/kafka/pull/14949
> >> > > >
> >> > > > It's on its 7th replay, with the first 6 runs all having (at least) one
> >> > > > build that failed completely. The issues I saw on this one PR are a good
> >> > > > summary of what I've been seeing elsewhere, so here's the briefing:
> >> > > >
> >> > > > 1. gradle issue:
> >> > > >
> >> > > > > * What went wrong:
> >> > > > > Gradle could not start your build.
> >> > > > > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> >> > > > >    > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> >> > > > >       > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> >> > > > >          > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> >> > > > >             > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
> >> > > > >
> >> > > >
> >> > > > 2. git issue:
> >> > > >
> >> > > > > ERROR: Error cloning remote repo 'origin'
> >> > > > > hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed
> >> > > >
> >> > > >
> >> > > > 3. storage test calling System.exit (I think)
> >> > > >
> >> > > > > * What went wrong:
> >> > > > > Execution failed for task ':storage:test'.
> >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> >> > > > >   This problem might be caused by incorrect test process configuration.
> >> > configuration.
> >> > > >
> >> > > >
> >> > > > 4.  3/4 builds aborted suddenly for no clear reason
> >> > > >
> >> > > > 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a
> >> > > > storage test:
> >> > > >
> >> > > > > Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
> >> > > > >
> >> > > > > * What went wrong:
> >> > > > > Execution failed for task ':storage:test'.
> >> > > > > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> >> > > > >   This problem might be caused by incorrect test process configuration.
> >> > > > >
> >> > > >
> >> > > > 6.  Unknown issue with a core test:
> >> > > >
> >> > > > > Unexpected exception thrown.
> >> > > > > org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
> >> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
> >> > > > >   at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
> >> > > > >   at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
> >> > > > >   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
> >> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> >> > > > >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> >> > > > >   at java.base/java.lang.Thread.run(Thread.java:1583)
> >> > > > > Caused by: java.lang.IllegalArgumentException
> >> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
> >> > > > >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
> >> > > > >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
> >> > > > > ... 6 more
> >> > > > > org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
> >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
> >> > > > >   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
> >> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
> >> > > > >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
> >> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
> >> > > > >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> >> > > > > Caused by: java.net.ConnectException: Connection refused
> >> > > > >   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
> >> > > > >   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
> >> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
> >> > > > >   at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
> >> > > > >   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
> >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
> >> > > > >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
> >> > > > > ... 5 more
> >> > > > >
> >> > > > > * What went wrong:
> >> > > > > Execution failed for task ':core:test'.
> >> > > > > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
> >> > > > >   This problem might be caused by incorrect test process configuration.
> >> > > >
> >> > > >
> >> > > > I've seen almost all of the above issues multiple times, so it might be a
> >> > > > good list to start with to focus any efforts on improving the build. That
> >> > > > said, I'm not sure what we can really do about most of these, and not sure
> >> > > > how to narrow down the root cause in the more mysterious cases of aborted
> >> > > > builds and the builds that end with "finished with non-zero exit value 1"
> >> > > > with no additional context (that I could find)
> >> > > >
> >> > > > If nothing else, there seems to be something happening in one (or more) of
> >> > > > the storage tests, because by far the most common failure I've seen is that
> >> > > > in 3 & 5. Unfortunately it's not really clear to me how to tell which is
> >> > > > the offending test, so I'm not even sure what to file a ticket for
> >> > > >
> >> > > > On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:
> >> > > >
> >> > > > > The slowness of the CI is definitely causing us a lot of pain. I wonder if
> >> > > > > we should move to a dedicated CI infrastructure for Kafka. Our integration
> >> > > > > tests are quite heavy and ASF's CI is not really tuned for them. We could
> >> > > > > tune it for our needs and this would also allow external companies to
> >> > > > > sponsor more workers. I heard that we have a few cloud providers in
> >> > > > > the community ;). I think that we should consider this. What do you think?
> >> > > > > I already discussed this with the INFRA team. I could continue if we
> >> > > > > believe that it is a way forward.
> >> > > > >
> >> > > > > Best,
> >> > > > > David
> >> > > > >
> >> > > > > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
> >> > > > > <stanis...@confluent.io.invalid> wrote:
> >> > > > >
> >> > > > > > Hey Николай,
> >> > > > > >
> >> > > > > > Apologies about this - I wasn't aware of this behavior. I have made all the
> >> > > > > > gists public.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:
> >> > > > > >
> >> > > > > > > Hey Stan,
> >> > > > > > >
> >> > > > > > > Thanks for opening the discussion. I haven't been looking at overall
> >> > > > > > > build duration recently, so it's good that you are calling it out.
> >> > > > > > >
> >> > > > > > > I worry about us over-indexing on this one build, which itself appears
> >> > > > > > > to be an outlier. I only see one other build [1] above 6h overall in
> >> > > > > > > the last 90 days in this view: [2]
> >> > > > > > > And I don't see any overlap of failed tests in these two builds, which
> >> > > > > > > makes it less likely that these particular failed tests are the causes
> >> > > > > > > of long build times.
> >> > > > > > >
> >> > > > > > > Separately, I've been investigating build environment slowness, and
> >> > > > > > > trying to connect it with test failures [3]. I observed that the CI
> >> > > > > > > build environment is 2-20 times slower than my developer machine (M1
> >> > > > > > > mac).
> >> > > > > > > When I simulate a similar slowdown locally, there are tests which
> >> > > > > > > become significantly more flakey, often due to hard-coded timeouts.
> >> > > > > > > I think that these particularly nasty builds could be explained by
> >> > > > > > > long-tail slowdowns causing arbitrary tests to take an excessive time
> >> > > > > > > to execute.
> >> > > > > > >
> >> > > > > > > Rather than trying to find signals in these rare test failures, I
> >> > > > > > > think we should find tests that have these sorts of failures more
> >> > > > > > > regularly.
> >> > > > > > > There are lots of builds in the 5-6h duration bracket, which is
> >> > > > > > > certainly unacceptably long. We should look into these builds to find
> >> > > > > > > improvements and optimizations.
> >> > > > > > >
> >> > > > > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> >> > > > > > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> >> > > > > > > [3] https://github.com/apache/kafka/pull/15008
> >> > > > > > >
> >> > > > > > > Thanks for looking into this!
> >> > > > > > > Greg
> >> > > > > > >
> >> > > > > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org> wrote:
> >> > > > > > > >
> >> > > > > > > > Hello, Stanislav.
> >> > > > > > > >
> >> > > > > > > > Could you please make the gist public?
> >> > > > > > > > Private gists are not available to some GitHub users even if the link is known.
> >> > > > > > > >
> >> > > > > > > > > On Dec 19, 2023, at 5:33 PM, Stanislav Kozlovski <stanis...@confluent.io.INVALID> wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hey everybody,
> >> > > > > > > > > I've heard various complaints that build times in trunk are taking too
> >> > > > > > > > > long, some taking as much as 8 hours (the timeout) - and this is slowing us
> >> > > > > > > > > down from being able to meet the code freeze deadline for 3.7.
> >> > > > > > > > >
> >> > > > > > > > > I took it upon myself to gather up some data in Gradle Enterprise to see if
> >> > > > > > > > > there are any outlier tests that are causing this slowness. Turns out there
> >> > > > > > > > > are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
> >> > > > > > > > > - which took 10 hours and 29 minutes in total.
> >> > > > > > > > >
> >> > > > > > > > > I have compiled the tests that took a disproportionately large amount of
> >> > > > > > > > > time (20m+), alongside their time, error message and a link to their full
> >> > > > > > > > > log output here -
> >> > > > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> >> > > > > > > > >
> >> > > > > > > > > It includes failures from core, streams, storage and clients.
> >> > > > > > > > > Interestingly, some other tests that don't fail also take a long time in
> >> > > > > > > > > what is apparently the test harness framework. See the gist for more
> >> > > > > > > > > information.
> >> > > > > > > > >
> >> > > > > > > > > I am starting this thread with the intention of getting the discussion
> >> > > > > > > > > started and brainstorming what we can do to get the build times back under
> >> > > > > > > > > control.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Best,
> >> > > > > > > > > Stanislav
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Best,
> >> > > > > > Stanislav
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
