Hey Sophie - I've got 2 in-flight PRs, each with more than 15 retries,
namely https://github.com/apache/kafka/pull/15023 and
https://github.com/apache/kafka/pull/15035

Justin filed a flaky test report here though:
https://issues.apache.org/jira/browse/KAFKA-16045

P

On Thu, Dec 21, 2023 at 3:18 PM Sophie Blee-Goldman <sop...@responsive.dev>
wrote:

> On a related note, has anyone else had trouble getting even a single run
> with no build failures lately? I've had multiple pure-docs PRs blocked for
> days or even weeks because of miscellaneous infra, test, and timeout
> failures. I know we just had a discussion about whether it's acceptable to
> ever merge with a failing build, and the consensus (which I agree with) was
> NO -- but seriously, this is getting ridiculous. The build might be the
> worst I've ever seen it, and it just makes it really difficult to maintain
> good will with external contributors.
>
> Take for example this small docs PR:
> https://github.com/apache/kafka/pull/14949
>
> It's on its 7th replay, with the first 6 runs all having (at least) one
> build that failed completely. The issues I saw on this one PR are a good
> summary of what I've been seeing elsewhere, so here's the briefing:
>
> 1. gradle issue:
>
> > * What went wrong:
> > Gradle could not start your build.
> > > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> >    > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> >       > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> >          > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> >             > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.
> >
>
> 2. git issue:
>
> > ERROR: Error cloning remote repo 'origin'
> > hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed
>
>
> 3. storage test calling System.exit (I think)
>
> > * What went wrong:
> > Execution failed for task ':storage:test'.
> > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> >   This problem might be caused by incorrect test process configuration.
>
>
> 4.  3/4 builds aborted suddenly for no clear reason
>
> 5. 1 build was aborted, 1 build failed due to a gradle(?) issue with a
> storage test:
>
> > Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
>
> > * What went wrong:
> > Execution failed for task ':storage:test'.
> > > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> >   This problem might be caused by incorrect test process configuration.
>
> 6.  Unknown issue with a core test:
>
> > Unexpected exception thrown.
> > org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
> >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
> >   at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
> >   at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
> >   at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
> >   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> >   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> >   at java.base/java.lang.Thread.run(Thread.java:1583)
> > Caused by: java.lang.IllegalArgumentException
> >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
> >   at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
> >   at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
> > ... 6 more
> > org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
> >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
> >   at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
> >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
> >   at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
> >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
> >   at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> > Caused by: java.net.ConnectException: Connection refused
> >   at java.base/sun.nio.ch.Net.pollConnect(Native Method)
> >   at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
> >   at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
> >   at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
> >   at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
> >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
> >   at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
> > ... 5 more
>
>
>
> > * What went wrong:
> > Execution failed for task ':core:test'.
> > > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
> >   This problem might be caused by incorrect test process configuration.
>
>
> I've seen almost all of the above issues multiple times, so it might be a
> good list to start with to focus any efforts on improving the build. That
> said, I'm not sure what we can really do about most of these, and not sure
> how to narrow down the root cause in the more mysterious cases of aborted
> builds and the builds that end with "finished with non-zero exit value 1"
> with no additional context (that I could find).
>
> If nothing else, there seems to be something happening in one (or more) of
> the storage tests, because by far the most common failure I've seen is the
> one in 3 & 5. Unfortunately it's not really clear to me how to tell which is
> the offending test, so I'm not even sure what to file a ticket for.
>
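
One possible way to narrow down the offending test in cases like 3 and 5 above, sketched under assumptions: if the build log records per-test STARTED and PASSED/FAILED/SKIPPED events, any test that started on the crashed executor but never reported a result is a suspect. The log line shape, the function name find_unfinished_tests, and the sample test names below are hypothetical and chosen for illustration; they are not taken from the actual Jenkins output.

```python
import re

# Assumed log line shape (hypothetical; adjust to the real build output):
#   Gradle Test Run :storage:test > Gradle Test Executor 73 > FooTest > testA() STARTED
EVENT_RE = re.compile(
    r"Gradle Test Executor (?P<executor>\d+) > (?P<test>.+) "
    r"(?P<event>STARTED|PASSED|FAILED|SKIPPED)\s*$"
)


def find_unfinished_tests(lines):
    """Return {executor id: [tests that STARTED but never reported a result]}."""
    running = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a per-test event line
        executor = int(match.group("executor"))
        test = match.group("test")
        if match.group("event") == "STARTED":
            running.setdefault(executor, []).append(test)
        elif test in running.get(executor, []):
            # PASSED, FAILED, or SKIPPED: the test completed normally.
            running[executor].remove(test)
    # Keep only executors that still have tests in flight.
    return {ex: tests for ex, tests in running.items() if tests}


# Hypothetical sample log, not real build output.
sample_log = [
    "Gradle Test Run :storage:test > Gradle Test Executor 73 > FooTest > testA() STARTED",
    "Gradle Test Run :storage:test > Gradle Test Executor 73 > FooTest > testA() PASSED",
    "Gradle Test Run :storage:test > Gradle Test Executor 73 > BarTest > testB() STARTED",
    "Gradle Test Run :core:test > Gradle Test Executor 74 > BazTest > testC() STARTED",
    "Gradle Test Run :core:test > Gradle Test Executor 74 > BazTest > testC() FAILED",
    "Process 'Gradle Test Executor 73' finished with non-zero exit value 1",
]

print(find_unfinished_tests(sample_log))
```

A test left in flight when its executor exits is only a candidate, since the exit could also have been triggered by a background thread from an earlier test.
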
> On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid>
> wrote:
>
> > The slowness of the CI is definitely causing us a lot of pain. I wonder if
> > we should move to a dedicated CI infrastructure for Kafka. Our integration
> > tests are quite heavy and ASF's CI is not really tuned for them. We could
> > tune it for our needs and this would also allow external companies to
> > sponsor more workers. I heard that we have a few cloud providers in
> > the community ;). I think that we should consider this. What do you think?
> > I already discussed this with the INFRA team. I could continue if we
> > believe that it is a way forward.
> >
> > Best,
> > David
> >
> > On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
> > <stanis...@confluent.io.invalid> wrote:
> >
> > > Hey Николай,
> > >
> > > > Apologies about this - I wasn't aware of this behavior. I have made all
> > > > the gists public.
> > >
> > >
> > >
> > > > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris
> > > > <greg.har...@aiven.io.invalid> wrote:
> > >
> > > > Hey Stan,
> > > >
> > > > Thanks for opening the discussion. I haven't been looking at overall
> > > > build duration recently, so it's good that you are calling it out.
> > > >
> > > > > I worry about us over-indexing on this one build, which itself appears
> > > > > to be an outlier. I only see one other build [1] above 6h overall in
> > > > > the last 90 days in this view: [2]
> > > > > And I don't see any overlap of failed tests in these two builds, which
> > > > > makes it less likely that these particular failed tests are the causes
> > > > > of long build times.
> > > >
> > > > Separately, I've been investigating build environment slowness, and
> > > > trying to connect it with test failures [3]. I observed that the CI
> > > > build environment is 2-20 times slower than my developer machine (M1
> > > > mac).
> > > > When I simulate a similar slowdown locally, there are tests which
> > > > > become significantly more flaky, often due to hard-coded timeouts.
> > > > I think that these particularly nasty builds could be explained by
> > > > long-tail slowdowns causing arbitrary tests to take an excessive time
> > > > to execute.
> > > >
> > > > Rather than trying to find signals in these rare test failures, I
> > > > think we should find tests that have these sorts of failures more
> > > > regularly.
> > > > There are lots of builds in the 5-6h duration bracket, which is
> > > > certainly unacceptably long. We should look into these builds to find
> > > > improvements and optimizations.
> > > >
> > > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> > > > [2]
> > > >
> > >
> >
> https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> > > > [3] https://github.com/apache/kafka/pull/15008
> > > >
> > > > Thanks for looking into this!
> > > > Greg
> > > >
> > > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org>
> > > > wrote:
> > > > >
> > > > > Hello, Stanislav.
> > > > >
> > > > > > Could you please make the gist public?
> > > > > > Private gists are not available to some GitHub users even if the
> > > > > > link is known.
> > > > >
> > > > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski
> > > > > > > <stanis...@confluent.io.INVALID> wrote:
> > > > > >
> > > > > > Hey everybody,
> > > > > > > I've heard various complaints that build times in trunk are taking too
> > > > > > > long, some taking as much as 8 hours (the timeout) - and this is slowing us
> > > > > > > down from being able to meet the code freeze deadline for 3.7.
> > > > > > >
> > > > > > > I took it upon myself to gather up some data in Gradle Enterprise to see if
> > > > > > > there are any outlier tests that are causing this slowness. Turns out there
> > > > > > > are a few, in this particular build - https://ge.apache.org/s/un2hv7n6j374k/
> > > > > > > - which took 10 hours and 29 minutes in total.
> > > > > > >
> > > > > > > I have compiled the tests that took a disproportionately large amount of
> > > > > > > time (20m+), alongside their time, error message and a link to their full
> > > > > > > log output here -
> > > > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > > > > > >
> > > > > > > It includes failures from core, streams, storage and clients.
> > > > > > > Interestingly, some other tests that don't fail also take a long time in
> > > > > > > what is apparently the test harness framework. See the gist for more
> > > > > > > information.
> > > > > > >
> > > > > > > I am starting this thread with the intention of getting the discussion
> > > > > > > started and brainstorming what we can do to get the build times back under
> > > > > > > control.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best,
> > > > > > Stanislav
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
