On a related note, has anyone else had trouble getting even a single run with no build failures lately? I've had multiple pure-docs PRs blocked for days or even weeks because of miscellaneous infra, test, and timeout failures. I know we just had a discussion about whether it's ever acceptable to merge with a failing build, and the consensus (which I agree with) was NO -- but seriously, this is getting ridiculous. The build might be in the worst shape I've ever seen it, and that makes it really difficult to maintain goodwill with external contributors.
Take for example this small docs PR: https://github.com/apache/kafka/pull/14949

It's on its 7th replay, with the first 6 runs all having (at least) one build that failed completely. The issues I saw on this one PR are a good summary of what I've been seeing elsewhere, so here's the briefing:

1. gradle issue:

> * What went wrong:
> Gradle could not start your build.
> > Cannot create service of type BuildSessionActionExecutor using method LauncherServices$ToolingBuildSessionScopeServices.createActionExecutor() as there is a problem with parameter #21 of type FileSystemWatchingInformation.
> > Cannot create service of type BuildLifecycleAwareVirtualFileSystem using method VirtualFileSystemServices$GradleUserHomeServices.createVirtualFileSystem() as there is a problem with parameter #7 of type GlobalCacheLocations.
> > Cannot create service of type GlobalCacheLocations using method GradleUserHomeScopeServices.createGlobalCacheLocations() as there is a problem with parameter #1 of type List<GlobalCache>.
> > Could not create service of type FileAccessTimeJournal using GradleUserHomeScopeServices.createFileAccessTimeJournal().
> > Timeout waiting to lock journal cache (/home/jenkins/.gradle/caches/journal-1). It is currently in use by another Gradle instance.

2. git issue:

> ERROR: Error cloning remote repo 'origin'
> hudson.plugins.git.GitException: java.io.IOException: Remote call on builds43 failed

3. storage test calling System.exit (I think):

> * What went wrong:
> Execution failed for task ':storage:test'.
> > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > This problem might be caused by incorrect test process configuration.

4. 3/4 builds aborted suddenly for no clear reason.

5. 1 build was aborted, and 1 build failed due to a gradle(?) issue with a storage test:

> Failed to map supported failure 'org.opentest4j.AssertionFailedError: Failed to observe commit callback before timeout' with mapper 'org.gradle.api.internal.tasks.testing.failure.mappers.OpenTestAssertionFailedMapper@38bb78ea': null
>
> * What went wrong:
> Execution failed for task ':storage:test'.
> > Process 'Gradle Test Executor 73' finished with non-zero exit value 1
> > This problem might be caused by incorrect test process configuration.

6. Unknown issue with a core test:

> Unexpected exception thrown.
> org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:46952'.
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
>     at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
>     at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
>     at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: java.lang.IllegalArgumentException
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>     ... 6 more
>
> org.gradle.internal.remote.internal.ConnectException: Could not connect to server [1d62bf97-6a3e-441d-93b6-093617cbbea9 port:41289, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>     at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> Caused by: java.net.ConnectException: Connection refused
>     at java.base/sun.nio.ch.Net.pollConnect(Native Method)
>     at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:682)
>     at java.base/sun.nio.ch.SocketChannelImpl.finishTimedConnect(SocketChannelImpl.java:1191)
>     at java.base/sun.nio.ch.SocketChannelImpl.blockingConnect(SocketChannelImpl.java:1233)
>     at java.base/sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:102)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>     ... 5 more
>
> * What went wrong:
> Execution failed for task ':core:test'.
> > Process 'Gradle Test Executor 104' finished with non-zero exit value 1
> > This problem might be caused by incorrect test process configuration.

I've seen almost all of the above issues multiple times, so it might be a good list to start with if we want to focus any efforts on improving the build. That said, I'm not sure what we can really do about most of these, and I'm not sure how to narrow down the root cause in the more mysterious cases: the suddenly aborted builds, and the builds that end with "finished with non-zero exit value 1" and no additional context (that I could find).

If nothing else, there seems to be something happening in one (or more) of the storage tests, because by far the most common failure I've seen is the one in 3 & 5. Unfortunately it's not really clear to me how to tell which is the offending test, so I'm not even sure what to file a ticket for.
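One rough idea for pinning down the test (just an untested sketch on my part, not something in our build today; the class and file names below are made up): put a JUnit Platform TestExecutionListener on the test runtime classpath that appends every test start/finish to a file. If the forked JVM then dies with "non-zero exit value 1", whatever was STARTED but never FINISHED in that file is the likely culprit (e.g. something ending up in System.exit).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import org.junit.platform.engine.TestExecutionResult;
import org.junit.platform.launcher.TestExecutionListener;
import org.junit.platform.launcher.TestIdentifier;

// Hypothetical listener (name made up): records test start/finish per forked JVM,
// so the entry that was started but never finished survives even if the JVM is killed.
public class ExitCulpritListener implements TestExecutionListener {

    // One log file per test JVM, keyed by the process id.
    private final Path log = Path.of("build", "exit-culprit-" + ProcessHandle.current().pid() + ".log");

    @Override
    public void executionStarted(TestIdentifier id) {
        if (id.isTest()) {
            append("STARTED  " + id.getUniqueId());
        }
    }

    @Override
    public void executionFinished(TestIdentifier id, TestExecutionResult result) {
        if (id.isTest()) {
            append("FINISHED " + id.getUniqueId() + " " + result.getStatus());
        }
    }

    private void append(String line) {
        try {
            Files.writeString(log, line + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            // Best effort only; never fail a test run because of the listener itself.
        }
    }
}

It should get picked up automatically if it's registered through a META-INF/services/org.junit.platform.launcher.TestExecutionListener entry on the test classpath. If there's an existing Gradle or JUnit facility that already gives us this signal, even better -- happy to file a ticket once we can actually name the test.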
On Tue, Dec 19, 2023 at 11:55 PM David Jacot <dja...@confluent.io.invalid> wrote:

> The slowness of the CI is definitely causing us a lot of pain. I wonder if
> we should move to a dedicated CI infrastructure for Kafka. Our integration
> tests are quite heavy and ASF's CI is not really tuned for them. We could
> tune it for our needs and this would also allow external companies to
> sponsor more workers. I heard that we have a few cloud providers in
> the community ;). I think that we should consider this. What do you think?
> I already discussed this with the INFRA team. I could continue if we
> believe that it is a way forward.
>
> Best,
> David
>
> On Wed, Dec 20, 2023 at 12:17 AM Stanislav Kozlovski
> <stanis...@confluent.io.invalid> wrote:
>
> > Hey Николай,
> >
> > Apologies about this - I wasn't aware of this behavior. I have made all
> > the gists public.
> >
> > On Wed, Dec 20, 2023 at 12:09 AM Greg Harris
> > <greg.har...@aiven.io.invalid> wrote:
> >
> > > Hey Stan,
> > >
> > > Thanks for opening the discussion.
> > > I haven't been looking at overall build duration recently, so it's
> > > good that you are calling it out.
> > >
> > > I worry about us over-indexing on this one build, which itself appears
> > > to be an outlier. I only see one other build [1] above 6h overall in
> > > the last 90 days in this view: [2]
> > > And I don't see any overlap of failed tests in these two builds, which
> > > makes it less likely that these particular failed tests are the causes
> > > of long build times.
> > >
> > > Separately, I've been investigating build environment slowness, and
> > > trying to connect it with test failures [3]. I observed that the CI
> > > build environment is 2-20 times slower than my developer machine (M1 mac).
> > > When I simulate a similar slowdown locally, there are tests which
> > > become significantly more flakey, often due to hard-coded timeouts.
> > > I think that these particularly nasty builds could be explained by
> > > long-tail slowdowns causing arbitrary tests to take an excessive time
> > > to execute.
> > >
> > > Rather than trying to find signals in these rare test failures, I
> > > think we should find tests that have these sorts of failures more
> > > regularly. There are lots of builds in the 5-6h duration bracket, which
> > > is certainly unacceptably long. We should look into these builds to
> > > find improvements and optimizations.
> > >
> > > [1] https://ge.apache.org/s/ygh4gbz4uma6i/
> > > [2] https://ge.apache.org/scans?list.sortColumn=buildDuration&search.relativeStartTime=P90D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=America%2FNew_York
> > > [3] https://github.com/apache/kafka/pull/15008
> > >
> > > Thanks for looking into this!
> > > Greg
> > >
> > > On Tue, Dec 19, 2023 at 3:45 PM Николай Ижиков <nizhi...@apache.org>
> > > wrote:
> > > >
> > > > Hello, Stanislav.
> > > >
> > > > Can you please make the gist public?
> > > > Private gists are not available for some GitHub users even if the
> > > > link is known.
> > > >
> > > > > On Dec 19, 2023, at 17:33, Stanislav Kozlovski
> > > > > <stanis...@confluent.io.INVALID> wrote:
> > > > >
> > > > > Hey everybody,
> > > > > I've heard various complaints that build times in trunk are taking
> > > > > too long, some taking as much as 8 hours (the timeout) - and this
> > > > > is slowing us down from being able to meet the code freeze deadline
> > > > > for 3.7.
> > > > >
> > > > > I took it upon myself to gather up some data in Gradle Enterprise
> > > > > to see if there are any outlier tests that are causing this
> > > > > slowness. Turns out there are a few, in this particular build -
> > > > > https://ge.apache.org/s/un2hv7n6j374k/
> > > > > - which took 10 hours and 29 minutes in total.
> > > > >
> > > > > I have compiled the tests that took a disproportionately large
> > > > > amount of time (20m+), alongside their time, error message and a
> > > > > link to their full log output here -
> > > > > https://gist.github.com/stanislavkozlovski/8959f7ee59434f774841f4ae2f5228c2
> > > > >
> > > > > It includes failures from core, streams, storage and clients.
> > > > > Interestingly, some other tests that don't fail also take a long
> > > > > time in what is apparently the test harness framework. See the gist
> > > > > for more information.
> > > > >
> > > > > I am starting this thread with the intention of getting the
> > > > > discussion started and brainstorming what we can do to get the
> > > > > build times back under control.
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav
> >
> > --
> > Best,
> > Stanislav