[
https://issues.apache.org/jira/browse/FLINK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-35571:
-----------------------------------
Labels: pull-request-available (was: )
> ProfilingServiceTest.testRollingDeletion intermittently fails due to improper test isolation
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-35571
> URL: https://issues.apache.org/jira/browse/FLINK-35571
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Environment: *Git revision:*
> {code:bash}
> $ git show
> commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
> {code}
> *Java info:*
> {code:bash}
> $ java -version
> openjdk version "17.0.11" 2024-04-16
> OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
> {code}
> {code:bash}
> $ sdk current
> Using:
> java: 17.0.11-tem
> maven: 3.8.6
> scala: 2.12.19
> {code}
> *OS info:*
> {code:bash}
> $ uname -av
> Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
> {code}
> *Hardware info:*
> {code:bash}
> $ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
> hw.memsize: 34359738368
> machdep.cpu.core_count: 12
> machdep.cpu.brand_string: Apple M2 Pro
> {code}
> Reporter: Grace Grimwood
> Priority: Major
> Labels: pull-request-available
> Attachments: 20240612_181148_mvn-clean-package_flink-runtime.log
>
>
> *Symptom:*
> The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the
> following error:
> {code:java}
> [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 s <<< FAILURE! -- in org.apache.flink.runtime.util.profiler.ProfilingServiceTest
> [ERROR] org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion -- Time elapsed: 9.264 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
>     at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>     at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
>     at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
>     at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
>     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> The number of extra files found varies from failure to failure.
> *Cause:*
> Many of the tests in *{{ProfilingServiceTest}}* rely on a specific
> configuration of the *{{ProfilingService}}* instance, but
> *{{ProfilingService.getInstance}}* does not check whether an existing
> instance's config matches the provided config before returning it. Because of
> this, and because JUnit does not guarantee a specific ordering of tests
> (unless they are specifically annotated), it is possible for these tests to
> receive an instance that does not behave in the expected way and therefore
> fail.
> *Analysis:*
> While troubleshooting the test failure, we added an extra assertion to
> *{{ProfilingServiceTest.setUp}}* to verify that the directory being written
> to was the expected one:
> {code:java}
> Assertions.assertEquals(tempDir.toString(), profilingService.getProfilingResultDir());
> {code}
> That assert produced the following failure:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
> {code}
> This failure shows that the *{{ProfilingService}}* returned by
> *{{ProfilingService.getInstance}}* in the setup is not using the correct
> directory, and therefore cannot be the correct instance for this test class
> because it has the wrong config.
> This is because the static method *{{ProfilingService.getInstance}}* tries
> to reuse any existing instance of *{{ProfilingService}}* before it creates a
> new one, and disregards any differences in config in doing so. If another
> test instantiates a *{{ProfilingService}}* with different config first and
> does not close it, that earlier instance is handed to
> *{{ProfilingServiceTest}}* rather than the new instance those tests seem to
> expect. This only affects the first test run in this class, because the
> teardown method that runs after every test explicitly closes the existing
> *{{ProfilingService}}* instance.
> Specifically, in the test failures I have observed, it seems that
> *{{ProfilingServiceTest.testRollingDeletion}}* fails if it is run _before_
> any other *{{ProfilingServiceTest}}* tests but _after_ the test methods in
> *{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a
> *{{TaskExecutor}}* via a *{{MiniCluster}}*). From what I've been able to
> gather, *{{TaskExecutor}}* calls *{{ProfilingService.getInstance}}* with
> default config and holds on to that instance internally, but does not
> attempt to close that *{{ProfilingService}}* instance when the
> *{{TaskExecutor}}* itself is closed. As a result, that instance is sometimes
> still around when *{{ProfilingServiceTest.setUp}}* runs, so it is passed to
> *{{ProfilingServiceTest.testRollingDeletion}}*. The test then fails because
> it incorrectly assumes that it has a new *{{ProfilingService}}* instance
> with a clean directory configured.
> Logs are attached, produced with the following command:
> {code:bash}
> mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl :flink-runtime
> {code}
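
The instance reuse described under Cause and Analysis can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `SingletonReuseSketch`, `Service`, and the helper methods are illustrative stand-ins, not Flink's actual `ProfilingService` code, and the directory strings are placeholders. It only shows how a `getInstance` that ignores the requested config hands a leaked instance to a later caller, and how closing the earlier instance avoids that.

```java
import java.util.Objects;

/** Hypothetical stand-in for a config-keyed singleton; not Flink's ProfilingService. */
final class SingletonReuseSketch {

    /** Minimal service that remembers the directory it was configured with. */
    static final class Service implements AutoCloseable {
        private static Service instance; // shared by every caller in the JVM

        private final String resultDir;

        private Service(String resultDir) {
            this.resultDir = Objects.requireNonNull(resultDir);
        }

        /** Returns the existing instance if there is one, ignoring the requested directory. */
        static synchronized Service getInstance(String resultDir) {
            if (instance == null) {
                instance = new Service(resultDir);
            }
            return instance;
        }

        String getResultDir() {
            return resultDir;
        }

        /** Clears the shared instance so the next getInstance call builds a fresh one. */
        @Override
        public void close() {
            synchronized (Service.class) {
                if (instance == this) {
                    instance = null;
                }
            }
        }
    }

    /** Directory seen by a caller that runs after another component leaked an instance. */
    static String dirSeenAfterLeak(String leakedDir, String requestedDir) {
        Service leaked = Service.getInstance(leakedDir); // owner never closes it
        Service seen = Service.getInstance(requestedDir); // requestedDir is ignored here
        String dir = seen.getResultDir();
        leaked.close(); // clean up so the sketch can be re-run from a fresh state
        return dir;
    }

    /** Directory seen when the earlier owner closes its instance first. */
    static String dirSeenAfterClose(String earlierDir, String requestedDir) {
        Service.getInstance(earlierDir).close();
        Service seen = Service.getInstance(requestedDir);
        String dir = seen.getResultDir();
        seen.close();
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(dirSeenAfterLeak("/tmp", "/tmp/junit123"));
        System.out.println(dirSeenAfterClose("/tmp", "/tmp/junit123"));
    }
}
```

In this sketch the first call reports the leaked instance's directory rather than the requested one, which mirrors the `setUp` assertion failure quoted above. The second call reports the requested directory because the earlier instance was closed first. Whether the actual fix should close the instance in its owner, compare configs in `getInstance`, or isolate the test in another way is not something this sketch decides.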
--
This message was sent by Atlassian Jira
(v8.20.10#820010)