Currently, an ML test is failing reliably, and some HA tests fail occasionally. Is someone looking into the ML test?
For HA, I will revert a commit that might be causing the HA instabilities. Till is working on a proper fix, as far as I know.

On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:

> Thanks for the great work! :-)
>
> Regards,
> Chiwan Park
>
>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>> Awesome work guys!
>> And even more thanks for the detailed report... This troubleshooting summary
>> will undoubtedly be useful for all our Maven projects!
>>
>> Best,
>> Flavio
>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>
>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>>>
>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>> Hi all!
>>>>
>>>> After a few weeks of terrible build issues, I am happy to announce that the
>>>> build works again properly, and we actually get meaningful CI results.
>>>>
>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max and
>>>> I debugged the final issue and got the build back on track.
>>>>
>>>> ------------------
>>>> The Journey
>>>> ------------------
>>>>
>>>> (1) Failsafe Plugin
>>>>
>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>> tests did not result in a failed build.
>>>>
>>>> That is a pretty bad bug for a plugin whose only task is to run tests and
>>>> fail the build if a test fails.
>>>>
>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>
>>>>
>>>> (2) Failsafe Plugin Dependency Issues
>>>>
>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>> interoperate with Dependency Shading any more.
>>>>
>>>> Because of that, we switched to the Surefire Plugin.
>>>>
>>>>
>>>> (3) Fixing all the issues introduced in the meantime
>>>>
>>>> Naturally, a number of test instabilities had been introduced, which needed
>>>> to be fixed.
>>>>
>>>>
>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>
>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to the
>>>> test scope.
>>>> Because the configuration searched for tests in the "main" scope, no Yarn
>>>> tests were executed for a while, until the scope was fixed.
>>>>
>>>>
>>>> (5) Yarn Tests and JMX Metrics
>>>>
>>>> After the Yarn Tests were re-activated, we saw them fail due to warnings
>>>> created by the newly introduced metrics code. We could fix that by updating
>>>> the metrics code and temporarily not registering JMX beans for all metrics.
>>>>
>>>>
>>>> (6) Yarn / Surefire Deadlock
>>>>
>>>> Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
>>>> It turned out that those tests exercise a command line interface that
>>>> interacts with the standard input stream.
>>>>
>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>> communication with forked JVMs. Since Surefire internally locks the
>>>> standard input stream, the Yarn CLI cannot poll the standard input stream
>>>> without locking up and stalling the tests.
>>>>
>>>> We adjusted the tests and now the build happily builds again.
>>>>
>>>> -----------------
>>>> Conclusions:
>>>> -----------------
>>>>
>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of a
>>>> period of unreliable CI.
>>>>
>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>> our problem should not occur in a test plugin like Surefire. Also, the
>>>> constant change of semantics and dependency scopes is annoying. The
>>>> semantic changes are subtle, but for a build as complex as Flink, they make
>>>> a difference.
>>>>
>>>> - File-based communication is rarely a good idea. The bug in the Failsafe
>>>> Plugin was caused by improper file-based communication, and some of our
>>>> discovered instabilities as well.
>>>>
>>>> Greetings,
>>>> Stephan
>>>>
>>>>
>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>> metrics subsystem to register JMX beans, we see some tests failing due to
>>>> spontaneous JVM process kills. Whoever has a pointer there, please ping us!
>>>
>
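For anyone who runs into the standard input problem from act (6) in their own tests: one possible way around it is to hand the CLI its input stream instead of letting it read System.in directly, so that a test can pass an in-memory stream and never competes with Surefire for the process's standard input. The sketch below only illustrates that idea and is not what the Flink tests actually do; the class CommandLoop and its "stop" command are made-up names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a CLI that reads commands line by line. */
class CommandLoop {
    private final InputStream in;

    // The stream is injected. Production code would pass System.in here;
    // a test passes an in-memory stream instead.
    CommandLoop(InputStream in) {
        this.in = in;
    }

    /** Reads lines until "stop" or end of stream and returns the commands seen before that. */
    List<String> run() {
        List<String> commands = new ArrayList<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String command = line.trim();
                if (command.equals("stop")) {
                    break;
                }
                if (!command.isEmpty()) {
                    commands.add(command);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return commands;
    }

    public static void main(String[] args) {
        // Simulated user input; the process's real standard input is never touched.
        InputStream fake = new ByteArrayInputStream(
                "help\nstatus\nstop\nignored\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(new CommandLoop(fake).run()); // prints [help, status]
    }
}
```

With this shape, only the production entry point ever references System.in, so the test behaves the same in the IDE and in a forked Surefire JVM.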