Currently, an ML test is failing reliably, and some HA tests fail
occasionally. Is someone looking into the ML test?

For HA, I will revert a commit that might be causing the HA
instabilities. Till is working on a proper fix, as far as I know.

On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> Thanks for the great work! :-)
>
> Regards,
> Chiwan Park
>
>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>> Awesome work guys!
>> And even more thanks for the detailed report... This troubleshooting summary
>> will undoubtedly be useful for all our Maven projects!
>>
>> Best,
>> Flavio
>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>
>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>>>
>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>> Hi all!
>>>>
>>>> After a few weeks of terrible build issues, I am happy to announce that the
>>>> build works properly again, and we actually get meaningful CI results.
>>>>
>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max and
>>>> I debugged the final issue and got the build back on track.
>>>>
>>>> ------------------
>>>> The Journey
>>>> ------------------
>>>>
>>>> (1) Failsafe Plugin
>>>>
>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>> tests did not result in a failed build.
>>>>
>>>> That is a pretty bad bug for a plugin whose only task is to run tests and
>>>> fail the build if a test fails.
>>>>
>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>
>>>>
>>>> (2) Failsafe Plugin Dependency Issues
>>>>
>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>> interoperate with Dependency Shading any more.
>>>>
>>>> Because of that, we switched to the Surefire Plugin.
>>>>
>>>>
>>>> (3) Fixing all the issues introduced in the meantime
>>>>
>>>> Naturally, a number of test instabilities had been introduced, which needed
>>>> to be fixed.
>>>>
>>>>
>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>
>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to the
>>>> test scope.
>>>> Because the configuration searched for tests in the "main" scope, no Yarn
>>>> tests were executed for a while, until the scope was fixed.
>>>>
>>>>
>>>> (5) Yarn Tests and JMX Metrics
>>>>
>>>> After the Yarn Tests were re-activated, we saw them fail due to warnings
>>>> created by the newly introduced metrics code. We could fix that by updating
>>>> the metrics code and temporarily not registering JMX beans for all metrics.
>>>>
>>>>
>>>> (6) Yarn / Surefire Deadlock
>>>>
>>>> Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
>>>> It turned out that those tests exercise a command line interface that
>>>> interacts with the standard input stream.
>>>>
>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>> communication with forked JVMs. Since Surefire internally locks the
>>>> standard input stream, the Yarn CLI cannot poll the standard input stream
>>>> without locking up and stalling the tests.
>>>>
>>>> We adjusted the tests, and now the build is happily green again.
>>>>
>>>> -----------------
>>>> Conclusions:
>>>> -----------------
>>>>
>>>>  - CI is terribly crucial. It took us weeks to deal with the fallout of a
>>>> period of unreliable CI.
>>>>
>>>>  - Maven could do a better job. A bug as crucial as the one that started
>>>> our problem should not occur in a test plugin like Failsafe. Also, the
>>>> constant change of semantics and dependency scopes is annoying. The
>>>> semantic changes are subtle, but for a build as complex as Flink, they make
>>>> a difference.
>>>>
>>>>  - File-based communication is rarely a good idea. The bug in the Failsafe
>>>> plugin was caused by improper file-based communication, as were some of
>>>> the instabilities we discovered.
>>>>
>>>> Greetings,
>>>> Stephan
>>>>
>>>>
>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>> metrics subsystem to register JMX beans, we see some tests failing due to
>>>> spontaneous JVM process kills. Whoever has a pointer there, please ping us!
>>>
>
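
Regarding the Yarn / Surefire deadlock in (6) quoted above: one general way
to keep such tests away from the real standard input is to let the CLI read
from an injected stream. The following is only a minimal sketch of that idea
under stated assumptions, not the actual change made to the Flink tests. The
class InteractiveCli, its run() method, and the "stop" command are
hypothetical names made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical stand-in for a CLI that reads commands from an input stream.
 * It never touches System.in directly; the caller decides which stream to use.
 */
public class InteractiveCli {
    private final BufferedReader in;
    private final PrintStream out;

    public InteractiveCli(InputStream in, PrintStream out) {
        this.in = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        this.out = out;
    }

    /** Reads commands until "stop" or end of stream; returns how many were handled. */
    public int run() {
        int handled = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String command = line.trim();
                if (command.isEmpty()) {
                    continue; // blank lines are not commands
                }
                handled++;
                if (command.equals("stop")) {
                    out.println("stopping");
                    break;
                }
                out.println("unknown command: " + command);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handled;
    }

    public static void main(String[] args) {
        // Test-style wiring: scripted input instead of the process's stdin,
        // so nothing here competes for the stream the test runner uses.
        InputStream scripted = new ByteArrayInputStream(
                "help\nstop\n".getBytes(StandardCharsets.UTF_8));
        int handled = new InteractiveCli(scripted, System.out).run();
        System.out.println("commands handled: " + handled);
    }
}
```

In production the same class would be constructed with System.in and
System.out, while a test passes a ByteArrayInputStream, so the test never
polls the stream that Surefire uses to talk to its forked JVMs.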
