With the recent fixes, the builds are more stable, but I still see many of them failing because of the Scala shell tests, which lead to JVM crashes. I've looked into this a little, but didn't find an obvious solution to the problem.

Does it make sense to disable the tests until someone has time to look into it?
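If we do disable them for now, here is a minimal sketch of what that could look like (assuming the shell suite is JUnit-based like most of our ITCases; the class and test names below are only illustrative, not the real ones):

import org.junit.{Ignore, Test}

// Illustrative only: not the actual flink-scala-shell test class.
// A class-level @Ignore makes JUnit skip every test in the suite, so the
// tests stay in the tree and can be re-enabled by removing a single line.
@Ignore("Temporarily disabled: the Scala shell tests crash the JVM on CI")
class ScalaShellSmokeTestSketch {

  @Test
  def startAndStopShell(): Unit = {
    // the existing test bodies would stay untouched here
  }
}

That would keep the rest of the CI signal usable while someone digs into the JVM crashes.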
– Ufuk

On Tue, May 31, 2016 at 1:46 PM, Stephan Ewen <se...@apache.org> wrote:
> You are right, Chiwan.
>
> I think that this pattern you use should be supported, though. Would be
> good to check if the job executes more often than necessary at the point
> of the "collect()" calls. That would explain the network buffer issue then...
>
> On Tue, May 31, 2016 at 12:18 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> Hi Stephan,
>>
>> Yes, right. But KNNITSuite calls
>> ExecutionEnvironment.getExecutionEnvironment only once [1]. I’m testing
>> with moving the getExecutionEnvironment call into each test case.
>>
>> [1]:
>> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
>>
>> Regards,
>> Chiwan Park
>>
>> > On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > Hi Chiwan!
>> >
>> > I think the ExecutionEnvironment is not shared, because what the
>> > TestEnvironment sets is a Context Environment Factory. Every time you
>> > call "ExecutionEnvironment.getExecutionEnvironment()", you get a new
>> > environment.
>> >
>> > Stephan
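(For illustration, a minimal sketch of the pattern Chiwan describes above, i.e. calling getExecutionEnvironment inside each test case instead of once for the whole suite. The suite name and the data are made up; this is not the actual KNNITSuite code.)

import org.apache.flink.api.scala._
import org.scalatest.{FlatSpec, Matchers}

// Sketch only: each test case asks for its own environment instead of the
// suite sharing a single val across all cases.
class PerTestEnvironmentSketch extends FlatSpec with Matchers {

  behavior of "a suite that does not share its ExecutionEnvironment"

  it should "get a fresh environment in the first test case" in {
    val env = ExecutionEnvironment.getExecutionEnvironment  // fresh per test
    env.fromElements(1, 2, 3).collect() should have size 3
  }

  it should "get another fresh environment in the second test case" in {
    val env = ExecutionEnvironment.getExecutionEnvironment  // not reused from above
    env.fromElements("a", "b").collect() should have size 2
  }
}

(As Stephan notes, each of those calls goes through the context environment factory that the TestEnvironment installs, so the two cases end up with separate environments.)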
>> >
>> > On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >
>> >> I’ve created a JIRA issue [1] related to the KNN test cases. I will
>> >> send a PR for it.
>> >>
>> >> From my investigation [2], the cluster for the ML tests has only one
>> >> taskmanager with 4 slots. Is 2048 insufficient as the total number of
>> >> network buffers? I still think the problem is sharing the
>> >> ExecutionEnvironment between test cases.
>> >>
>> >> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>> >> [2]:
>> >> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>> >>
>> >> Regards,
>> >> Chiwan Park
>> >>
>> >>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>> >>>
>> >>> Thanks Stephan for the synopsis of our last weeks' test instability
>> >>> madness. It's sad to see the shortcomings of the Maven test plugins,
>> >>> but another lesson learned is that our testing infrastructure should
>> >>> get a bit more attention. We have reached a point several times where
>> >>> our tests were inherently unstable. Now we saw that even more problems
>> >>> were hidden in the dark. I would like to see more maintenance
>> >>> dedicated to testing.
>> >>>
>> >>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>> >>> request with a systematic fix. Those things are too crucial to be
>> >>> fixed on the go. The problem is that Travis reports the number of
>> >>> processors to be "32" (which is used for the number of task slots in
>> >>> local execution). The network buffers are not adjusted accordingly.
>> >>> We should set them correctly in the MiniCluster. Also, we could
>> >>> define an upper limit to the number of task slots for tests.
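(A sketch of what Max's two suggestions could look like in the test cluster setup: cap the slot count instead of trusting the reported core count, and size the network buffers to match. The string keys are the usual 1.x configuration keys; the buffer formula is only an illustration, not a tuned value.)

import org.apache.flink.configuration.Configuration

// Sketch only: build a test-cluster configuration with an upper limit on the
// task slots and a matching number of network buffers.
object TestClusterConfigSketch {

  def buildConfig(reportedCores: Int): Configuration = {
    val conf = new Configuration()
    val slots = math.min(reportedCores, 4)          // upper limit for CI machines
    val taskManagers = 1
    val buffers = slots * slots * taskManagers * 4  // rough per-slot buffer budget

    conf.setInteger("taskmanager.numberOfTaskSlots", slots)
    conf.setInteger("taskmanager.network.numberOfBuffers", buffers)
    conf
  }
}

(FlinkTestBase, linked by Chiwan above, would be the natural place to apply such a cap for the ML tests.)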
>> >>>
>> >>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>> I think that the tests fail because of sharing the ExecutionEnvironment
>> >>>> between test cases. I’m not sure why it is a problem, but it is the
>> >>>> only difference from the other ML tests.
>> >>>>
>> >>>> I created a hotfix and pushed it to my repository. When it seems fixed
>> >>>> [1], I’ll merge the hotfix to the master branch.
>> >>>>
>> >>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>> >>>>
>> >>>> Regards,
>> >>>> Chiwan Park
>> >>>>
>> >>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>>>
>> >>>>> Maybe it is about the KNN test case which was merged yesterday.
>> >>>>> I’ll look into the ML test.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Chiwan Park
>> >>>>>
>> >>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Currently, an ML test is reliably failing and occasionally some HA
>> >>>>>> tests. Is someone looking into the ML test?
>> >>>>>>
>> >>>>>> For HA, I will revert a commit, which might cause the HA
>> >>>>>> instabilities. Till is working on a proper fix as far as I know.
>> >>>>>>
>> >>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>>>>> Thanks for the great work! :-)
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Chiwan Park
>> >>>>>>>
>> >>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>> >>>>>>>>
>> >>>>>>>> Awesome work guys!
>> >>>>>>>> And even more thanks for the detailed report... This troubleshooting
>> >>>>>>>> summary will be undoubtedly useful for all our Maven projects!
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Flavio
>> >>>>>>>>
>> >>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>> >>>>>>>>>
>> >>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>> >>>>>>>>>> Hi all!
>> >>>>>>>>>>
>> >>>>>>>>>> After a few weeks of terrible build issues, I am happy to announce
>> >>>>>>>>>> that the build works properly again, and we actually get meaningful
>> >>>>>>>>>> CI results.
>> >>>>>>>>>>
>> >>>>>>>>>> Here is a story in many acts, from builds deep red to bright green
>> >>>>>>>>>> joy. Kudos to Max, who did most of this troubleshooting. This
>> >>>>>>>>>> evening, Max and I debugged the final issue and got the build back
>> >>>>>>>>>> on track.
>> >>>>>>>>>>
>> >>>>>>>>>> ------------------
>> >>>>>>>>>> The Journey
>> >>>>>>>>>> ------------------
>> >>>>>>>>>>
>> >>>>>>>>>> (1) Failsafe Plugin
>> >>>>>>>>>>
>> >>>>>>>>>> The Maven Failsafe Plugin had a critical bug due to which failed
>> >>>>>>>>>> tests did not result in a failed build.
>> >>>>>>>>>>
>> >>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>> >>>>>>>>>> tests and fail the build if a test fails.
>> >>>>>>>>>>
>> >>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>> >>>>>>>>>>
>> >>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did
>> >>>>>>>>>> not interoperate with dependency shading any more.
>> >>>>>>>>>>
>> >>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>> >>>>>>>>>>
>> >>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>> >>>>>>>>>> which needed to be fixed.
>> >>>>>>>>>>
>> >>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>> >>>>>>>>>>
>> >>>>>>>>>> In the meantime, a pull request was merged that moved the Yarn
>> >>>>>>>>>> tests to the test scope. Because the configuration searched for
>> >>>>>>>>>> tests in the "main" scope, no Yarn tests were executed for a
>> >>>>>>>>>> while, until the scope was fixed.
>> >>>>>>>>>>
>> >>>>>>>>>> (5) Yarn Tests and JMX Metrics
>> >>>>>>>>>>
>> >>>>>>>>>> After the Yarn tests were re-activated, we saw them fail due to
>> >>>>>>>>>> warnings created by the newly introduced metrics code. We could
>> >>>>>>>>>> fix that by updating the metrics code and temporarily not
>> >>>>>>>>>> registering JMX beans for all metrics.
>> >>>>>>>>>>
>> >>>>>>>>>> (6) Yarn / Surefire Deadlock
>> >>>>>>>>>>
>> >>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>> >>>>>>>>>> the IDE). It turned out that those test a command line interface
>> >>>>>>>>>> that interacts with the standard input stream.
>> >>>>>>>>>>
>> >>>>>>>>>> The newly deployed Surefire Plugin uses standard input as well,
>> >>>>>>>>>> for communication with forked JVMs. Since Surefire internally
>> >>>>>>>>>> locks the standard input stream, the Yarn CLI cannot poll the
>> >>>>>>>>>> standard input stream without locking up and stalling the tests.
>> >>>>>>>>>>
>> >>>>>>>>>> We adjusted the tests and now the build happily builds again.
>> >>>>>>>>>>
>> >>>>>>>>>> -----------------
>> >>>>>>>>>> Conclusions:
>> >>>>>>>>>> -----------------
>> >>>>>>>>>>
>> >>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the
>> >>>>>>>>>> fallout of having a period of unreliable CI.
>> >>>>>>>>>>
>> >>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>> >>>>>>>>>> started our problem should not occur in a test plugin like
>> >>>>>>>>>> Surefire. Also, the constant change of semantics and dependency
>> >>>>>>>>>> scopes is annoying. The semantic changes are subtle, but for a
>> >>>>>>>>>> build as complex as Flink, they make a difference.
>> >>>>>>>>>>
>> >>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>> >>>>>>>>>> Failsafe Plugin was caused by improper file-based communication,
>> >>>>>>>>>> and so were some of our discovered instabilities.
>> >>>>>>>>>>
>> >>>>>>>>>> Greetings,
>> >>>>>>>>>> Stephan
>> >>>>>>>>>>
>> >>>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
>> >>>>>>>>>> allow our metrics subsystem to register JMX beans, we see some
>> >>>>>>>>>> tests failing due to spontaneous JVM process kills. Whoever has a
>> >>>>>>>>>> pointer there, please ping us!
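(Regarding point (6) in Stephan's summary: one way such tests can avoid polling System.in directly is to let the CLI read from an injected stream, so a test never touches the stream that Surefire locks in the forked JVM. A small sketch; the class and method names are illustrative, not the actual Flink CLI classes.)

import java.io.{ByteArrayInputStream, InputStream}

// Sketch only: a CLI prompt that reads from an injected stream rather than
// System.in, so tests can feed it canned input.
class PromptingCliSketch(in: InputStream) {
  def confirmShutdown(): Boolean = {
    val lines = scala.io.Source.fromInputStream(in).getLines()
    lines.hasNext && lines.next().trim.equalsIgnoreCase("y")
  }
}

object PromptingCliSketchTest {
  def main(args: Array[String]): Unit = {
    // Production code would pass System.in; the test passes a fixture instead.
    val cli = new PromptingCliSketch(new ByteArrayInputStream("y\n".getBytes("UTF-8")))
    assert(cli.confirmShutdown(), "expected the sketch CLI to read 'y' from the fixture")
  }
}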