With the recent fixes, the builds are more stable, but I still see many of them failing because of the Scala shell tests, which lead to JVM crashes. I've looked into this a little, but didn't find an obvious solution to the problem.

Does it make sense to disable the tests until someone has time to look into it?
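If we do disable them for now, here is a minimal sketch of what that could look like (assuming the shell suite is JUnit-based like most of our ITCases; the class and test names below are only illustrative, not the real ones):

import org.junit.{Ignore, Test}

// Illustrative only: not the actual flink-scala-shell test class.
// A class-level @Ignore makes JUnit skip every test in the suite, so the
// tests stay in the tree and can be re-enabled by removing a single line.
@Ignore("Temporarily disabled: the Scala shell tests crash the JVM on CI")
class ScalaShellSmokeTestSketch {

  @Test
  def startAndStopShell(): Unit = {
    // the existing test bodies would stay untouched here
  }
}

That would keep the rest of the CI signal usable while someone digs into the JVM crashes.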
– Ufuk

On Tue, May 31, 2016 at 1:46 PM, Stephan Ewen <se...@apache.org> wrote:
> You are right, Chiwan.
>
> I think that this pattern you use should be supported, though. Would be
> good to check if the job executes more often than necessary at the point
> of the "collect()" calls. That would explain the network buffer issue then...
>
> On Tue, May 31, 2016 at 12:18 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> Hi Stephan,
>>
>> Yes, right. But KNNITSuite calls
>> ExecutionEnvironment.getExecutionEnvironment only once [1]. I’m testing
>> with moving the getExecutionEnvironment call into each test case.
>>
>> [1]:
>> https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
>>
>> Regards,
>> Chiwan Park
>>
>> > On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>> >
>> > Hi Chiwan!
>> >
>> > I think the ExecutionEnvironment is not shared, because what the
>> > TestEnvironment sets is a Context Environment Factory. Every time you
>> > call "ExecutionEnvironment.getExecutionEnvironment()", you get a new
>> > environment.
>> >
>> > Stephan
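(For illustration, a minimal sketch of the pattern Chiwan describes above, i.e. calling getExecutionEnvironment inside each test case instead of once for the whole suite. The suite name and the data are made up; this is not the actual KNNITSuite code.)

import org.apache.flink.api.scala._
import org.scalatest.{FlatSpec, Matchers}

// Sketch only: each test case asks for its own environment instead of the
// suite sharing a single val across all cases.
class PerTestEnvironmentSketch extends FlatSpec with Matchers {

  behavior of "a suite that does not share its ExecutionEnvironment"

  it should "get a fresh environment in the first test case" in {
    val env = ExecutionEnvironment.getExecutionEnvironment  // fresh per test
    env.fromElements(1, 2, 3).collect() should have size 3
  }

  it should "get another fresh environment in the second test case" in {
    val env = ExecutionEnvironment.getExecutionEnvironment  // not reused from above
    env.fromElements("a", "b").collect() should have size 2
  }
}

(As Stephan notes, each of those calls goes through the context environment factory that the TestEnvironment installs, so the two cases end up with separate environments.)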
>> >
>> > On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >
>> >> I’ve created a JIRA issue [1] related to the KNN test cases. I will
>> >> send a PR for it.
>> >>
>> >> From my investigation [2], the cluster for the ML tests has only one
>> >> taskmanager with 4 slots. Is 2048 insufficient as the total number of
>> >> network buffers? I still think the problem is sharing the
>> >> ExecutionEnvironment between test cases.
>> >>
>> >> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>> >> [2]:
>> >> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>> >>
>> >> Regards,
>> >> Chiwan Park
>> >>
>> >>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>> >>>
>> >>> Thanks Stephan for the synopsis of our last weeks' test instability
>> >>> madness. It's sad to see the shortcomings of the Maven test plugins,
>> >>> but another lesson learned is that our testing infrastructure should
>> >>> get a bit more attention. We have reached a point several times where
>> >>> our tests were inherently unstable. Now we saw that even more problems
>> >>> were hidden in the dark. I would like to see more maintenance
>> >>> dedicated to testing.
>> >>>
>> >>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>> >>> request with a systematic fix. Those things are too crucial to be
>> >>> fixed on the go. The problem is that Travis reports the number of
>> >>> processors to be "32" (which is used for the number of task slots in
>> >>> local execution). The network buffers are not adjusted accordingly.
>> >>> We should set them correctly in the MiniCluster. Also, we could
>> >>> define an upper limit to the number of task slots for tests.
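(A sketch of what Max's two suggestions could look like in the test cluster setup: cap the slot count instead of trusting the reported core count, and size the network buffers to match. The string keys are the usual 1.x configuration keys; the buffer formula is only an illustration, not a tuned value.)

import org.apache.flink.configuration.Configuration

// Sketch only: build a test-cluster configuration with an upper limit on the
// task slots and a matching number of network buffers.
object TestClusterConfigSketch {

  def buildConfig(reportedCores: Int): Configuration = {
    val conf = new Configuration()
    val slots = math.min(reportedCores, 4)          // upper limit for CI machines
    val taskManagers = 1
    val buffers = slots * slots * taskManagers * 4  // rough per-slot buffer budget

    conf.setInteger("taskmanager.numberOfTaskSlots", slots)
    conf.setInteger("taskmanager.network.numberOfBuffers", buffers)
    conf
  }
}

(FlinkTestBase, linked by Chiwan above, would be the natural place to apply such a cap for the ML tests.)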
>> >>>
>> >>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>> I think that the tests fail because of sharing the ExecutionEnvironment
>> >>>> between test cases. I’m not sure why it is a problem, but it is the
>> >>>> only difference from the other ML tests.
>> >>>>
>> >>>> I created a hotfix and pushed it to my repository. When it seems fixed
>> >>>> [1], I’ll merge the hotfix to the master branch.
>> >>>>
>> >>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>> >>>>
>> >>>> Regards,
>> >>>> Chiwan Park
>> >>>>
>> >>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>>>
>> >>>>> Maybe it is about the KNN test case which was merged yesterday.
>> >>>>> I’ll look into the ML test.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Chiwan Park
>> >>>>>
>> >>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Currently, an ML test is reliably failing and occasionally some HA
>> >>>>>> tests. Is someone looking into the ML test?
>> >>>>>>
>> >>>>>> For HA, I will revert a commit, which might cause the HA
>> >>>>>> instabilities. Till is working on a proper fix as far as I know.
>> >>>>>>
>> >>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> >>>>>>> Thanks for the great work! :-)
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Chiwan Park
>> >>>>>>>
>> >>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>> >>>>>>>>
>> >>>>>>>> Awesome work guys!
>> >>>>>>>> And even more thanks for the detailed report... This troubleshooting
>> >>>>>>>> summary will be undoubtedly useful for all our Maven projects!
>> >>>>>>>>
>> >>>>>>>> Best,
>> >>>>>>>> Flavio
>> >>>>>>>>
>> >>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>> >>>>>>>>
>> >>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>> >>>>>>>>>
>> >>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>> >>>>>>>>>> Hi all!
>> >>>>>>>>>>
>> >>>>>>>>>> After a few weeks of terrible build issues, I am happy to announce
>> >>>>>>>>>> that the build works properly again, and we actually get meaningful
>> >>>>>>>>>> CI results.
>> >>>>>>>>>>
>> >>>>>>>>>> Here is a story in many acts, from builds deep red to bright green
>> >>>>>>>>>> joy. Kudos to Max, who did most of this troubleshooting. This
>> >>>>>>>>>> evening, Max and I debugged the final issue and got the build back
>> >>>>>>>>>> on track.
>> >>>>>>>>>>
>> >>>>>>>>>> ------------------
>> >>>>>>>>>> The Journey
>> >>>>>>>>>> ------------------
>> >>>>>>>>>>
>> >>>>>>>>>> (1) Failsafe Plugin
>> >>>>>>>>>>
>> >>>>>>>>>> The Maven Failsafe Plugin had a critical bug due to which failed
>> >>>>>>>>>> tests did not result in a failed build.
>> >>>>>>>>>>
>> >>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>> >>>>>>>>>> tests and fail the build if a test fails.
>> >>>>>>>>>>
>> >>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>> >>>>>>>>>>
>> >>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did
>> >>>>>>>>>> not interoperate with dependency shading any more.
>> >>>>>>>>>>
>> >>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>> >>>>>>>>>>
>> >>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>> >>>>>>>>>>
>> >>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>> >>>>>>>>>> which needed to be fixed.
>> >>>>>>>>>>
>> >>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>> >>>>>>>>>>
>> >>>>>>>>>> In the meantime, a pull request was merged that moved the Yarn
>> >>>>>>>>>> tests to the test scope. Because the configuration searched for
>> >>>>>>>>>> tests in the "main" scope, no Yarn tests were executed for a
>> >>>>>>>>>> while, until the scope was fixed.
>> >>>>>>>>>>
>> >>>>>>>>>> (5) Yarn Tests and JMX Metrics
>> >>>>>>>>>>
>> >>>>>>>>>> After the Yarn tests were re-activated, we saw them fail due to
>> >>>>>>>>>> warnings created by the newly introduced metrics code. We could
>> >>>>>>>>>> fix that by updating the metrics code and temporarily not
>> >>>>>>>>>> registering JMX beans for all metrics.
>> >>>>>>>>>>
>> >>>>>>>>>> (6) Yarn / Surefire Deadlock
>> >>>>>>>>>>
>> >>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>> >>>>>>>>>> the IDE). It turned out that those test a command line interface
>> >>>>>>>>>> that interacts with the standard input stream.
>> >>>>>>>>>>
>> >>>>>>>>>> The newly deployed Surefire Plugin uses standard input as well,
>> >>>>>>>>>> for communication with forked JVMs. Since Surefire internally
>> >>>>>>>>>> locks the standard input stream, the Yarn CLI cannot poll the
>> >>>>>>>>>> standard input stream without locking up and stalling the tests.
>> >>>>>>>>>>
>> >>>>>>>>>> We adjusted the tests and now the build happily builds again.
>> >>>>>>>>>>
>> >>>>>>>>>> -----------------
>> >>>>>>>>>> Conclusions:
>> >>>>>>>>>> -----------------
>> >>>>>>>>>>
>> >>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the
>> >>>>>>>>>> fallout of having a period of unreliable CI.
>> >>>>>>>>>>
>> >>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>> >>>>>>>>>> started our problem should not occur in a test plugin like
>> >>>>>>>>>> Surefire. Also, the constant change of semantics and dependency
>> >>>>>>>>>> scopes is annoying. The semantic changes are subtle, but for a
>> >>>>>>>>>> build as complex as Flink, they make a difference.
>> >>>>>>>>>>
>> >>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>> >>>>>>>>>> Failsafe Plugin was caused by improper file-based communication,
>> >>>>>>>>>> and so were some of our discovered instabilities.
>> >>>>>>>>>>
>> >>>>>>>>>> Greetings,
>> >>>>>>>>>> Stephan
>> >>>>>>>>>>
>> >>>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
>> >>>>>>>>>> allow our metrics subsystem to register JMX beans, we see some
>> >>>>>>>>>> tests failing due to spontaneous JVM process kills. Whoever has a
>> >>>>>>>>>> pointer there, please ping us!
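(Regarding point (6) in Stephan's summary: one way such tests can avoid polling System.in directly is to let the CLI read from an injected stream, so a test never touches the stream that Surefire locks in the forked JVM. A small sketch; the class and method names are illustrative, not the actual Flink CLI classes.)

import java.io.{ByteArrayInputStream, InputStream}

// Sketch only: a CLI prompt that reads from an injected stream rather than
// System.in, so tests can feed it canned input.
class PromptingCliSketch(in: InputStream) {
  def confirmShutdown(): Boolean = {
    val lines = scala.io.Source.fromInputStream(in).getLines()
    lines.hasNext && lines.next().trim.equalsIgnoreCase("y")
  }
}

object PromptingCliSketchTest {
  def main(args: Array[String]): Unit = {
    // Production code would pass System.in; the test passes a fixture instead.
    val cli = new PromptingCliSketch(new ByteArrayInputStream("y\n".getBytes("UTF-8")))
    assert(cli.confirmShutdown(), "expected the sketch CLI to read 'y' from the fixture")
  }
}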