[ https://issues.apache.org/jira/browse/FLUME-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327321#comment-14327321 ]
Santiago M. Mola commented on FLUME-2625:
-----------------------------------------

[~trex58] Could you provide specific errors (with stack traces) for TestLoadBalancingRpcClient and TestThriftClient?

> There are several unstable tests within FLUME
> ---------------------------------------------
>
>                 Key: FLUME-2625
>                 URL: https://issues.apache.org/jira/browse/FLUME-2625
>             Project: Flume
>          Issue Type: Bug
>          Components: Test
>    Affects Versions: v1.5.0.1
>         Environment: RHEL 7.1 / x86_64 / Open JDK 1.7
>            Reporter: Tony Reix
>
> Hi,
> I'm working on porting FLUME to a RHEL 7.1 / PPC64LE / IBM JVM 1.7
> environment.
> As an example, I've found that the test .source.TestSyslogUdpSource fails
> intermittently: 7 times out of 10 tries. Testing on RHEL 7.1 / x86_64 /
> IBM JVM, I've also seen random failures.
> Running the same .source.TestSyslogUdpSource test in a RHEL 7.1 / x86_64 /
> Open JDK 1.7 environment, I've found that this test fails only once out of
> 30 tries: it is an "unstable" test.
> To find which test issues are specific to the PPC64 or IBM JVM environment,
> I ran the full FLUME test suite 10 times in the RHEL 7.1 / x86_64 /
> Open JDK 1.7 environment, which I call my "reference" environment.
> Then, using a tool that compares all the results, I found 16 tests that
> are "unstable" in my "reference" (x86_64/OpenJDK) environment.
> By "unstable", I mean that the results vary from run to run even though the
> environment is exactly the same.
> These tests are:
> .api.TestLoadBalancingRpcClient
> .api.TestThriftRpcClient
> .channel.file.TestFileChannelRestart
> .channel.TestSpillableMemoryChannel
> .instrumentation.http.TestHTTPMetricsServer
> .sink.TestAvroSink
> .sink.TestThriftSink
> .source.avroLegacy.TestLegacyAvroSource
> .source.http.TestHTTPSource
> .source.TestAvroSource
> .source.TestExecSource
> .source.TestMultiportSyslogTCPSource
> .source.TestSyslogTcpSource
> .source.TestSyslogUdpSource
> .source.TestThriftSource
> .source.thriftLegacy.TestThriftLegacySource
> Regarding the .source.TestSyslogUdpSource test, my analysis is that the test
> code is not reliable: it checks that some data is correct without first
> checking that all the "messages" have arrived (sometimes a message has not
> arrived in time, and a reference is null).
> After adding sleep(1000) to the test, with the IBM JVM the test failed only
> 3 times out of 10.
> So I think several FLUME tests are written in a way that is not 100%
> reliable. It could also be that some FLUME core code is not 100% reliable.
> That is, some code may have been written around the specific behaviour of
> the OpenJDK Java Virtual Machine that was used for testing.
> A change in the order in which threads are launched, or in the time needed
> to send messages through the JVM/OS, may lead to issues that the code does
> not handle correctly (mainly test code, but maybe core code too).
> And it seems that the IBM JVM, though perfectly correct, does not behave
> the same way as OpenJDK.
> So this is a pain, mainly in my PPC64LE / IBM JVM environment.
> I think that these 16 tests must be analysed and improved.
> Also, running the tests with both OpenJDK AND the IBM JVM in your development
> and test/Jenkins environments would help expose these random issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
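The report does not say how the results-comparison tool works. The classification it describes can be sketched as follows, assuming each test's outcome in every run has already been collected as pass/fail (for example parsed from the Maven test reports; that parsing step is omitted). A test is flagged as unstable only when identical runs produce both passes and failures. The class `UnstableTestFinder` and the sample data in `main` are hypothetical, not part of Flume or of the reporter's tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnstableTestFinder {

    /**
     * Returns the names of tests whose outcome differs between runs.
     * Each map value holds one entry per run: true = passed, false = failed.
     * Hypothetical helper for illustration only.
     */
    public static List<String> findUnstable(Map<String, List<Boolean>> outcomesByTest) {
        List<String> unstable = new ArrayList<>();
        for (Map.Entry<String, List<Boolean>> entry : outcomesByTest.entrySet()) {
            List<Boolean> runs = entry.getValue();
            boolean sawPass = runs.contains(Boolean.TRUE);
            boolean sawFail = runs.contains(Boolean.FALSE);
            // Mixed results in an identical environment => "unstable".
            // A test that always fails is a plain failure, not an unstable one.
            if (sawPass && sawFail) {
                unstable.add(entry.getKey());
            }
        }
        return unstable;
    }

    public static void main(String[] args) {
        // Hypothetical outcomes of four runs for three made-up tests.
        Map<String, List<Boolean>> results = new LinkedHashMap<>();
        results.put("TestA", Arrays.asList(true, true, true, true));
        results.put("TestB", Arrays.asList(true, false, true, true));
        results.put("TestC", Arrays.asList(false, false, false, false));
        System.out.println(findUnstable(results)); // prints [TestB]
    }
}
```

Separating "always fails" from "sometimes fails" is what lets a reference environment isolate platform-specific failures: a test that is unstable on x86_64/OpenJDK too is unlikely to be a PPC64LE or IBM JVM porting problem.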
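The failure mode described for .source.TestSyslogUdpSource (asserting on received data before every message has arrived) is commonly addressed by waiting for the expected number of messages up to a deadline, rather than with a fixed sleep(1000). The following is a minimal sketch, assuming the test can observe delivered messages through a `java.util.concurrent.BlockingQueue`. In the real test the messages travel through the Flume source and channel, so `AwaitAllMessages` and its sender thread are hypothetical stand-ins, not Flume APIs. The code avoids lambdas so it also compiles on JDK 1.7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class AwaitAllMessages {

    /**
     * Takes exactly 'expected' messages from the queue, waiting at most
     * 'timeoutMillis' in total. Fails with a descriptive error instead of
     * letting a missing message surface later as a null reference.
     */
    public static List<String> awaitMessages(BlockingQueue<String> queue, int expected,
                                             long timeoutMillis) throws InterruptedException {
        List<String> received = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (received.size() < expected) {
            long remaining = deadline - System.nanoTime();
            // poll() returns null if nothing arrives before the remaining time runs out
            String msg = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (msg == null) {
                throw new AssertionError("Only " + received.size() + " of " + expected
                        + " messages arrived within " + timeoutMillis + " ms");
            }
            received.add(msg);
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Hypothetical sender standing in for the source under test:
        // it delivers 5 messages asynchronously, with a small delay each.
        Thread sender = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) {
                        Thread.sleep(20);
                        queue.put("msg-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        sender.start();
        List<String> msgs = awaitMessages(queue, 5, 5000);
        System.out.println(msgs); // prints [msg-0, msg-1, msg-2, msg-3, msg-4]
        sender.join();
    }
}
```

A deadline-based wait returns as soon as all messages are present, so it is usually faster than a fixed sleep on a quick machine and more tolerant of a slower JVM or thread schedule; when a message really is missing, the test fails with an explicit count instead of a null reference.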