Tony Reix created FLUME-2625:
--------------------------------
Summary: There are several unstable tests within FLUME
Key: FLUME-2625
URL: https://issues.apache.org/jira/browse/FLUME-2625
Project: Flume
Issue Type: Bug
Components: Test
Affects Versions: v1.5.0.1
Environment: RHEL 7.1 / x86_64 / Open JDK 1.7
Reporter: Tony Reix
Hi,
I'm working on porting FLUME to a RHEL 7.1 / PPC64LE / IBM JVM 1.7 environment.
As an example, I've found that the test .source.TestSyslogUdpSource fails
intermittently: 7 times out of 10 tries. Testing on RHEL 7.1 / x86_64 / IBM
JVM, I've also seen random failures.
Running the same .source.TestSyslogUdpSource test in RHEL 7.1 / x86_64 / Open
JDK 1.7 environment, I've found that this test fails only once out of 30 tries:
it is an "unstable" test.
In order to find which test issues are specific to the PPC64 or IBM JVM
environment, I've run all the FLUME tests 10 times in the RHEL 7.1 / x86_64 /
Open JDK 1.7 environment, which I call my "reference" environment.
Then, using a tool that compares all the results, I've found that 16 tests are
"unstable" in my "reference" environment (x86_64/OpenJDK).
By "unstable", I mean that the results vary from run to run, even though the
environment is exactly the same.
These tests are:
.api.TestLoadBalancingRpcClient
.api.TestThriftRpcClient
.channel.file.TestFileChannelRestart
.channel.TestSpillableMemoryChannel
.instrumentation.http.TestHTTPMetricsServer
.sink.TestAvroSink
.sink.TestThriftSink
.source.avroLegacy.TestLegacyAvroSource
.source.http.TestHTTPSource
.source.TestAvroSource
.source.TestExecSource
.source.TestMultiportSyslogTCPSource
.source.TestSyslogTcpSource
.source.TestSyslogUdpSource
.source.TestThriftSource
.source.thriftLegacy.TestThriftLegacySource
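To make the notion of "unstable" concrete, here is a minimal sketch of the kind
of comparison described above. It is not the tool that was actually used: the
class name, the method name and the pass/fail data are made up for
illustration, and ".example.TestAlwaysPasses" is a hypothetical test name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UnstableTests {

    /**
     * Each map is one full run: test name -> true if the test passed.
     * A test is reported as unstable when it has both passed and failed
     * across the runs.
     */
    public static Set<String> findUnstable(List<Map<String, Boolean>> runs) {
        Map<String, Set<Boolean>> outcomes = new HashMap<String, Set<Boolean>>();
        for (Map<String, Boolean> run : runs) {
            for (Map.Entry<String, Boolean> e : run.entrySet()) {
                Set<Boolean> seen = outcomes.get(e.getKey());
                if (seen == null) {
                    seen = new HashSet<Boolean>();
                    outcomes.put(e.getKey(), seen);
                }
                seen.add(e.getValue());
            }
        }
        // Sorted set, so the report is in a stable order.
        Set<String> unstable = new TreeSet<String>();
        for (Map.Entry<String, Set<Boolean>> e : outcomes.entrySet()) {
            if (e.getValue().size() > 1) {
                unstable.add(e.getKey());
            }
        }
        return unstable;
    }

    public static void main(String[] args) {
        // Illustrative data only, not real results.
        Map<String, Boolean> run1 = new HashMap<String, Boolean>();
        run1.put(".source.TestSyslogUdpSource", true);
        run1.put(".example.TestAlwaysPasses", true);
        Map<String, Boolean> run2 = new HashMap<String, Boolean>();
        run2.put(".source.TestSyslogUdpSource", false);
        run2.put(".example.TestAlwaysPasses", true);
        System.out.println(findUnstable(Arrays.asList(run1, run2)));
    }
}
```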
Regarding the ".source.TestSyslogUdpSource" test, my analysis is that the test
code is not reliable: it checks that some data is correct without first
checking that all the "messages" have arrived (sometimes, a message has not
arrived in time, and a reference is NULL).
After adding "sleep(1000)" to the test, with the IBM JVM, the test failed only
3 times out of 10.
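A fixed sleep only makes the race less likely. One possible alternative, shown
here as a sketch and not as actual FLUME test code, is to poll until the
expected number of messages has arrived (or a timeout expires) before checking
their content. The class AwaitMessages, the method awaitCount and the sender
thread are hypothetical stand-ins for the real test and source.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

public class AwaitMessages {

    /**
     * Polls until the list holds at least `expected` entries or the timeout
     * expires. Returns true only if all entries arrived in time.
     */
    public static boolean awaitCount(List<?> received, int expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (received.size() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        final List<String> received = new CopyOnWriteArrayList<String>();
        // Stand-in for the source under test: delivers messages asynchronously.
        Thread sender = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 3; i++) {
                        Thread.sleep(50);
                        received.add("msg-" + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        sender.start();

        if (!awaitCount(received, 3, 5000)) {
            throw new AssertionError("only " + received.size() + " of 3 messages arrived");
        }
        // Only now is it safe to inspect the content of the messages.
        System.out.println(received);
        sender.join();
    }
}
```

With this approach the test fails with an explicit "messages did not arrive"
error instead of a NULL reference, and a slow JVM or OS only makes the test
slower, up to the timeout.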
So, I think that several FLUME tests are coded in a way that is not 100%
reliable. Or it could also be that some core code of FLUME is not 100% reliable.
I mean that some code may have been written based on the specific behaviour of
the OpenJDK Java Virtual Machine, which was used for testing. A change in the
order in which threads are launched, or in the time needed to send messages
through the JVM/OS, may lead to issues that are not correctly handled by the
code (mainly test code, but maybe core code too). And it seems that, although
it is perfectly correct, the IBM JVM does not behave the same way as OpenJDK.
So, this is a pain, mainly in my PPC64LE / IBM JVM environment.
I think that these 16 tests must be analysed and improved.
Also, running tests with OpenJDK AND IBM JVM in your development and
test/Jenkins environments would help to see these random issues.