[jira] [Created] (FLUME-2625) There are several unstable tests within FLUME

Tony Reix (JIRA) Mon, 16 Feb 2015 03:27:45 -0800

Tony Reix created FLUME-2625:
--------------------------------

             Summary: There are several unstable tests within FLUME
                 Key: FLUME-2625
                 URL: https://issues.apache.org/jira/browse/FLUME-2625
             Project: Flume
          Issue Type: Bug
          Components: Test
    Affects Versions: v1.5.0.1
         Environment: RHEL 7.1 / x86_64 / Open JDK 1.7
            Reporter: Tony Reix



Hi,

I'm working on porting FLUME to a RHEL 7.1 / PPC64LE / IBM JVM 1.7 environment.
As an example, I've found that the test .source.TestSyslogUdpSource fails 
intermittently there: 7 times out of 10 tries. Testing on RHEL 7.1 / x86_64 / 
IBM JVM, I've also had random failures.
Running the same .source.TestSyslogUdpSource test in the RHEL 7.1 / x86_64 / 
Open JDK 1.7 environment, I've found that it fails only once out of 30 tries: 
it is an "unstable" test.

In order to find which test failures are specific to the PPC64 or IBM JVM 
environment, I've run all the FLUME tests 10 times in the RHEL 7.1 / x86_64 / 
Open JDK 1.7 environment, which I call my "reference" environment.

Then, using a tool that compares all the results, I've found that 16 tests are 
"unstable" in my "reference" environment (x86_64/OpenJDK).
By "unstable", I mean that the results vary from run to run, though the 
environment is exactly the same.
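
To make the definition of "unstable" concrete, here is a minimal sketch of 
such a comparison. It is not the tool I used: the class name UnstableTests, 
the method findUnstable and the input shape (one map of test name to 
pass/fail per run) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, written to compile on Java 7 (no lambdas).
public class UnstableTests {

    /**
     * Each element of 'runs' describes one full run of the test suite:
     * test name -> true (passed) or false (failed).
     * Returns, in sorted order, the tests whose outcome differs between runs.
     * A test that is simply missing from some runs is not flagged.
     */
    public static List<String> findUnstable(List<Map<String, Boolean>> runs) {
        // Collect the set of distinct outcomes seen for every test name.
        Map<String, Set<Boolean>> outcomes = new TreeMap<String, Set<Boolean>>();
        for (Map<String, Boolean> run : runs) {
            for (Map.Entry<String, Boolean> entry : run.entrySet()) {
                Set<Boolean> seen = outcomes.get(entry.getKey());
                if (seen == null) {
                    seen = new HashSet<Boolean>();
                    outcomes.put(entry.getKey(), seen);
                }
                seen.add(entry.getValue());
            }
        }
        // A test is "unstable" if it both passed and failed at least once.
        List<String> unstable = new ArrayList<String>();
        for (Map.Entry<String, Set<Boolean>> entry : outcomes.entrySet()) {
            if (entry.getValue().size() > 1) {
                unstable.add(entry.getKey());
            }
        }
        return unstable;
    }
}
```

With 10 runs as input, a test that fails even once and passes otherwise ends 
up in the returned list.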

These tests are:

.api.TestLoadBalancingRpcClient
.api.TestThriftRpcClient
.channel.file.TestFileChannelRestart
.channel.TestSpillableMemoryChannel
.instrumentation.http.TestHTTPMetricsServer
.sink.TestAvroSink
.sink.TestThriftSink
.source.avroLegacy.TestLegacyAvroSource
.source.http.TestHTTPSource
.source.TestAvroSource
.source.TestExecSource
.source.TestMultiportSyslogTCPSource
.source.TestSyslogTcpSource
.source.TestSyslogUdpSource
.source.TestThriftSource
.source.thriftLegacy.TestThriftLegacySource

About the ".source.TestSyslogUdpSource" test, my analysis is that the test code 
is not reliable: it checks that some data is correct without first checking 
that all the "messages" have arrived (sometimes a message has not arrived in 
time, and a reference is NULL).
After adding "sleep(1000)" to the test, with the IBM JVM, the test still 
failed, but only 3 times out of 10.
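
A fixed sleep only makes the race less likely. One possible alternative, 
sketched below, is to poll for the expected condition until a deadline and 
only then run the assertions. This is not code from the FLUME test suite: the 
class WaitUtil, the interface Check and the method waitFor are hypothetical 
names, and the way a real test would count received messages is left out.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical test helper, written to compile on Java 7.
public class WaitUtil {

    /** Condition to poll, e.g. "all expected messages have been received". */
    public interface Check {
        boolean isMet();
    }

    /**
     * Polls 'check' every 'pollMillis' until it is met or 'timeoutMillis'
     * has elapsed. Returns true if the condition was met in time.
     */
    public static boolean waitFor(Check check, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (check.isMet()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out: the caller should fail the test
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

A test would then first assert that waitFor(...) returned true, and check the 
message contents only afterwards, so that a late message shows up as a clear 
timeout failure instead of a NULL reference.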

So, I think that several FLUME tests are coded in a way that is not 100% 
reliable. Or it could also be that some core code of FLUME is not 100% reliable.

I mean that some code may have been written based on the specific behaviour 
of the OpenJDK Java Virtual Machine, which was used for testing. A change in 
the order in which threads are started, or in the time needed to send messages 
through the JVM/OS, may lead to situations that are not correctly handled by 
the code (mainly test code, but maybe core code too). And it seems that the 
IBM JVM, though perfectly correct, does not behave the same way as OpenJDK.
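
To illustrate the kind of thread-ordering assumption I have in mind, here is 
a small self-contained sketch. It does not use any FLUME class: Receiver is a 
hypothetical stand-in for a UDP-like listener that drops anything sent before 
it is listening, and StartOrderDemo and sendWhenReady are made-up names. The 
sender waits on a CountDownLatch instead of assuming that the receiver thread 
has already run.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo, written to compile on Java 7.
public class StartOrderDemo {

    /** Stand-in for a UDP-like receiver: early messages are silently lost. */
    static class Receiver {
        private final CountDownLatch ready = new CountDownLatch(1);
        private final BlockingQueue<String> inbox = new LinkedBlockingQueue<String>();
        private volatile boolean listening = false;

        void startAsync() {
            new Thread(new Runnable() {
                public void run() {
                    listening = true;   // models "socket is bound"
                    ready.countDown();  // tell waiters that sending is now safe
                }
            }).start();
        }

        boolean awaitReady(long timeoutMillis) throws InterruptedException {
            return ready.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        void deliver(String message) {
            if (listening) {
                inbox.add(message);     // otherwise dropped, as with UDP
            }
        }

        String poll(long timeoutMillis) throws InterruptedException {
            return inbox.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }

    /** Sends only once the receiver is ready; returns what it received, or null. */
    public static String sendWhenReady(String message) {
        Receiver receiver = new Receiver();
        receiver.startAsync();
        try {
            if (!receiver.awaitReady(5000)) {
                return null;            // receiver never came up
            }
            receiver.deliver(message);
            return receiver.poll(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

Without the awaitReady() call, whether the message is received depends on how 
quickly the JVM and OS schedule the receiver thread, which is exactly the kind 
of behaviour that can differ between two correct JVMs.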

So, this is a pain, mainly in my PPC64LE/IBMJVM environment.
I think that these 16 tests must be analysed and improved.
Also, running the tests with both OpenJDK and the IBM JVM in your development 
and test/Jenkins environments would help to expose these random issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
