[
https://issues.apache.org/jira/browse/FLUME-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345103#comment-14345103
]
Tony Reix commented on FLUME-2625:
----------------------------------
Hi Hari, As I said in another JIRA, I have no specific skills with Flume. I'm
mainly involved in testing on PPC64, and thus in warning the community about
instability. As a non-expert in Flume, finding tests that fail on IBM JVM/PPC64
(much more often than on OpenJDK, or always, due to internal differences such
as a faster JVM or a GC that works differently) but not on OpenJDK/x86_64 (or
only about once in 30 or more runs) is a pain. Warning the Flume people was
meant to help. However, I have no useful skills for this and I'm involved in
other testing now.
> There are several unstable tests within FLUME
> ---------------------------------------------
>
> Key: FLUME-2625
> URL: https://issues.apache.org/jira/browse/FLUME-2625
> Project: Flume
> Issue Type: Bug
> Components: Test
> Affects Versions: v1.5.0.1
> Environment: RHEL 7.1 / x86_64 / Open JDK 1.7
> Reporter: Tony Reix
>
> Hi,
> I'm working on porting Flume to a RHEL 7.1 / PPC64LE / IBM JVM 1.7
> environment.
> As an example, I've found that the test .source.TestSyslogUdpSource fails
> intermittently: 7 times out of 10 tries. Testing on RHEL 7.1 / x86_64 /
> IBM JVM, I've also had random failures.
> Running the same .source.TestSyslogUdpSource test in a RHEL 7.1 / x86_64 /
> OpenJDK 1.7 environment, I've found that it fails only once out of 30 tries:
> it is an "unstable" test.
> In order to find which test issues are specific to the PPC64 or IBM JVM
> environment, I've run all the Flume tests 10 times in the RHEL 7.1 / x86_64 /
> OpenJDK 1.7 environment, which I call my "reference" environment.
> Then, using a tool that compares all the results, I've found that 16 tests
> are "unstable" even in my "reference" (x86_64/OpenJDK) environment.
> By "unstable", I mean that the results vary even though the environment is
> exactly the same.
> These tests are:
> .api.TestLoadBalancingRpcClient
> .api.TestThriftRpcClient
> .channel.file.TestFileChannelRestart
> .channel.TestSpillableMemoryChannel
> .instrumentation.http.TestHTTPMetricsServer
> .sink.TestAvroSink
> .sink.TestThriftSink
> .source.avroLegacy.TestLegacyAvroSource
> .source.http.TestHTTPSource
> .source.TestAvroSource
> .source.TestExecSource
> .source.TestMultiportSyslogTCPSource
> .source.TestSyslogTcpSource
> .source.TestSyslogUdpSource
> .source.TestThriftSource
> .source.thriftLegacy.TestThriftLegacySource
> About ".source.TestSyslogUdpSource" test, my analysis is that the test code
> is not reliable since the test checks that some data is correct without
> checking that all the "messages" have arrived (sometimes, a message has not
> arrived in time, and a reference is NULL).
> Adding "sleep(1000) to the test with IBM JVM, the test then failed only 3
> times out of 10.
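> A more robust fix than a fixed sleep(1000) would be to poll the channel until
> the expected number of events has arrived or a timeout expires. The sketch
> below is only an illustration under that assumption: ChannelPoller and
> takeEvents are hypothetical names, not part of the Flume test code, while
> Channel, Event and Transaction are the standard Flume APIs.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.flume.Channel;
> import org.apache.flume.Event;
> import org.apache.flume.Transaction;
>
> /** Illustrative helper, not the actual Flume test code. */
> public final class ChannelPoller {
>
>   /** Takes up to 'expected' events, waiting at most 'timeoutMs' for them. */
>   public static List<Event> takeEvents(Channel channel, int expected,
>       long timeoutMs) throws InterruptedException {
>     List<Event> events = new ArrayList<Event>();
>     long deadline = System.currentTimeMillis() + timeoutMs;
>     while (events.size() < expected && System.currentTimeMillis() < deadline) {
>       Transaction tx = channel.getTransaction();
>       tx.begin();
>       try {
>         Event e = channel.take();   // returns null if no event is available yet
>         if (e != null) {
>           events.add(e);
>         }
>         tx.commit();
>       } catch (RuntimeException ex) {
>         tx.rollback();
>         throw ex;
>       } finally {
>         tx.close();
>       }
>       if (events.size() < expected) {
>         Thread.sleep(50);           // short back-off instead of one fixed sleep(1000)
>       }
>     }
>     return events;                  // caller then asserts events.size() == expected
>   }
>
>   private ChannelPoller() { }
> }
> {code}
> With such a helper, the assertions run only after all expected messages have
> been taken (or the timeout has clearly been exceeded), instead of depending on
> how fast a given JVM delivers UDP packets.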
> So I think that several Flume tests are coded in a way that is not 100%
> reliable, or it could be that some of Flume's core code is not 100% reliable.
> I mean that some code may have been written around the specific behaviour of
> the OpenJDK Java Virtual Machine, which was used for testing.
> A change in the order in which threads are launched, or in the time needed to
> send messages through the JVM/OS, may lead to issues that are not correctly
> handled by the code (mainly test code, but maybe core code too).
> And it seems that, while perfectly correct, the IBM JVM does not behave the
> same way as OpenJDK.
> So this is a pain, mainly in my PPC64LE/IBM JVM environment.
> I think that these 16 tests must be analysed and improved.
> Also, running the tests with both OpenJDK and the IBM JVM in your development
> and test/Jenkins environments would help expose these random issues.