Robert Kanter created OOZIE-2385:
------------------------------------
Summary:
org.apache.oozie.TestCoordinatorEngineStreamLog.testCoordLogStreaming is flakey
Key: OOZIE-2385
URL: https://issues.apache.org/jira/browse/OOZIE-2385
Project: Oozie
Issue Type: Bug
Components: tests
Affects Versions: trunk
Reporter: Robert Kanter
Assignee: Robert Kanter
Fix For: trunk
{{org.apache.oozie.TestCoordinatorEngineStreamLog.testCoordLogStreaming}} is
flakey.
It can fail with an error like this:
{noformat}
junit.framework.ComparisonFailure: expected:<Tue Oct 06 16:27:2[5] PDT 2015>
but was:<Tue Oct 06 16:27:2[4] PDT 2015>
at junit.framework.Assert.assertEquals(Assert.java:85)
at junit.framework.Assert.assertEquals(Assert.java:91)
at
org.apache.oozie.TestCoordinatorEngineStreamLog.__CLR4_0_38acukov0f(TestCoordinatorEngineStreamLog.java:202)
at
org.apache.oozie.TestCoordinatorEngineStreamLog.testCoordLogStreaming(TestCoordinatorEngineStreamLog.java:103)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:243)
at junit.framework.TestSuite.run(TestSuite.java:238)
at
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:24)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
This was tricky to track down. The problem is essentially that the filtering
API uses minute granularity but the actions are seconds, or even milliseconds
apart, so there are some edge cases here. The code takes minute granularity,
goes back to milliseconds when querying the DB, and then uses seconds when
checking that the test worked correctly.
This can lead to a situation like this example:
\@1 created/nominal time: Oct 06 16:27:24 <and some milliseconds>
\@2 created/nominal time: Oct 06 16:27:25 <and some milliseconds>
The test passes \@2's created time as the starting time argument to the filter,
converted with {{DateUtill's}} UTC MASK (i.e. minutes granularity):
- Oct 06 16:27
This gets parsed back to a Date object, which seconds/milliseconds assumed to
be 0's
- Oct 06 16:27:00 0000
This gets turned into a TimeStamp object, and used in the DB filtering query
- gets all actions with created time > Oct 06 16:27:00, which includes \@1
CoordinatorEngine sorts by action ID (i.e. \@#) and uses created time of
smallest, which is \@1
- Oct 06 16:27:24
The test checks that \@2’s created time is equal to previously determined time
from filter (i.e. \@1’s created time)
- Does a toString(), which uses seconds —-> not equal
Note: The filter querying is actually done against the nominal times, but the
example is simpler without all that and conveys the same idea.
The actions get created within 100s of milliseconds during the test (at least
on my machine), so most of them time, they have the same seconds, and the
toString() comparison is equal. When the timing works out that the first and
second action are created at different clock seconds (even if less than a
second apart), it fails. I couldn’t reproduce this on my machine, but is more
likely to happen on the generally slower Jenkins machines we run because the
time between action creations will be longer.
Basically the problem is that the filtering works at a minute granularity but
the actions are seconds, if not milliseconds, apart. In the real world this
isn’t a big deal because the API takes minutes and we simply use that, but to
verify the test is working we should simply make the actions minutes apart. To
fix this, we can just directly insert the job and actions into the DB instead
of actually running it so we can space the actions apart by minutes; this
should also make the test faster as well (the test takes about 6 minutes on my
machine as-is, but only 1 minute and 30 seconds with the fix).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)