GitHub user srdo opened a pull request:
https://github.com/apache/storm/pull/2127
STORM-2525: Fix flaky integration tests
https://issues.apache.org/jira/browse/STORM-2525
* EvictionPolicies based on watermarks could throw NPEs if they were asked
to decide eviction before the first watermark had arrived. This can happen if
the WindowManager decides to compact the window. Made the policies refuse to
evict events until the first watermark has arrived (see the eviction-policy
sketch after this list).
* TopoWrap readLogs was hiding IOExceptions when reading from the
logviewer, replacing the real log with a dummy. If this happened, the tests
were likely to fail with a misleading error. Made the IOException propagate out
to the test instead.
* The tests now check all the windows produced by the windowing
bolt. I couldn't tell what the previous calculation of the expected number of
windows was aiming for. The original was `final int numberOfWindows =
allBoltLog.size() - windowSec / slideSec;`, but it seems reasonable to
me to just check all the windows instead.
* Made the tests fail if the emit count reported by the built-in metrics
does not reach the expected minEmits within the time limit.
* The windowing integration tests use the built-in metrics to wait
for the bolt or spout to emit enough tuples, and then use the logviewer to read
the worker log. The log entries are then used to determine whether the windowing
bolt produced the right windows. If the log rolls after the test has decided
that enough tuples have been emitted, but before it manages to read the log,
the test is likely to fail, because the current log won't account for the emits
listed in the archived log file. Increased the rate of tuple emission from the
test spouts, but reduced the number of tuples required before the test will
start evaluating the log. This is not a complete fix, but producing 100MB of
log before we emit 500 tuples (the current largest test) seems unlikely.
* The metrics sample rate has been set to 1 when using the TopoWrap class.
One of the TopoWrap methods offers to get the number of tuples emitted by a
component, but the accuracy of the returned number depends on the metrics
sample rate. This was causing test flakiness when using
TopoWrap.waitForProgress (see the config sketch after this list).
* Replaced the vagrant image with a newer Ubuntu image. The older image
seemed to perform poorly, maybe due to older VirtualBox guest additions? The
running time for me went from 40+ minutes to ~15 minutes.
* Removed the cluster.xml in integration-test/config. It was unused.
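
To illustrate the eviction-policy fix from the first bullet, here is a minimal
sketch of the kind of guard involved. The class and field names are simplified
for illustration and do not match Storm's WatermarkTimeEvictionPolicy
one-to-one; the point is only the idea of refusing to evict before the first
watermark:

```java
// Illustrative sketch only, not the actual patched class.
public class WatermarkGuardedEvictionPolicy {

    public enum Action { EXPIRE, PROCESS, STOP }

    // Latest watermark timestamp; -1 means no watermark has arrived yet.
    private long referenceTime = -1L;
    private final long windowLengthMs;

    public WatermarkGuardedEvictionPolicy(long windowLengthMs) {
        this.windowLengthMs = windowLengthMs;
    }

    public void setReferenceTime(long watermarkTs) {
        this.referenceTime = watermarkTs;
    }

    public Action evict(long eventTs) {
        if (referenceTime < 0) {
            // No watermark yet, so there is no reference point to evict
            // against. Keep everything instead of NPEing or expiring early.
            return Action.STOP;
        }
        return eventTs <= referenceTime - windowLengthMs
                ? Action.EXPIRE
                : Action.PROCESS;
    }
}
```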
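
Regarding the metrics sample rate bullet: Storm samples tuple counts by
default, so the emit counts read back via the metrics are estimates. The
sketch below shows the relevant setting applied to a plain topology Config
(the actual change is made inside the TopoWrap helper):

```java
import org.apache.storm.Config;

public class ExactMetricsConfig {
    // Build a Config that samples every tuple, so the emitted/acked counters
    // reported by the built-in metrics are exact rather than estimates.
    public static Config exactCounts() {
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, 1.0d);
        return conf;
    }
}
```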
I've run the integration tests 5 times with no errors, so it seems stable
now.
There is another issue semi-related to this that I'd like some input on.
I'll raise another issue if this behavior seems broken to anyone else:
The windowing docs claim that windowing offers at-least-once processing.
This is true for configurations using tuple timestamps (i.e. those ensuring
that
https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/topology/IWindowedBolt.java#L50
does not return null), but count- and time-based configurations without
tuple timestamps may expire tuples before they've been presented to the user's
bolt in a window. This is because the WindowManager will attempt to expire
tuples if there are more than 100 tuples in the current window (see
https://github.com/apache/storm/blob/a4afacd9617d620f50cf026fc599821f7ac25c79/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java#L50).
The tuples are acked as soon as they're expired (see
https://github.com/apache/storm/blob/a4afacd9617d620f50cf026fc599821f7ac25c79/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java#L215).
When tuple timestamps are used, tuples will only expire when the watermark
moves, and when the watermark moves, the trigger policy ensures all stored
tuples get passed to the user's bolt in an appropriate window. When tuple
timestamps, and thus watermarks, are disabled, tuples may get expired during
compaction regardless of whether the user's bolt has seen those tuples yet.
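To make the difference concrete, here is a rough sketch using the public
BaseWindowedBolt builder methods. The bolt arguments, the "spout" id, the "ts"
field, and the window sizes are placeholders chosen for illustration:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.topology.base.BaseWindowedBolt.Count;
import org.apache.storm.topology.base.BaseWindowedBolt.Duration;

public class WindowingAtLeastOnceExample {
    // countBolt and timestampBolt stand in for any user bolt extending
    // BaseWindowedBolt; "spout", "ts", and the window sizes are placeholders.
    public static TopologyBuilder build(BaseWindowedBolt countBolt,
                                        BaseWindowedBolt timestampBolt) {
        TopologyBuilder builder = new TopologyBuilder();

        // A 200-tuple count window without tuple timestamps: there are no
        // watermarks, so the WindowManager starts expiring (and acking) tuples
        // during compaction once more than 100 are buffered, possibly before
        // the user's bolt has seen them in any window.
        builder.setBolt("no-timestamps",
                countBolt.withWindow(Count.of(200), Count.of(50)))
               .shuffleGrouping("spout");

        // The same window driven by a tuple timestamp field: tuples only
        // expire when the watermark moves, and the trigger policy first hands
        // every stored tuple to the bolt in a window, preserving
        // at-least-once.
        builder.setBolt("with-timestamps",
                timestampBolt.withWindow(Count.of(200), Count.of(50))
                             .withTimestampField("ts")
                             .withLag(Duration.seconds(5)))
               .shuffleGrouping("spout");

        return builder;
    }
}
```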
I'm not sure if we should just put a note in the windowing docs about this,
or if it would be better to remove the compaction mechanism. I'm leaning
towards removing the compaction mechanism. I don't think it actually saves
much memory, since evicted tuples remain stored in a list until the user's
bolt is passed the next window (at which point the expiration code would run
again and the tuples would be expired anyway), and the behavior is not easy to
understand without reading the source. For example, you can only get a count
window larger than 100 if you use tuple timestamps, since the manager will
expire tuples once the window reaches that size if watermarking is disabled.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/srdo/storm STORM-2525
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/2127.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2127
----
commit 1f2b7a05ad454dfb7d6d7e3737b56caf6fc36293
Author: Stig Rohde Døssing <[email protected]>
Date: 2017-05-20T17:46:03Z
STORM-2525: Fix flaky integration tests
----