GitHub user srdo opened a pull request:

    https://github.com/apache/storm/pull/2127

    STORM-2525: Fix flaky integration tests

    https://issues.apache.org/jira/browse/STORM-2525
    
    * EvictionPolicies based on watermarks could throw NPEs if they were asked
to decide eviction before the first watermark had arrived. This is possible if
the WindowManager decides to compact the window. Made the policies refuse to
evict events until the first watermark has arrived (see the first sketch after
this list).
    * TopoWrap.readLogs was hiding IOExceptions when reading from the
logviewer, replacing the real log with a dummy string. When this happened, the
tests were likely to fail with a misleading error. Made the IOException
propagate out to the test instead.
    * The tests now check all the windows produced by the windowing bolt. I
couldn't tell what the previous calculation of the expected number of windows
was going for. The original was `final int numberOfWindows =
allBoltLog.size() - windowSec / slideSec;`, but it seems more reasonable to me
to just check all the windows instead.
    * Made the tests fail if the emit count reported by the built-in metrics
fails to reach the expected minEmits within the time limit.
    * The windowing integration tests use the built-in metrics to wait for the
bolt or spout to emit enough tuples, then use the logviewer to read the worker
log. The log entries are then used to determine whether the windowing bolt
produced the right windows. If the log rolls after the test has decided that
enough tuples have been emitted, but before it manages to read the log, the
test is likely to fail, because the current log file won't contain the emits
that were rolled into the archived file. Increased the rate of tuple emission
from the test spouts, and reduced the number of tuples required before the
test starts evaluating the log. This is not a complete fix, but producing
100MB of log before we emit 500 tuples (currently the largest test) seems
unlikely.
    * The metrics sample rate is now set to 1 when using the TopoWrap class.
One of the TopoWrap methods returns the number of tuples emitted by a
component, but the accuracy of the returned number depends on the metrics
sample rate. This was causing test flakiness when using
TopoWrap.waitForProgress (see the config sketch after this list).
    * Replaced the Vagrant image with a newer Ubuntu image. The older image
seemed to perform poorly, maybe due to older VirtualBox guest additions? The
running time for me went from 40+ minutes to ~15.
    * Removed the cluster.xml in integration-test/config. It was unused.
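
    For the first point, the fix boils down to a guard of roughly the
following shape. This is a minimal, self-contained sketch with illustrative
names, not the actual storm-client eviction policy classes:

        // Sketch only: the real classes are the watermark-based
        // EvictionPolicy implementations in storm-client.
        final class GuardedEvictionPolicy {
            enum Action { EXPIRE, KEEP, STOP }

            private Long watermarkTs = null; // stays null until the first watermark

            void onWatermark(long ts) {
                watermarkTs = ts;
            }

            Action evict(long eventTs, long windowLengthMs) {
                if (watermarkTs == null) {
                    // Previously this path dereferenced the missing watermark and
                    // threw an NPE when compaction asked for an eviction decision.
                    // Refusing to evict is safe: nothing expires before the first
                    // watermark arrives.
                    return Action.STOP;
                }
                return eventTs <= watermarkTs - windowLengthMs
                        ? Action.EXPIRE : Action.KEEP;
            }
        }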
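
    The sample rate change amounts to pinning the stats sample rate when
TopoWrap submits a topology, along these lines (a sketch of the effect, not
the exact TopoWrap code):

        import org.apache.storm.Config;

        public class FullSampleRateConf {
            public static Config create() {
                Config conf = new Config();
                // Sample every tuple so the emit counts reported by the built-in
                // metrics are exact. With the default rate of 0.05 the counts are
                // extrapolated from a 5% sample, which makes waitForProgress
                // compare against a noisy number.
                conf.put(Config.TOPOLOGY_STATS_SAMPLE_RATE, 1.0d);
                return conf;
            }
        }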
    
    I've run the integration tests 5 times with no errors, so they seem stable
now.
    
    There is another, semi-related issue that I'd like some input on. I'll
raise a separate issue if this behavior seems broken to anyone else:
    
    The windowing docs claim that windowing offers at-least-once processing.
This is true for configurations using tuple timestamps (i.e. those ensuring
that
https://github.com/apache/storm/blob/64e29f365c9b5d3e15b33f33ab64e200345333e4/storm-client/src/jvm/org/apache/storm/topology/IWindowedBolt.java#L50
does not return null), but the count- and time-based configurations without
tuple timestamps may expire tuples before they've been presented to the user's
bolt in a window. This is because the WindowManager will attempt to expire
tuples if there are more than 100 tuples in the current window (see
https://github.com/apache/storm/blob/a4afacd9617d620f50cf026fc599821f7ac25c79/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java#L50).
The tuples are acked as soon as they're expired (see
https://github.com/apache/storm/blob/a4afacd9617d620f50cf026fc599821f7ac25c79/storm-client/src/jvm/org/apache/storm/windowing/WindowManager.java#L215).
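
    To make this concrete, here is a self-contained toy model of the scenario
(all names are mine; see the linked WindowManager lines for the real code). It
models a tumbling count window of 300 without tuple timestamps: the trigger
would only fire once 300 tuples have arrived, but compaction caps the buffer
at 100, so older tuples are expired and acked without ever being delivered:

        import java.util.ArrayDeque;
        import java.util.Deque;

        public class CompactionLossDemo {
            // Mirrors the 100-event threshold in WindowManager.
            static final int COMPACTION_THRESHOLD = 100;

            public static void main(String[] args) {
                Deque<Integer> buffered = new ArrayDeque<>();
                int ackedButNeverDelivered = 0;
                // Emit 250 tuples into a tumbling count window of 300: the
                // trigger never fires, so the bolt never sees a window...
                for (int tuple = 0; tuple < 250; tuple++) {
                    buffered.addLast(tuple);
                    // ...but compaction still expires (and acks) the oldest
                    // tuples once the buffer grows past the threshold.
                    while (buffered.size() > COMPACTION_THRESHOLD) {
                        buffered.removeFirst();
                        ackedButNeverDelivered++;
                    }
                }
                // Prints 150: those tuples were acked but never windowed.
                System.out.println(ackedButNeverDelivered
                        + " tuples acked without being delivered in any window");
            }
        }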
    
    When tuple timestamps are used, tuples will only expire when the watermark 
moves, and when the watermark moves, the trigger policy ensures all stored 
tuples get passed to the user's bolt in an appropriate window. When tuple 
timestamps, and thus watermarks, are disabled, tuples may get expired during 
compaction regardless of whether the user's bolt has seen those tuples yet.
    
    I'm not sure whether we should just put a note in the windowing docs about
this, or whether it would be better to remove the compaction mechanism. I'm
leaning towards removing it. I don't think it's actually saving much memory,
since evicted tuples remain stored in a list until the user's bolt is passed
the next window (at which point the expiration code would run again and the
tuples would be expired anyway), and the behavior is not easy to understand
without reading the source. For example, you can only get a count window
larger than 100 if you use tuple timestamps, since the manager will expire
tuples once the window reaches that size if watermarking is disabled (see the
sketch below).
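
    Concretely, with the current behavior a count window larger than 100 has
to opt in to tuple timestamps, roughly like this (a sketch; the "ts" field
name is made up and would have to be emitted by the upstream component):

        import org.apache.storm.topology.base.BaseWindowedBolt;
        import org.apache.storm.topology.base.BaseWindowedBolt.Count;
        import org.apache.storm.topology.base.BaseWindowedBolt.Duration;
        import org.apache.storm.windowing.TupleWindow;

        public class LargeCountWindowBolt extends BaseWindowedBolt {
            @Override
            public void execute(TupleWindow window) {
                // Process window.get() ...
            }

            public static BaseWindowedBolt configured() {
                return new LargeCountWindowBolt()
                        // A count window of 300 tuples, sliding every 100.
                        .withWindow(Count.of(300), Count.of(100))
                        // Enables watermarks, so eviction waits for them instead
                        // of expiring at the 100-tuple compaction threshold.
                        .withTimestampField("ts")
                        // Maximum allowed out-of-orderness for late tuples.
                        .withLag(Duration.seconds(1));
            }
        }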


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-2525

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/2127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2127
    
----
commit 1f2b7a05ad454dfb7d6d7e3737b56caf6fc36293
Author: Stig Rohde Døssing <[email protected]>
Date:   2017-05-20T17:46:03Z

    STORM-2525: Fix flaky integration tests

----

