GitHub user nickwallen opened a pull request:

    https://github.com/apache/metron/pull/1289

    METRON-1810 Storm Profiler Intermittent Test Failure

    This PR hopefully resolves some of the more frequent, intermittent 
`ProfilerIntegrationTest` failures.  
    
    ### Testing
    
    Run the integration tests and they should not fail.
    
    I have repeatedly run the integration tests on my laptop and have yet to 
see a failure.  I am also repeatedly triggering Travis CI builds on this branch 
to see if any failures occur there.  I can't prove the problem is solved, but I 
am hoping this helps.
    
    ### Changes
    
    * This change uses Caffeine's CacheWriter in place of the RemovalListener 
for profile expiration.  This is explained more in the next section.
    
    * Removed the ability to define the cache maintenance executor for the 
`DefaultMessageDistributor`.  This was only needed for testing 
RemovalListeners, which no longer exist.
    
    * I re-jiggered some of the integration tests to provide more information 
when they do fail. I removed the use of `waitOrTimeout` and replaced it with 
`assertEventually`.  I am also using a different style of assertion; 
Hamcrest-y.  These changes should provide additional information and also be 
more consistent with the rest of the code base.
    
    * After removing the dependency that provided `waitOrTimeout`, I had to 
rework some of the project dependencies because multiple versions 
of`httpclient` lib was being pulled in.  This caused the new REST function in 
`stellar-common` to blow up when a Stellar execution environment is loaded 
during the integration test.
    
    * Added additional debug logging for the caches including an estimate of 
the cache size.
    
    #### RemovalListener to CacheWriter
    
    Profiles are designed to expire and flush after a fixed period of time; the 
TTL.  The integration tests rely on some of the profile values being flushed by 
this TTL mechanism.  
    
    The TTL mechanism is driven by cache expiration. There is a cache of 
"active" profiles.  This cache is configured to have values timeout after they 
have not been accessed for a fixed period of time.  Once they timeout from the 
active cache, they are placed on an "expired" cache.  Messages are not applied 
to expired profiles, but these expired profiles hang around for a fixed period 
of time to allow them to be flushed.
    
    Previously, a Caffeine 
[RemovalListener](https://github.com/ben-manes/caffeine/wiki/Removal#removal-listeners)
 was used so that when a profile expires from the active cache, it is placed 
into the expired cache. When the tests fail, it seems that the 
`RemovalListener` is not notified in a timely fashion, so the profile doesn't 
make it to the expired cache and so never flushes to be read by the integration 
test.
    
     In Caffeine, these listeners are notified asynchronously, on a separate 
thread (via ForkJoinPool.commonPool()), not inline with cache reads or writes.  
For running tests that depend on RemovalListener's it is recommended to set the 
cache maintenance executor to something like 
`MoreExecutors.sameThreadExecutor()` so that cache maintenance is executed on 
the main execution thread when `cleanUp` is called.  This was done for the unit 
tests, but was not done for the integration tests. 
    
    The Caffeine Wiki mentions that when notification should be performed 
synchronously, which logically works for this use case, to use a 
[CacheWriter](https://github.com/ben-manes/caffeine/wiki/Writer) instead.
    
    > Removal listener operations are executed asynchronously using an 
Executor. The default executor is ForkJoinPool.commonPool() and can be 
overridden via Caffeine.executor(Executor). When the operation must be 
performed synchronously with the removal, use 
[CacheWriter](https://github.com/ben-manes/caffeine/wiki/Writer) instead.
    
    This does not negatively impact production performance because the 
`ActiveCacheWriter` only does work on a delete which occurs only rarely when a 
profile stops receiving messages, not on a write.  In addition, these caches 
are very read-heavy as in most cases the cache is only written to when a new 
profile is defined or on topology start-up.
    
    ## Pull Request Checklist
    - [ ] Is there a JIRA ticket associated with this PR? If not one needs to 
be created at [Metron 
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
    - [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
    - [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
    - [ ] Have you included steps to reproduce the behavior or problem that is 
being changed or addressed?
    - [ ] Have you included steps or a guide to how the change may be verified 
and tested manually?
    - [ ] Have you ensured that the full suite of tests and checks have been 
executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && 
dev-utilities/build-utils/verify_licenses.sh 
      ```
    - [ ] Have you written or updated unit tests and or integration tests to 
verify your changes?
    - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
    - [ ] Have you verified the basic functionality of the build by building 
and running locally with Vagrant full-dev environment or the equivalent?
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nickwallen/metron METRON-1810

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/1289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1289
    
----
commit db902ea6db116cdb92516a68196270dd61f2b566
Author: Nick Allen <nick@...>
Date:   2018-12-01T13:39:43Z

    WIP.. Still getting reliable failures when run in IDE

commit f68d89999310392c8354f64be34c3cae3bac0b86
Author: Nick Allen <nick@...>
Date:   2018-12-04T13:45:50Z

    WIP

commit 1d13682b14df6d7d389d633f07a2f3b4d156b3ef
Author: Nick Allen <nick@...>
Date:   2018-12-04T16:40:42Z

    Resolve classpath issues

commit 67d945781d12a952765ccba5272d01e30c451599
Author: Nick Allen <nick@...>
Date:   2018-12-04T17:09:03Z

    Undo change to GetProfileT
    est

commit 21ab3384cf73d005118b3ae04a4998f7718b7bd3
Author: Nick Allen <nick@...>
Date:   2018-12-04T19:35:46Z

    Enhanced logging of caches

commit 3728a55287ce7cee9242399da9cec3236c0e522d
Author: Nick Allen <nick@...>
Date:   2018-12-04T19:39:02Z

    Making sure useful information is available if failure occurs

commit 67574b8a69acf21a46b6caae51e495f02a55f53b
Author: Nick Allen <nick@...>
Date:   2018-12-04T19:42:01Z

    Restore original logger props

commit d4e144a2cd55f785776e03f424560b60fa98eaa7
Author: Nick Allen <nick@...>
Date:   2018-12-04T20:05:56Z

    Refactored testProcessingTimeWithTimeToLiveFlush to use assertEventually

commit 44df64df63f3aa26b688f196f8fe7576a8ccd9bc
Author: Nick Allen <nick@...>
Date:   2018-12-05T18:36:22Z

    Using Caffeine's CacheWriter instead of a RemovalListener for profile 
expiration

----


---

Reply via email to