GitHub user nickwallen opened a pull request:
https://github.com/apache/metron/pull/1289
METRON-1810 Storm Profiler Intermittent Test Failure
This PR hopefully resolves some of the more frequent, intermittent
`ProfilerIntegrationTest` failures.
### Testing
Run the integration tests and they should not fail.
I have repeatedly run the integration tests on my laptop and have yet to
see a failure. I am also repeatedly triggering Travis CI builds on this branch
to see if any failures occur there. I can't prove the problem is solved, but I
am hoping this helps.
### Changes
* This change uses Caffeine's CacheWriter in place of the RemovalListener
for profile expiration. This is explained more in the next section.
* Removed the ability to define the cache maintenance executor for the
`DefaultMessageDistributor`. This was only needed for testing
RemovalListeners, which no longer exist.
* I re-jiggered some of the integration tests to provide more information
when they do fail. I removed the use of `waitOrTimeout` and replaced it with
`assertEventually`. I am also using a different style of assertion;
Hamcrest-y. These changes should provide additional information and also be
more consistent with the rest of the code base.
* After removing the dependency that provided `waitOrTimeout`, I had to
rework some of the project dependencies because multiple versions
of`httpclient` lib was being pulled in. This caused the new REST function in
`stellar-common` to blow up when a Stellar execution environment is loaded
during the integration test.
* Added additional debug logging for the caches including an estimate of
the cache size.
#### RemovalListener to CacheWriter
Profiles are designed to expire and flush after a fixed period of time; the
TTL. The integration tests rely on some of the profile values being flushed by
this TTL mechanism.
The TTL mechanism is driven by cache expiration. There is a cache of
"active" profiles. This cache is configured to have values timeout after they
have not been accessed for a fixed period of time. Once they timeout from the
active cache, they are placed on an "expired" cache. Messages are not applied
to expired profiles, but these expired profiles hang around for a fixed period
of time to allow them to be flushed.
Previously, a Caffeine
[RemovalListener](https://github.com/ben-manes/caffeine/wiki/Removal#removal-listeners)
was used so that when a profile expires from the active cache, it is placed
into the expired cache. When the tests fail, it seems that the
`RemovalListener` is not notified in a timely fashion, so the profile doesn't
make it to the expired cache and so never flushes to be read by the integration
test.
In Caffeine, these listeners are notified asynchronously, on a separate
thread (via ForkJoinPool.commonPool()), not inline with cache reads or writes.
For running tests that depend on RemovalListener's it is recommended to set the
cache maintenance executor to something like
`MoreExecutors.sameThreadExecutor()` so that cache maintenance is executed on
the main execution thread when `cleanUp` is called. This was done for the unit
tests, but was not done for the integration tests.
The Caffeine Wiki mentions that when notification should be performed
synchronously, which logically works for this use case, to use a
[CacheWriter](https://github.com/ben-manes/caffeine/wiki/Writer) instead.
> Removal listener operations are executed asynchronously using an
Executor. The default executor is ForkJoinPool.commonPool() and can be
overridden via Caffeine.executor(Executor). When the operation must be
performed synchronously with the removal, use
[CacheWriter](https://github.com/ben-manes/caffeine/wiki/Writer) instead.
This does not negatively impact production performance because the
`ActiveCacheWriter` only does work on a delete which occurs only rarely when a
profile stops receiving messages, not on a write. In addition, these caches
are very read-heavy as in most cases the cache is only written to when a new
profile is defined or on topology start-up.
## Pull Request Checklist
- [ ] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [ ] Has your PR been rebased against the latest commit within the target
branch (typically master)?
- [ ] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [ ] Have you included steps or a guide to how the change may be verified
and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been
executed in the root metron folder via:
```
mvn -q clean integration-test install &&
dev-utilities/build-utils/verify_licenses.sh
```
- [ ] Have you written or updated unit tests and or integration tests to
verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nickwallen/metron METRON-1810
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/1289.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1289
----
commit db902ea6db116cdb92516a68196270dd61f2b566
Author: Nick Allen <nick@...>
Date: 2018-12-01T13:39:43Z
WIP.. Still getting reliable failures when run in IDE
commit f68d89999310392c8354f64be34c3cae3bac0b86
Author: Nick Allen <nick@...>
Date: 2018-12-04T13:45:50Z
WIP
commit 1d13682b14df6d7d389d633f07a2f3b4d156b3ef
Author: Nick Allen <nick@...>
Date: 2018-12-04T16:40:42Z
Resolve classpath issues
commit 67d945781d12a952765ccba5272d01e30c451599
Author: Nick Allen <nick@...>
Date: 2018-12-04T17:09:03Z
Undo change to GetProfileT
est
commit 21ab3384cf73d005118b3ae04a4998f7718b7bd3
Author: Nick Allen <nick@...>
Date: 2018-12-04T19:35:46Z
Enhanced logging of caches
commit 3728a55287ce7cee9242399da9cec3236c0e522d
Author: Nick Allen <nick@...>
Date: 2018-12-04T19:39:02Z
Making sure useful information is available if failure occurs
commit 67574b8a69acf21a46b6caae51e495f02a55f53b
Author: Nick Allen <nick@...>
Date: 2018-12-04T19:42:01Z
Restore original logger props
commit d4e144a2cd55f785776e03f424560b60fa98eaa7
Author: Nick Allen <nick@...>
Date: 2018-12-04T20:05:56Z
Refactored testProcessingTimeWithTimeToLiveFlush to use assertEventually
commit 44df64df63f3aa26b688f196f8fe7576a8ccd9bc
Author: Nick Allen <nick@...>
Date: 2018-12-05T18:36:22Z
Using Caffeine's CacheWriter instead of a RemovalListener for profile
expiration
----
---