GitHub user StefanRRichter opened a pull request:
https://github.com/apache/flink/pull/4855
[FLINK-5372] [tests] Fix RocksDBAsyncSnapshotTest.testCancelFullyAsynâ¦
â¦cCheckpoints()
## What is the purpose of the change
This PR fixes `RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()`,
so that the test can be reactivated.
Furthermore, it also fixes some problems in the same test class related to
unreleased native RocksDB resources (missing calls to
`RocksDBKeyedStateBackend::dispose()`). I did a search for similar problems by
placing log statements for instance creation and dispose calls in the log and
ensured that there are always matching pairs. I did this for the rocksdb tests
and flink-tests - found and fixed one more case in
`RocksDBStateBackendConfigTest`.
## Brief change log
First problem is, that the test uses a checkpoint stream factory that
produces streams that block on latches, then wants to call close() while the
stream is supposed to be blocked in an async thread executing and writing the
rocksdb snapshot. While the test fails is, that the first stream is actually
used to take s snapshot of the timer service into raw keyed state. The timer
service is not async and therefore the test blocks and cannot reach close(). I
will fix the test through a different factory, that gives a blocking stream on
the second call, i.e. for the request that is actually from the keyed backend.
For the second problem, this was a bit nasty to reproduce but simple in
cause and solution. The test testCleanupOfSnapshotsInFailureCase created an
instance of RocksDBKeyedStateBackend, but never called the dispose method to
release native resources. Since the latest RocksDB update, their code has
assertions in place that are executed in finalize() and will detect leaking
native resources. This happens only when the resource is GC'ed, and this
explains why you could only observe the test crash when it was executed with
more tests, eventually reaching a GC.
*Changes*
- *`RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()` used a
checkpoint stream factory that produces streams that block on latches, then
wants to call `close()` while the stream is supposed to be blocked in an async
thread executing and writing the rocksdb snapshot. The test fails because the
first created stream is actually used to take a snapshot of the timer service
into raw keyed state. The timer service is not async and therefore the test
blocks and cannot reach `close()`. The fix introduces a different factory that
gives a blocking stream on the second call, i.e. for the request that is
actually from the keyed backend.*
- *Introduces some missing calls to `RocksDBKeyedStateBackend::dispose()`*
## Verifying this change
`RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()` is reactivated
and works.
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
## Documentation
- Does this pull request introduce a new feature? (no)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StefanRRichter/flink rocksdb-async-test
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/4855.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4855
----
commit 15a4c7413f002d1ae336c076f1d82ac896f8c9f4
Author: Stefan Richter <[email protected]>
Date: 2017-10-19T05:10:21Z
[FLINK-5372] [tests] Fix
RocksDBAsyncSnapshotTest.testCancelFullyAsyncCheckpoints()
----
---