adoroszlai opened a new pull request, #4672:
URL: https://github.com/apache/ozone/pull/4672
## What changes were proposed in this pull request?
Surefire fork intermittently times out in `TestDecommissionAndMaintenance`.
Container DB is added to the cache:
```
2023-05-03 08:18:26,909 [EndpointStateMachine task thread for /0.0.0.0:43723 - 0 ] INFO utils.DatanodeStoreCache (DatanodeStoreCache.java:addDB(58)) - Added db /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db to cache
```
but later it is not found in the cache, so the datanode tries to open it again. This fails because the RocksDB instance is
already open and protected by an OS-level lock (`lock hold by current process`):
```
2023-05-03 08:18:57,086 [Command processor thread] ERROR utils.DatanodeStoreCache (DatanodeStoreCache.java:getDB(74)) - Failed to get DB store /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db
java.io.IOException: Failed init RocksDB, db path : /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1683101936 acquiring thread 139985634854656: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db/LOCK: No locks available
    at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:182)
    at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:212)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(AbstractDatanodeStore.java:147)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:99)
    at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaThreeImpl.<init>(DatanodeStoreSchemaThreeImpl.java:66)
    at org.apache.hadoop.ozone.container.common.utils.DatanodeStoreCache.getDB(DatanodeStoreCache.java:69)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getDB(BlockUtils.java:132)
```
The problem is that `DatanodeStoreCache` is a singleton, shared between
datanodes in integration tests using `MiniOzoneCluster`. Stopping a datanode
clears the cache, affecting all other datanodes.
This change adds a "mini cluster mode" flag in `DatanodeStoreCache`, which
prevents clearing the cache when set to true. Individual items are still
removed (`removeDB` called by `DbVolume.shutdown` and `HddsVolume.shutdown`).
https://issues.apache.org/jira/browse/HDDS-8539
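
For illustration only, the idea can be sketched as a simplified singleton cache. This is not the actual `DatanodeStoreCache` code: the class `StoreCacheSketch`, the `Handle` interface, and the methods `setMiniClusterMode`, `shutdownCache` and `size` are hypothetical names chosen for this sketch. Only `addDB`, `getDB` and `removeDB` are names taken from the description and logs above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a process-wide DB handle cache (not the Ozone class). */
final class StoreCacheSketch {

  /** Hypothetical handle type; a real store would release its DB lock in close(). */
  interface Handle {
    void close();
  }

  // Single instance per JVM, so every datanode in a mini cluster shares it.
  private static final StoreCacheSketch INSTANCE = new StoreCacheSketch();

  private final Map<String, Handle> cache = new ConcurrentHashMap<>();

  // Assumed flag: when true, a whole-cache shutdown becomes a no-op.
  private volatile boolean miniClusterMode;

  private StoreCacheSketch() {
  }

  static StoreCacheSketch getInstance() {
    return INSTANCE;
  }

  void setMiniClusterMode(boolean enabled) {
    miniClusterMode = enabled;
  }

  void addDB(String path, Handle handle) {
    cache.putIfAbsent(path, handle);
  }

  Handle getDB(String path) {
    return cache.get(path);
  }

  /** Removes and closes a single entry; still works in mini cluster mode. */
  void removeDB(String path) {
    Handle handle = cache.remove(path);
    if (handle != null) {
      handle.close();
    }
  }

  /** Closes and drops every entry, unless several datanodes share this cache. */
  void shutdownCache() {
    if (miniClusterMode) {
      return; // other datanodes in the same JVM may still need their entries
    }
    for (Handle handle : cache.values()) {
      handle.close();
    }
    cache.clear();
  }

  int size() {
    return cache.size();
  }
}
```

With the flag set, a datanode that shuts down removes only the entries it owns via `removeDB`, and the entries of the other datanodes stay in the shared cache.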
## How was this patch tested?
No fork timeout occurred in [100 runs of `TestDecommissionAndMaintenance`](https://github.com/adoroszlai/hadoop-ozone/actions/runs/4897439875)
(there were plenty of other failures, which are being fixed in other tasks).
CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/4900898212