adoroszlai opened a new pull request, #4672:
URL: https://github.com/apache/ozone/pull/4672

   ## What changes were proposed in this pull request?
   
   Surefire fork intermittently times out in `TestDecommissionAndMaintenance`.
   
   Container DB is added to the cache:
   
   ```
   2023-05-03 08:18:26,909 [EndpointStateMachine task thread for /0.0.0.0:43723 - 0 ] INFO  utils.DatanodeStoreCache (DatanodeStoreCache.java:addDB(58)) - Added db /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db to cache
   ```
   
   but a later lookup does not find it and tries to open it again, which fails because RocksDB is 
already open and protected by an OS-level lock (`lock hold by current process`):
   
   ```
   2023-05-03 08:18:57,086 [Command processor thread] ERROR utils.DatanodeStoreCache (DatanodeStoreCache.java:getDB(74)) - Failed to get DB store /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db
   java.io.IOException: Failed init RocksDB, db path : /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1683101936 acquiring thread 139985634854656: /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-ff176d5b-bea5-4cbe-a997-8236a6853a89/datanode-0/data-0/containers/hdds/ff176d5b-bea5-4cbe-a997-8236a6853a89/DS-4328e108-8c1a-4a6f-8bff-6f686dd50b24/container.db/LOCK: No locks available
        at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:182)
        at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:212)
        at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(AbstractDatanodeStore.java:147)
        at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:99)
        at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaThreeImpl.<init>(DatanodeStoreSchemaThreeImpl.java:66)
        at org.apache.hadoop.ozone.container.common.utils.DatanodeStoreCache.getDB(DatanodeStoreCache.java:69)
        at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getDB(BlockUtils.java:132)
   ```
   
   The problem is that `DatanodeStoreCache` is a singleton, shared by all 
datanodes in integration tests that use `MiniOzoneCluster`.  Stopping one datanode 
clears the whole cache, which also drops the entries of every other datanode.
   
   This change adds a "mini cluster mode" flag in `DatanodeStoreCache`, which 
prevents clearing the cache when set to true.  Individual items are still 
removed (`removeDB` called by `DbVolume.shutdown` and `HddsVolume.shutdown`).
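
The idea can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: the class `SketchStoreCache`, the `String` standing in for an open DB handle, and the method names `setMiniClusterMode` and `shutdownCache` are assumptions made for illustration. Only `addDB`, `getDB` and `removeDB` are names taken from the description above, and their signatures here are simplified.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative stand-in for a process-wide DB handle cache; not the Ozone class. */
final class SketchStoreCache {
  private static final SketchStoreCache INSTANCE = new SketchStoreCache();

  // Hypothetical flag: true when several datanodes share one JVM (mini cluster).
  private static volatile boolean miniClusterMode = false;

  // Maps a container DB path to its open handle (a String stands in for the store).
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  private SketchStoreCache() { }

  static SketchStoreCache getInstance() {
    return INSTANCE;
  }

  static void setMiniClusterMode(boolean enabled) {
    miniClusterMode = enabled;
  }

  void addDB(String path, String handle) {
    cache.putIfAbsent(path, handle);
  }

  String getDB(String path) {
    return cache.get(path);
  }

  // Per-entry removal keeps working in both modes.
  void removeDB(String path) {
    cache.remove(path);
  }

  // Bulk clear is skipped in mini cluster mode, because the entries
  // may belong to other datanodes that are still running in this JVM.
  void shutdownCache() {
    if (miniClusterMode) {
      return;
    }
    cache.clear();
  }

  public static void main(String[] args) {
    SketchStoreCache cache = getInstance();
    setMiniClusterMode(true);
    cache.addDB("/dn0/container.db", "store-0");
    cache.addDB("/dn1/container.db", "store-1");

    // Datanode 0 stops: it removes its own entry, then requests a full shutdown.
    cache.removeDB("/dn0/container.db");
    cache.shutdownCache();

    // Datanode 1's entry survives, so it is not opened a second time.
    System.out.println(cache.getDB("/dn1/container.db")); // prints "store-1"
  }
}
```

With the flag left at `false`, the same `shutdownCache()` call would empty the map, and the next lookup for datanode 1's path would miss the cache. That miss is what leads to the second open attempt shown in the log.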
   
   https://issues.apache.org/jira/browse/HDDS-8539
   
   ## How was this patch tested?
   
   No fork timeout in [100 runs of 
`TestDecommissionAndMaintenance`](https://github.com/adoroszlai/hadoop-ozone/actions/runs/4897439875)
 (but there were plenty of other failures, being fixed in other tasks).
   
   CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/4900898212


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

