ivandika3 opened a new pull request, #955: URL: https://github.com/apache/ratis/pull/955
## What changes were proposed in this pull request?

### Problem

When benchmarking the Ozone streaming pipeline using `ozone freon ockg`, the benchmark does not exit even though all the keys have been written. `hdds.ratis.raft.netty.dataStream.client.worker-group.share` is set to true to use the shared worker group optimization.

Using Arthas, I found that non-daemon threads such as "NettyClientStreamRpc-workerGroup--thread1" are still running after the benchmark has finished. The root cause is that the shared worker group is never closed, so its non-daemon threads keep the JVM alive and the JVM shutdown hook is never triggered. The benchmark shuts down normally when the share configuration is disabled.

### Background

The shared worker group appears to be a lazily instantiated singleton. It is instantiated when the first `WorkerGroupGetter` is instantiated and passed to a new `Connection` during construction. Because it is a singleton, this worker group is shared across all the Raft clients under the same Ozone client (please correct me if I'm wrong).

In the Ozone streaming write pipeline, every `BlockDatastreamOutput` (whose scope is a single block) creates a new `RaftClient`, which corresponds to a single `NettyClientStreamRpc` instance. These `RaftClient`s all use the shared worker group.

### Solution

The current solution uses a "modified" reference-counted `EventLoopGroup` built on the `ReferenceCountedObject` interface. Previously, I tried to use the implementation from `ReferenceCountedObject#wrap`. However, when the reference count is 1 and `release()` is invoked, that implementation completely releases the object and throws exceptions for any further operations, which does not suit this use case.

The "modified" reference count instantiates the shared worker group whenever it is retained and the previous reference count is 0. It also gracefully shuts down the worker group when it is released and the reference count becomes 0.
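
The retain/release behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class name `RecreatableShared` and its methods are hypothetical, and a `java.util.concurrent.ExecutorService` stands in for Netty's `EventLoopGroup` so that the example needs no dependencies. The point it shows is that the instance is shut down when the last holder releases it, and that the next `retain()` builds a fresh instance instead of failing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch: a shared resource that is recreated after its count drops to 0. */
final class RecreatableShared<T> {
  private final Supplier<T> factory;   // builds a new instance on the 0 -> 1 transition
  private final Consumer<T> shutdown;  // closes the instance on the 1 -> 0 transition
  private T instance;                  // null while the count is 0
  private int count;

  RecreatableShared(Supplier<T> factory, Consumer<T> shutdown) {
    this.factory = factory;
    this.shutdown = shutdown;
  }

  /** Returns the shared instance, creating it if the previous count was 0. */
  synchronized T retain() {
    if (count++ == 0) {
      instance = factory.get();
    }
    return instance;
  }

  /** Returns true if this call released the last reference and shut the instance down. */
  synchronized boolean release() {
    if (count == 0) {
      throw new IllegalStateException("release() without a matching retain()");
    }
    if (--count > 0) {
      return false;
    }
    final T toClose = instance;
    instance = null;
    shutdown.accept(toClose);
    return true;
  }
}

final class SharedGroupDemo {
  public static void main(String[] args) {
    // ExecutorService is only a stand-in for the shared worker group.
    RecreatableShared<ExecutorService> shared = new RecreatableShared<>(
        () -> Executors.newFixedThreadPool(2), ExecutorService::shutdown);

    ExecutorService first = shared.retain();   // count 0 -> 1, instance created
    ExecutorService again = shared.retain();   // count 1 -> 2, same instance
    System.out.println(first == again);        // true

    shared.release();                          // count 2 -> 1, still open
    shared.release();                          // count 1 -> 0, instance shut down
    System.out.println(first.isShutdown());    // true

    ExecutorService second = shared.retain();  // a new instance, not the old one
    System.out.println(second != first);       // true
    shared.release();
  }
}
```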

The worker group is retained whenever a new `WorkerGroupGetter` is instantiated (i.e., when a connection is created) and released when the connection is closed. Technically, the worker group is no longer a singleton: it is shared by all open connections, but it is removed and shut down once all connections are closed, and subsequent connections use a new shared worker group. This should guarantee that there is at most one worker group shared among the connections at any time, though not necessarily the same instance over the lifetime of the client.

The current solution is still rough and is meant as a reference for what the final solution might look like. Any advice is greatly appreciated.

This change also enables `hdds.ratis.raft.netty.dataStream.client.worker-group.share`, following the benchmark configuration at https://github.com/szetszwo/ozone-benchmark/blob/master/benchmark-conf.xml.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-1921

## How was this patch tested?

Existing unit tests (with worker group sharing enabled by default). Also tested using `ozone freon ockg` again.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
