ivandika3 opened a new pull request, #955:
URL: https://github.com/apache/ratis/pull/955

   ## What changes were proposed in this pull request?
   
   ### Problem
   
   When benchmarking the Ozone streaming pipeline using `ozone freon ockg`, the 
benchmark did not terminate even though all the keys had been written. 
`hdds.ratis.raft.netty.dataStream.client.worker-group.share` was set to true to 
use the shared worker group optimization.
   
   Using Arthas, I found that non-daemon threads such as 
"NettyClientStreamRpc-workerGroup–thread1" were still running after the 
benchmark had finished. The root cause is that the shared worker group is 
never closed, so its non-daemon threads keep the JVM alive and the JVM 
shutdown hook is never triggered. The benchmark shuts down normally when the 
share configuration is disabled.
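
   To illustrate the mechanism, here is a minimal sketch using only the JDK 
(it does not use Netty or Ratis, and `NonDaemonPoolDemo` is a hypothetical 
name). It assumes the worker group threads behave like the non-daemon workers 
of a plain `ExecutorService`: as long as the pool is not shut down, its idle 
worker keeps the JVM from exiting.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; a plain JDK pool stands in for the worker group.
final class NonDaemonPoolDemo {
  /** Runs one task on the pool and reports whether its worker is a daemon thread. */
  static boolean workerIsDaemon(ExecutorService pool) throws Exception {
    final Callable<Boolean> task = () -> Thread.currentThread().isDaemon();
    return pool.submit(task).get();
  }

  public static void main(String[] args) throws Exception {
    // The default thread factory creates non-daemon worker threads.
    final ExecutorService pool = Executors.newFixedThreadPool(1);
    System.out.println("worker is daemon: " + workerIsDaemon(pool));

    // Without this call the idle worker stays alive and the JVM does not exit
    // after main() returns.
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

   Removing the `pool.shutdown()` call reproduces the same kind of hang seen 
in the benchmark.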
   
   ### Background
   
   It seems that the shared worker group is a lazily instantiated singleton. 
It is created when the first `WorkerGroupGetter` is instantiated and passed to 
a new `Connection` during construction. Because it is a singleton, this worker 
group is shared across all the Raft clients under the same Ozone client 
(please correct me if I'm wrong).
   
   In the case of the Ozone streaming write pipeline, every 
`BlockDatastreamOutput` (whose scope is a single block) creates a new 
`RaftClient`, which corresponds to a single `NettyClientStreamRpc` instance. 
These `RaftClient`s all use the shared worker group.
   
   ### Solution
   
   The current solution uses a "modified" reference-counted `EventLoopGroup` 
based on the `ReferenceCountedObject` interface. Previously, I tried to use 
the implementation from `ReferenceCountedObject#wrap`. However, when the 
reference count is 1 and `release()` is invoked, it completely releases the 
object and throws exceptions for any further operations. That implementation 
does not seem to suit this use case.
   
   The "modified" reference count instantiates the shared worker group whenever 
it is retained and the previous reference count is 0. It also gracefully 
shuts down the worker group when it is released and the reference count 
becomes 0. It is retained whenever a new `WorkerGroupGetter` is instantiated 
(i.e., when a connection is created) and released when the connection is closed.
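
   The retain/release behaviour described above can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not the code 
in this PR: `RecreatableShared`, its `factory`, and its `disposer` are 
hypothetical names, and the resource is generic so that the sketch runs 
without Netty.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not the Ratis API): a reference-counted holder that
 * creates its resource on the 0 -> 1 transition and disposes of it on 1 -> 0.
 */
final class RecreatableShared<T> {
  private final Supplier<T> factory;   // e.g. () -> new NioEventLoopGroup(...)
  private final Consumer<T> disposer;  // e.g. EventLoopGroup::shutdownGracefully
  private T instance;                  // null while the count is 0
  private int count;

  RecreatableShared(Supplier<T> factory, Consumer<T> disposer) {
    this.factory = factory;
    this.disposer = disposer;
  }

  /** Increments the count, creating the resource if the count was 0. */
  synchronized T retain() {
    if (count == 0) {
      instance = factory.get();
    }
    count++;
    return instance;
  }

  /** Decrements the count; returns true if this call disposed of the resource. */
  synchronized boolean release() {
    if (count == 0) {
      throw new IllegalStateException("release() without a matching retain()");
    }
    count--;
    if (count > 0) {
      return false;
    }
    final T toDispose = instance;
    instance = null;  // the next retain() will create a fresh instance
    disposer.accept(toDispose);
    return true;
  }

  synchronized int getCount() {
    return count;
  }
}
```

   In this sketch, unlike the behaviour described for 
`ReferenceCountedObject#wrap`, calling `retain()` after the count has dropped 
to 0 does not throw; it creates a new instance.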
   
   Technically, the worker group is not a singleton anymore. The shared worker 
group is shared by all the connections, but it is removed and shut down once 
all connections are closed. Subsequent connections then use a new shared 
worker group. This should guarantee that at any time there is at most one 
worker group shared among the connections, though not necessarily the same 
instance over time.
   
   The current solution is still rough and is meant as a reference for what 
the final solution might look like. Any advice is greatly appreciated.
   
   Also enabled 
`hdds.ratis.raft.netty.dataStream.client.worker-group.share`, following the 
benchmark configuration at 
https://github.com/szetszwo/ozone-benchmark/blob/master/benchmark-conf.xml.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-1921
   
   ## How was this patch tested?
   
   Existing unit tests (with worker group sharing enabled by default). Also 
tested using `ozone freon ockg` again.
   

