fapifta opened a new pull request, #5122:
URL: https://github.com/apache/ozone/pull/5122

   ## What changes were proposed in this pull request?
   
   A detailed description of the problem can be found in the JIRA ticket.
   The TL;DR:
   The deadlock happens because SCM startup calls the listCA method on the 
SCM's cert client. That method is synchronized, but it connects to the leader 
SCM to get data and, because of how the SCM Security protocol server is 
implemented, waits until the leader comes out of safe mode. The wait has no 
timeout.
   In the meantime, as the SCM's security protocol server is already started, 
two other services have already filed requests, and the server uses the SCM's 
cert client to get the data they requested. Request processing cannot finish, 
because it cannot enter another synchronized method of the certificate client 
while listCA holds the lock.
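
The blocking pattern described above can be sketched in isolation. This is a minimal illustration, not Ozone code: `CertClient`, `getCertificate`, and the latch standing in for "the leader left safe mode" are hypothetical names chosen for the example. It only shows that a second synchronized method cannot run while a first one waits inside the same object monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SingleMonitorSketch {

  /** Hypothetical stand-in for the certificate client; not the Ozone class. */
  static class CertClient {
    private final CountDownLatch leaderOutOfSafeMode;
    private final CountDownLatch listEntered = new CountDownLatch(1);

    CertClient(CountDownLatch leaderOutOfSafeMode) {
      this.leaderOutOfSafeMode = leaderOutOfSafeMode;
    }

    // Holds the object monitor for the whole (simulated) remote wait.
    synchronized void listCA() throws InterruptedException {
      listEntered.countDown();
      leaderOutOfSafeMode.await();
    }

    // Needs the same monitor, so it queues behind listCA().
    synchronized String getCertificate() {
      return "cert";
    }
  }

  /** Returns true if the request was blocked until the wait in listCA() ended. */
  static boolean requestBlockedWhileListing() {
    CountDownLatch safeMode = new CountDownLatch(1);
    CountDownLatch served = new CountDownLatch(1);
    CertClient client = new CertClient(safeMode);

    Thread startup = new Thread(() -> {
      try {
        client.listCA();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Thread request = new Thread(() -> {
      client.getCertificate();
      served.countDown();
    });

    try {
      startup.start();
      client.listEntered.await();          // startup thread now owns the monitor
      request.start();
      boolean servedEarly = served.await(200, TimeUnit.MILLISECONDS);
      safeMode.countDown();                // simulate the leader leaving safe mode
      boolean servedLater = served.await(5, TimeUnit.SECONDS);
      startup.join();
      request.join();
      return !servedEarly && servedLater;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("blocked while listing: " + requestBlockedWhileListing());
  }
}
```

In the sketch the latch is released from the outside, so the program terminates. In the situation described in this PR nothing releases the wait, which is what turns the contention into a deadlock.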
   
   The proposed solution is to separate the two locks.
   ListCA and the related methods are used only from SCM code and from 
container operation clients for recovery. As recovery is not initiated during 
safe mode, it is safe to guard the listCA method and all other methods that 
access pemEncodedCACerts with a separate lock. This solves the problem because 
it unblocks other operations for the clients while the server side is working 
on getting the certificates properly persisted.
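
A minimal sketch of the lock separation, under the same assumptions as before: `CertClient`, `getCertificate`, `caListLock`, and the latch are hypothetical names for illustration, and the actual patch may structure the locking differently. Only `listCA` and `pemEncodedCACerts` come from the description above.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SplitLockSketch {

  /** Hypothetical stand-in for the certificate client; not the Ozone class. */
  static class CertClient {
    // Dedicated lock (assumed name) for everything touching pemEncodedCACerts.
    private final Object caListLock = new Object();
    private final CountDownLatch leaderOutOfSafeMode;
    private final CountDownLatch listEntered = new CountDownLatch(1);
    private List<String> pemEncodedCACerts = Collections.emptyList();

    CertClient(CountDownLatch leaderOutOfSafeMode) {
      this.leaderOutOfSafeMode = leaderOutOfSafeMode;
    }

    // Waits while holding only caListLock, not the object monitor.
    List<String> listCA() throws InterruptedException {
      synchronized (caListLock) {
        listEntered.countDown();
        leaderOutOfSafeMode.await();
        pemEncodedCACerts = Collections.singletonList("ca-pem");
        return pemEncodedCACerts;
      }
    }

    // Still uses the object monitor, which listCA() no longer holds.
    synchronized String getCertificate() {
      return "cert";
    }
  }

  /** Returns true if the request completed while listCA() was still waiting. */
  static boolean requestServedWhileListing() {
    CountDownLatch safeMode = new CountDownLatch(1);
    CountDownLatch served = new CountDownLatch(1);
    CertClient client = new CertClient(safeMode);

    Thread startup = new Thread(() -> {
      try {
        client.listCA();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    Thread request = new Thread(() -> {
      client.getCertificate();
      served.countDown();
    });

    try {
      startup.start();
      client.listEntered.await();          // startup thread is inside caListLock
      request.start();
      boolean servedWhileListing = served.await(5, TimeUnit.SECONDS);
      safeMode.countDown();                // let the simulated wait finish
      startup.join();
      request.join();
      return servedWhileListing;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("served while listing: " + requestServedWhileListing());
  }
}
```

The trade-off in this shape is that any method reading or writing `pemEncodedCACerts` has to use the dedicated lock consistently; mixing the two locks on the same field would reintroduce a race.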
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-9061
   
   ## How was this patch tested?
   I have no idea how we could write a stable reproduction case. The threads 
are scheduled by the JVM, and we would need to run everything in a particular 
order, with two other requests each at a specific stage of processing at the 
right time. I do not think this can be achieved by any code hackery without 
more significant modifications to the production code in question, which I 
believe is not worth it. However, if there is a cheap and easy way, I am open 
to learning about it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

