[ 
https://issues.apache.org/jira/browse/HADOOP-18851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755940#comment-17755940
 ] 

Vikas Kumar commented on HADOOP-18851:
--------------------------------------

Hello,

I have made the changes and uploading the patch file here for initial review 
and feedback. I parallel, I am trying to do more testing.

Requesting experts to please review the change. I have created branch from 
"trunk".

Changes:

1. made {*}AbstractDelegationTokenSecretManager.createPassword() 
non-synchronized{*}. There is one places where it requires synchronization,  
incrementDelegationTokenSeqNum() method. This method is still synchronized. But 
in case of HA, 

ZKDelegationTokenSecretManagerImpl.incrementDelegationTokenSeqNum() will be 
invoked and this doesn't needs to be synchronized on "this" object's monitor. 
As part of fix for HADOOP-16828, this method returns the pre-loaded seq number 
and if it reaches the limit, it again reserves the range. This part requires to 
be protected from being executed by concurrent threads, and here I used 
ReentrantLock to guard this instead of this.

2. *AbstractDelegationTokenSecretManager.checkToken()* :  removed the assertion 
for this object's monitor lock. As now all the DS has been made thread-safe. 
Read and update operations to the maps should be the issue now.

3. {*}Made AbstractDelegationTokenSecretManager.retrievePassword() 
non-synchronized{*}. Here getTokenInfo() was expected to be synchronized but 
since currentTokens map has been changed to thread-safe map , so this method is 
thared-safe. And for overriding class ZKDelegationTokenSecretManagerImpl, it 
reads from Zk that is also thread-safe.

4. {*}AbstractDelegationTokenSecretManager.{*}verifyToken(), this method also 
reads from the token map that is thread-safe. Or in case of HA, reading from ZK 
using CuratorFramework is thread safe.

5. ZKDelegationTokenSecretManager.incrementDelegationTokenSeqNum(): protected 
this method using RentrantLock to avoid being accessed concurrently from 
multiple threads.

 

Here intention is to make verifyToken and createPassword ( and few other 
methods as well) to run concurrently without blocking each other  much . In 
current implementation, even multiple verifyToken() method (that only reads ) 
blocks each other due to "synchronized" nature of the method.

Please let me know if I have missed anything, I would be happy to learn more. 
Please find below the patch:

[^0001-HADOOP-18851-Perfm-improvement-for-ZKDT-management.patch]

> AbstractDelegationTokenSecretManager- Performance improvement by optimising 
> the synchronization context
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18851
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18851
>             Project: Hadoop Common
>          Issue Type: Task
>            Reporter: Vikas Kumar
>            Assignee: Vikas Kumar
>            Priority: Major
>         Attachments: 
> 0001-HADOOP-18851-Perfm-improvement-for-ZKDT-management.patch, Screenshot 
> 2023-08-16 at 5.36.57 PM.png
>
>
> *Context:*
> KMS depends on hadoop-common for DT management. Recently we were analysing 
> one performance issue and following is out findings:
>  # Around 96% (196 out of 200) KMS container threads were in BLOCKED state at 
> following:
>  ## *AbstractDelegationTokenSecretManager.verifyToken()*
>  ## *AbstractDelegationTokenSecretManager.createPassword()* 
>  # And then process crashed.
>  
> {code:java}
> http-nio-9292-exec-200PRIORITY : 5THREAD ID : 0X00007F075C157800NATIVE ID : 
> 0X2C87FNATIVE ID (DECIMAL) : 182399STATE : BLOCKED
> stackTrace:
> java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.verifyToken(AbstractDelegationTokenSecretManager.java:474)
> - waiting to lock <0x00000005f2f545e8> (a 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenManager$ZKSecretManager)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenManager.verifyToken(DelegationTokenManager.java:213)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.authenticate(DelegationTokenAuthenticationHandler.java:396)
> at  {code}
> All the 199 out of 200 were blocked at above point.
> And the lock they are waiting for is acquired by a thread that was trying to 
> createPassword and publishing the same on ZK.
>  
> {code:java}
> stackTrace:
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1598)
> - locked <0x0000000749263ec0> (a org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1570)
> at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2235)
> at 
> org.apache.curator.framework.imps.SetDataBuilderImpl$7.call(SetDataBuilderImpl.java:398)
> at 
> org.apache.curator.framework.imps.SetDataBuilderImpl$7.call(SetDataBuilderImpl.java:385)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> at 
> org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:382)
> at 
> org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:358)
> at 
> org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:36)
> at 
> org.apache.curator.framework.recipes.shared.SharedValue.trySetValue(SharedValue.java:201)
> at 
> org.apache.curator.framework.recipes.shared.SharedCount.trySetCount(SharedCount.java:116)
> at 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager.incrSharedCount(ZKDelegationTokenSecretManager.java:586)
> at 
> org.apache.hadoop.security.token.delegation.ZKDelegationTokenSecretManager.incrementDelegationTokenSeqNum(ZKDelegationTokenSecretManager.java:601)
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:402)
> - locked <0x00000005f2f545e8> (a 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenManager$ZKSecretManager)
> at 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:48)
> at org.apache.hadoop.security.token.Token.<init>(Token.java:67)
> at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenManager.createToken(DelegationTokenManager.java:183)
>  {code}
> We can say that this thread is slow and has blocked remaining all. But 
> following is my observation:
>  
>  # verifyToken() and createPaswword() has been synchronized because one is 
> reading the tokenMap and another is updating the map. If it's only to protect 
> the map, then can't we simply use ConcurrentHashMap and remove the 
> "synchronized" keyword. Because due to this, all reader threads ( to 
> verifyToken()) are also blocking each other.
>  # IN HA env, It is recommended to use ZK to store DTs. We know that 
> CuratorFramework is thread safe.  
> ZKDelegationTokenSecretManager.incrementDelegationTokenSeqNum() only requires 
> to be protected from concurrent execution and it should be protected using 
> some other locks instead of "this". 
>  # With these changes, verifyToken() and createPaswword() will not block each 
> other. It will be blocked only at the time of updating the map.
>  # Similarly other methods can also be considered but these two are critical.
> I made these changes on my local and got the significant performance 
> improvement. 
> I request community to provide their input and if we agree, I can provide the 
> patch. Please let me know if any other details are required.
> Thanks. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to