Sammi Chen created HDDS-11587:
---------------------------------

             Summary: Ozone Manager not processing file put requests while 
enabling multi-tenancy
                 Key: HDDS-11587
                 URL: https://issues.apache.org/jira/browse/HDDS-11587
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Sammi Chen
            Assignee: Sammi Chen


After multi-tenancy is enabled, the OM state machine gets stuck retrying after 
a Kerberos authentication failure. Here is the stack trace:
{noformat}
"OM StateMachine ApplyTransaction Thread - 0" #200 daemon prio=5 os_prio=0 cpu=448066.02ms elapsed=407128.77s tid=0x00007f9c11d94000 nid=0x1196a waiting on condition  [0x00007f9bcf818000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep([email protected]/Native Method)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
        - locked <0x0000000712b80788> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy33.submitRequest(Unknown Source)
        at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRpcRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:185)
        at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:175)
        at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.getContainerWithPipelineBatch(StorageContainerLocationProtocolClientSideTranslatorPB.java:308)
        at org.apache.hadoop.ozone.om.ScmClient$1.loadAll(ScmClient.java:89)
        at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4118)
        at com.google.common.cache.LocalCache.getAll(LocalCache.java:4081)
        at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:5025)
        at org.apache.hadoop.ozone.om.ScmClient.getContainerLocations(ScmClient.java:114)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.refreshPipelineFromCache(KeyManagerImpl.java:1964)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.sortPipelineInfo(KeyManagerImpl.java:1692)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.buildFinalStatusList(KeyManagerImpl.java:1676)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1495)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1456)
        at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1449)
        at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.getNextListOfKeys(OzonePrefixPathImpl.java:163)
        at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.<init>(OzonePrefixPathImpl.java:107)
        at org.apache.hadoop.ozone.om.OzonePrefixPathImpl.getChildren(OzonePrefixPathImpl.java:91)
        at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.isAccessAllowedForSubPaths(RangerOzoneAuthorizer.java:399)
        at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:201)
        at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:91)
        at org.apache.hadoop.ozone.om.OmMetadataReader.lambda$8(OmMetadataReader.java:509)
        at org.apache.hadoop.ozone.om.OmMetadataReader$$Lambda$825/0x0000000840a62c40.get(Unknown Source)
        at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:61)
        at org.apache.hadoop.ozone.om.OmMetadataReader.checkAcls(OmMetadataReader.java:508)
        at org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:283)
        at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:375)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:359)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$816/0x0000000840a5e440.get(Unknown Source)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run([email protected]/CompletableFuture.java:1700)
        at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
        at java.lang.Thread.run([email protected]/Thread.java:829)
{noformat}

These are the corresponding OM logs:

{noformat}
2024-10-09 07:24:10,752 WARN [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2024-10-09 07:24:10,753 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.io.retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.io.IOException: DestHost:destPort node3.ozone-test-sathishkumar.coelab.*.com:9860 , LocalHost:localPort node2.ozone-test-sathishkumar.coelab.*.com/10.129.116.49:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)], while invoking $Proxy33.submitRequest over nodeId=node1,nodeAddress=node3.ozone-test-sathishkumar.coelab.*.com/10.129.116.126:9860 after 79 failover attempts. Trying to failover after sleeping for 2000ms.
{noformat}

The root cause is that when multi-tenancy is enabled, OM creates a 
RangerClientMultiTenantAccessController instance, which in turn creates a 
RangerClient. The RangerClient implementation logs in again with the OM 
Kerberos principal, obtains a new UGI (UgiB), and sets the static loginUserRef 
field of UserGroupInformation to UgiB. This replaces UgiA, the UGI created when 
OzoneManager first started. UgiA has already been passed to all OM RPC servers 
for communication with remote peers.

The code below is from Client.java in the Hadoop common module. When Kerberos 
authentication fails, the client tries to relogin from the Kerberos keytab if 
shouldAuthenticateOverKrb returns true. In the current case it returns false, 
because loginUser (UgiB) does not equal currentUser (UgiA). So once UgiA 
expires, it is never refreshed through relogin.
{code:java}
    private synchronized boolean shouldAuthenticateOverKrb() throws IOException {
      UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
      UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
      UserGroupInformation realUser = currentUser.getRealUser();
      if (authMethod == AuthMethod.KERBEROS && loginUser != null &&
      // Make sure user logged in using Kerberos either keytab or TGT
          loginUser.hasKerberosCredentials() &&
          // relogin only in case it is the login user (e.g. JT)
          // or superuser (like oozie).
          (loginUser.equals(currentUser) || loginUser.equals(realUser))) {
        return true;
      }
      return false;
    }
{code}
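
To make the failure mode concrete, here is a minimal standalone sketch of the sequence described above. It does not use Hadoop classes: LoginUserModel, Ugi, LOGIN_USER_REF and loginFromKeytab are hypothetical stand-ins, and the principal string is a placeholder. It assumes, as the analysis above implies, that two UGIs obtained from separate logins of the same principal do not compare equal, and it keeps only the loginUser/currentUser comparison from the real check.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified model of the login-user replacement; NOT Hadoop code. */
public class LoginUserModel {

  /** Hypothetical miniature UGI: equal only if backed by the same login. */
  static final class Ugi {
    // Stands in for the per-login credentials object (assumption).
    private final Object subject = new Object();
    private final String principal;

    Ugi(String principal) {
      this.principal = principal;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof Ugi && subject == ((Ugi) o).subject;
    }

    @Override
    public int hashCode() {
      return System.identityHashCode(subject);
    }

    @Override
    public String toString() {
      return principal;
    }
  }

  // Models the static loginUserRef field: one process-wide login user.
  static final AtomicReference<Ugi> LOGIN_USER_REF = new AtomicReference<>();

  /** Models a keytab login that overwrites the process-wide login user. */
  static Ugi loginFromKeytab(String principal) {
    Ugi ugi = new Ugi(principal);
    LOGIN_USER_REF.set(ugi);
    return ugi;
  }

  /** Reduced form of the check: relogin only if current user is the login user. */
  static boolean shouldAuthenticateOverKrb(Ugi currentUser) {
    Ugi loginUser = LOGIN_USER_REF.get();
    return loginUser != null && loginUser.equals(currentUser);
  }

  public static void main(String[] args) {
    // OM startup: UgiA becomes the login user and is handed to the RPC layer.
    Ugi ugiA = loginFromKeytab("om/host@REALM");
    System.out.println("before second login: " + shouldAuthenticateOverKrb(ugiA));

    // Second login with the same principal replaces the static reference.
    Ugi ugiB = loginFromKeytab("om/host@REALM");
    System.out.println("after second login (UgiA): " + shouldAuthenticateOverKrb(ugiA));
    System.out.println("after second login (UgiB): " + shouldAuthenticateOverKrb(ugiB));
  }
}
```

In this model the check passes for UgiA only until the second login happens. After that, RPC clients still running as UgiA never qualify for relogin, which matches the endless retry seen in the stack trace.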




