[ 
https://issues.apache.org/jira/browse/HDDS-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated HDDS-11587:
------------------------------------
    Status: Patch Available  (was: Open)

> Ozone Manager not processing file put requests while enabling multi-tenancy
> ---------------------------------------------------------------------------
>
>                 Key: HDDS-11587
>                 URL: https://issues.apache.org/jira/browse/HDDS-11587
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sammi Chen
>            Assignee: Sammi Chen
>            Priority: Major
>              Labels: pull-request-available
>
> After multi-tenancy is enabled, the OM state machine gets stuck retrying
> after Kerberos authentication failures. Here is the stack trace:
> {noformat}
> "OM StateMachine ApplyTransaction Thread - 0" #200 daemon prio=5 os_prio=0 cpu=448066.02ms elapsed=407128.77s tid=0x00007f9c11d94000 nid=0x1196a waiting on condition  [0x00007f9bcf818000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep([email protected]/Native Method)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
>         - locked <0x0000000712b80788> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>         at com.sun.proxy.$Proxy33.submitRequest(Unknown Source)
>         at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRpcRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:185)
>         at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:175)
>         at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.getContainerWithPipelineBatch(StorageContainerLocationProtocolClientSideTranslatorPB.java:308)
>         at org.apache.hadoop.ozone.om.ScmClient$1.loadAll(ScmClient.java:89)
>         at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4118)
>         at com.google.common.cache.LocalCache.getAll(LocalCache.java:4081)
>         at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:5025)
>         at org.apache.hadoop.ozone.om.ScmClient.getContainerLocations(ScmClient.java:114)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.refreshPipelineFromCache(KeyManagerImpl.java:1964)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.sortPipelineInfo(KeyManagerImpl.java:1692)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.buildFinalStatusList(KeyManagerImpl.java:1676)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1495)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1456)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1449)
>         at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.getNextListOfKeys(OzonePrefixPathImpl.java:163)
>         at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.<init>(OzonePrefixPathImpl.java:107)
>         at org.apache.hadoop.ozone.om.OzonePrefixPathImpl.getChildren(OzonePrefixPathImpl.java:91)
>         at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.isAccessAllowedForSubPaths(RangerOzoneAuthorizer.java:399)
>         at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:201)
>         at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:91)
>         at org.apache.hadoop.ozone.om.OmMetadataReader.lambda$8(OmMetadataReader.java:509)
>         at org.apache.hadoop.ozone.om.OmMetadataReader$$Lambda$825/0x0000000840a62c40.get(Unknown Source)
>         at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:61)
>         at org.apache.hadoop.ozone.om.OmMetadataReader.checkAcls(OmMetadataReader.java:508)
>         at org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:283)
>         at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
>         at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:375)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:359)
>         at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$816/0x0000000840a5e440.get(Unknown Source)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run([email protected]/CompletableFuture.java:1700)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
> These are the OM logs:
> {noformat}
> 2024-10-09 07:24:10,752 WARN [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> 2024-10-09 07:24:10,753 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.io.retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.io.IOException: DestHost:destPort node3.ozone-test-sathishkumar.coelab.*.com:9860 , LocalHost:localPort node2.ozone-test-sathishkumar.coelab.*.com/10.129.116.49:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)], while invoking $Proxy33.submitRequest over nodeId=node1,nodeAddress=node3.ozone-test-sathishkumar.coelab.*.com/10.129.116.126:9860 after 79 failover attempts. Trying to failover after sleeping for 2000ms.
> {noformat}
> The root cause: when multi-tenancy is enabled, OM creates a
> RangerClientMultiTenantAccessController instance, which in turn creates a
> RangerClient. The RangerClient implementation logs in again with the OM
> Kerberos principal, obtains a new UGI (UgiB), and sets the static
> loginUserRef field of UserGroupInformation to UgiB, replacing UgiA, the UGI
> created when OzoneManager first started. UgiA has already been passed to all
> OM RPC servers for communication with remote peers.
> The code below is from Client.java in the Hadoop common module. When Kerberos
> authentication fails, the client tries to relogin from the Kerberos keytab
> only if shouldAuthenticateOverKrb returns true. In this case it returns
> false, because the loginUser (UgiB) does not equal the currentUser (UgiA). So
> once UgiA's credentials expire, they are never renewed through relogin.
> {code:java}
>     private synchronized boolean shouldAuthenticateOverKrb() throws IOException {
>       UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
>       UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
>       UserGroupInformation realUser = currentUser.getRealUser();
>       if (authMethod == AuthMethod.KERBEROS && loginUser != null &&
>           // Make sure user logged in using Kerberos either keytab or TGT
>           loginUser.hasKerberosCredentials() &&
>           // relogin only in case it is the login user (e.g. JT)
>           // or superuser (like oozie).
>           (loginUser.equals(currentUser) || loginUser.equals(realUser))) {
>         return true;
>       }
>       return false;
>     }
> {code}
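
The failure mode described above can be reproduced in miniature without Hadoop. The following is only a sketch under stated assumptions: `Ugi`, `LOGIN_USER`, `loginFromKeytab`, and `shouldRelogin` are hypothetical stand-ins, not Hadoop APIs, and the principal string is made up. It assumes UGI equality depends on the identity of the underlying login subject rather than on the principal name, which is what the description implies, and it leaves out the authMethod, hasKerberosCredentials, and realUser checks.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Simplified model of the static login-user replacement; not Hadoop code. */
public class LoginUserReplacementSketch {

    /** Hypothetical stand-in for a UGI: equal only if it wraps the same subject object. */
    static final class Ugi {
        private final Object subject = new Object();
        private final String principal;

        Ugi(String principal) {
            this.principal = principal;
        }

        @Override
        public boolean equals(Object o) {
            // Assumption: identity of the subject decides equality, not the principal name.
            return o instanceof Ugi && ((Ugi) o).subject == subject;
        }

        @Override
        public int hashCode() {
            return System.identityHashCode(subject);
        }

        @Override
        public String toString() {
            return principal;
        }
    }

    /** Models the process-wide static login-user reference. */
    static final AtomicReference<Ugi> LOGIN_USER = new AtomicReference<>();

    /** Models a keytab login: creates a fresh Ugi and overwrites the static reference. */
    static Ugi loginFromKeytab(String principal) {
        Ugi ugi = new Ugi(principal);
        LOGIN_USER.set(ugi);
        return ugi;
    }

    /** Keeps only the loginUser-versus-currentUser comparison from the quoted method. */
    static boolean shouldRelogin(Ugi currentUser) {
        Ugi loginUser = LOGIN_USER.get();
        return loginUser != null && loginUser.equals(currentUser);
    }

    public static void main(String[] args) {
        // First login at startup: this Ugi is handed to the RPC layer (UgiA).
        Ugi ugiA = loginFromKeytab("om/host@REALM");
        System.out.println("before second login: " + shouldRelogin(ugiA));

        // Second login with the same principal (UgiB) replaces the static reference.
        loginFromKeytab("om/host@REALM");
        System.out.println("after second login:  " + shouldRelogin(ugiA));
    }
}
```

In this model the second login uses the same principal, yet the check made on behalf of UgiA flips from true to false, because the static reference now points at a different object.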


