[
https://issues.apache.org/jira/browse/HDDS-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Attila Doroszlai updated HDDS-11587:
------------------------------------
Status: Patch Available (was: Open)
> Ozone Manager not processing file put requests while enabling multi-tenancy
> ---------------------------------------------------------------------------
>
> Key: HDDS-11587
> URL: https://issues.apache.org/jira/browse/HDDS-11587
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Sammi Chen
> Assignee: Sammi Chen
> Priority: Major
> Labels: pull-request-available
>
> After multi-tenancy is enabled, the OM state machine gets stuck retrying
> after a Kerberos authentication failure. Here is the stack trace:
> {noformat}
> "OM StateMachine ApplyTransaction Thread - 0" #200 daemon prio=5 os_prio=0 cpu=448066.02ms elapsed=407128.77s tid=0x00007f9c11d94000 nid=0x1196a waiting on condition [0x00007f9bcf818000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep([email protected]/Native Method)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
> - locked <0x0000000712b80788> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy33.submitRequest(Unknown Source)
> at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRpcRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:185)
> at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:175)
> at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.getContainerWithPipelineBatch(StorageContainerLocationProtocolClientSideTranslatorPB.java:308)
> at org.apache.hadoop.ozone.om.ScmClient$1.loadAll(ScmClient.java:89)
> at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4118)
> at com.google.common.cache.LocalCache.getAll(LocalCache.java:4081)
> at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:5025)
> at org.apache.hadoop.ozone.om.ScmClient.getContainerLocations(ScmClient.java:114)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.refreshPipelineFromCache(KeyManagerImpl.java:1964)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.sortPipelineInfo(KeyManagerImpl.java:1692)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.buildFinalStatusList(KeyManagerImpl.java:1676)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1495)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1456)
> at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1449)
> at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.getNextListOfKeys(OzonePrefixPathImpl.java:163)
> at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.<init>(OzonePrefixPathImpl.java:107)
> at org.apache.hadoop.ozone.om.OzonePrefixPathImpl.getChildren(OzonePrefixPathImpl.java:91)
> at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.isAccessAllowedForSubPaths(RangerOzoneAuthorizer.java:399)
> at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:201)
> at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:91)
> at org.apache.hadoop.ozone.om.OmMetadataReader.lambda$8(OmMetadataReader.java:509)
> at org.apache.hadoop.ozone.om.OmMetadataReader$$Lambda$825/0x0000000840a62c40.get(Unknown Source)
> at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:61)
> at org.apache.hadoop.ozone.om.OmMetadataReader.checkAcls(OmMetadataReader.java:508)
> at org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:283)
> at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
> at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:375)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:359)
> at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$816/0x0000000840a5e440.get(Unknown Source)
> at java.util.concurrent.CompletableFuture$AsyncSupply.run([email protected]/CompletableFuture.java:1700)
> at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run([email protected]/Thread.java:829)
> {noformat}
> These are the OM logs:
> {noformat}
> 2024-10-09 07:24:10,752 WARN [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> 2024-10-09 07:24:10,753 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.io.retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.io.IOException: DestHost:destPort node3.ozone-test-sathishkumar.coelab.*.com:9860 , LocalHost:localPort node2.ozone-test-sathishkumar.coelab.*.com/10.129.116.49:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)], while invoking $Proxy33.submitRequest over nodeId=node1,nodeAddress=node3.ozone-test-sathishkumar.coelab.*.com/10.129.116.126:9860 after 79 failover attempts. Trying to failover after sleeping for 2000ms.
> {noformat}
> The root cause: when multi-tenancy is enabled, OM creates a
> RangerClientMultiTenantAccessController instance, which in turn creates a
> RangerClient. The RangerClient implementation logs in again with the OM
> Kerberos principal, obtains a new UGI (UgiB), and sets the static
> loginUserRef field of UserGroupInformation to UgiB. This replaces UgiA, the
> UGI created when OzoneManager first started. UgiA has already been passed to
> all OM RPC servers, which use it to communicate with remote peers.
> The code below is from Client.java in the hadoop-common module. When Kerberos
> authentication fails, the client tries to relogin from the Kerberos keytab
> only if shouldAuthenticateOverKrb returns true. In this case it returns
> false, because loginUser (UgiB) does not equal currentUser (UgiA). So once
> UgiA's credentials expire, they are never refreshed through relogin.
> {code:java}
> private synchronized boolean shouldAuthenticateOverKrb() throws IOException {
>   UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
>   UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
>   UserGroupInformation realUser = currentUser.getRealUser();
>   if (authMethod == AuthMethod.KERBEROS && loginUser != null &&
>       // Make sure user logged in using Kerberos either keytab or TGT
>       loginUser.hasKerberosCredentials() &&
>       // relogin only in case it is the login user (e.g. JT)
>       // or superuser (like oozie).
>       (loginUser.equals(currentUser) || loginUser.equals(realUser))) {
>     return true;
>   }
>   return false;
> }
> {code}
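
To make the mismatch described above concrete, here is a minimal,
self-contained Java sketch. It does not use Hadoop at all: the Ugi class,
the loginUserRef field, and the loginFromKeytab helper are hypothetical
stand-ins for illustration only. The sketch assumes that two UGI objects
compare equal only when they wrap the same login subject instance, not when
they merely carry the same principal name, and it models only the
loginUser/currentUser comparison from shouldAuthenticateOverKrb.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LoginUserMismatchSketch {

  /** Hypothetical stand-in for a UGI: equality is by subject identity (assumption). */
  static final class Ugi {
    private final Object subject = new Object();
    private final String principal;

    Ugi(String principal) {
      this.principal = principal;
    }

    @Override
    public boolean equals(Object o) {
      if (o == this) {
        return true;
      }
      if (!(o instanceof Ugi)) {
        return false;
      }
      // Same principal name is not enough; the subject must be the same object.
      return subject == ((Ugi) o).subject;
    }

    @Override
    public int hashCode() {
      return System.identityHashCode(subject);
    }

    @Override
    public String toString() {
      return principal;
    }
  }

  /** Hypothetical stand-in for the static login-user reference. */
  static final AtomicReference<Ugi> loginUserRef = new AtomicReference<>();

  /** Each login creates a fresh Ugi and overwrites the static reference. */
  static Ugi loginFromKeytab(String principal) {
    Ugi ugi = new Ugi(principal);
    loginUserRef.set(ugi);
    return ugi;
  }

  /** Simplified check: relogin is allowed only for the current login user. */
  static boolean shouldAuthenticateOverKrb(Ugi currentUser) {
    Ugi loginUser = loginUserRef.get();
    return loginUser != null && loginUser.equals(currentUser);
  }

  public static void main(String[] args) {
    // OM startup: UgiA becomes the login user and is handed to the RPC layer.
    Ugi ugiA = loginFromKeytab("om/host@EXAMPLE.COM");
    System.out.println(shouldAuthenticateOverKrb(ugiA)); // true

    // A second login with the same principal replaces the static reference.
    Ugi ugiB = loginFromKeytab("om/host@EXAMPLE.COM");
    System.out.println(shouldAuthenticateOverKrb(ugiA)); // false
    System.out.println(shouldAuthenticateOverKrb(ugiB)); // true
  }
}
```

In the sketch, the RPC layer still holds UgiA after the second login, so the
relogin check fails for it even though both objects name the same principal.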