Sammi Chen created HDDS-11587:
---------------------------------
Summary: Ozone Manager not processing file put requests while
enabling multi-tenancy
Key: HDDS-11587
URL: https://issues.apache.org/jira/browse/HDDS-11587
Project: Apache Ozone
Issue Type: Bug
Reporter: Sammi Chen
Assignee: Sammi Chen
After multi-tenancy is enabled, the OM state machine gets stuck retrying after
Kerberos authentication failures. Here is the stack trace:
{noformat}
"OM StateMachine ApplyTransaction Thread - 0" #200 daemon prio=5 os_prio=0 cpu=448066.02ms elapsed=407128.77s tid=0x00007f9c11d94000 nid=0x1196a waiting on condition [0x00007f9bcf818000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep([email protected]/Native Method)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.processWaitTimeAndRetryInfo(RetryInvocationHandler.java:131)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:108)
    - locked <0x0000000712b80788> (a org.apache.hadoop.io.retry.RetryInvocationHandler$Call)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy33.submitRequest(Unknown Source)
    at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRpcRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:185)
    at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.submitRequest(StorageContainerLocationProtocolClientSideTranslatorPB.java:175)
    at org.apache.hadoop.hdds.scm.protocolPB.StorageContainerLocationProtocolClientSideTranslatorPB.getContainerWithPipelineBatch(StorageContainerLocationProtocolClientSideTranslatorPB.java:308)
    at org.apache.hadoop.ozone.om.ScmClient$1.loadAll(ScmClient.java:89)
    at com.google.common.cache.LocalCache.loadAll(LocalCache.java:4118)
    at com.google.common.cache.LocalCache.getAll(LocalCache.java:4081)
    at com.google.common.cache.LocalCache$LocalLoadingCache.getAll(LocalCache.java:5025)
    at org.apache.hadoop.ozone.om.ScmClient.getContainerLocations(ScmClient.java:114)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.refreshPipelineFromCache(KeyManagerImpl.java:1964)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.sortPipelineInfo(KeyManagerImpl.java:1692)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.buildFinalStatusList(KeyManagerImpl.java:1676)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1495)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1456)
    at org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:1449)
    at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.getNextListOfKeys(OzonePrefixPathImpl.java:163)
    at org.apache.hadoop.ozone.om.OzonePrefixPathImpl$PathIterator.<init>(OzonePrefixPathImpl.java:107)
    at org.apache.hadoop.ozone.om.OzonePrefixPathImpl.getChildren(OzonePrefixPathImpl.java:91)
    at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.isAccessAllowedForSubPaths(RangerOzoneAuthorizer.java:399)
    at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:201)
    at org.apache.ranger.authorization.ozone.authorizer.RangerOzoneAuthorizer.checkAccess(RangerOzoneAuthorizer.java:91)
    at org.apache.hadoop.ozone.om.OmMetadataReader.lambda$8(OmMetadataReader.java:509)
    at org.apache.hadoop.ozone.om.OmMetadataReader$$Lambda$825/0x0000000840a62c40.get(Unknown Source)
    at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:61)
    at org.apache.hadoop.ozone.om.OmMetadataReader.checkAcls(OmMetadataReader.java:508)
    at org.apache.hadoop.ozone.om.request.OMClientRequest.checkACLsWithFSO(OMClientRequest.java:283)
    at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:102)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:375)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:359)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine$$Lambda$816/0x0000000840a5e440.get(Unknown Source)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run([email protected]/CompletableFuture.java:1700)
    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:829)
{noformat}
These are the OM logs:
{noformat}
2024-10-09 07:24:10,752 WARN [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2024-10-09 07:24:10,753 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.io.retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.io.IOException: DestHost:destPort node3.ozone-test-sathishkumar.coelab.*.com:9860 , LocalHost:localPort node2.ozone-test-sathishkumar.coelab.*.com/10.129.116.49:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)], while invoking $Proxy33.submitRequest over nodeId=node1,nodeAddress=node3.ozone-test-sathishkumar.coelab.*.com/10.129.116.126:9860 after 79 failover attempts. Trying to failover after sleeping for 2000ms.
{noformat}
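The stack and logs above show the ApplyTransaction thread sleeping inside the
retry handler. The following is a minimal, self-contained sketch of why that
would stall later write requests, assuming transactions are applied one at a
time on that single thread (an assumption suggested by the thread name, not
verified here). BlockedApplyThreadModel and its method are hypothetical names
for illustration and are not Ozone code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model: one apply thread, one request stuck in a retry loop. */
final class BlockedApplyThreadModel {

    /**
     * Submits a request that retries forever, then a second request behind it.
     * Returns true if the second request finishes within waitMillis.
     */
    static boolean laterRequestCompletes(long waitMillis) {
        // Assumption: a single thread applies all transactions in order.
        ExecutorService applyThread = Executors.newSingleThreadExecutor();
        try {
            // First request: models the sleep-and-retry loop seen in the stack.
            applyThread.submit(() -> {
                try {
                    while (true) {
                        Thread.sleep(50); // stand-in for "sleeping for 2000ms"
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Second request: models a later file put queued behind the first.
            Future<String> put = applyThread.submit(() -> "put applied");
            try {
                put.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // still queued behind the retrying request
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            applyThread.shutdownNow(); // interrupts the retry loop
        }
    }

    public static void main(String[] args) {
        System.out.println("later request completed: " + laterRequestCompletes(300));
    }
}
```

In this model the second request never runs while the first one keeps
retrying, which matches the symptom in the summary.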
The root cause is that enabling multi-tenancy creates a
RangerClientMultiTenantAccessController instance, which in turn creates a
RangerClient. The RangerClient implementation logs in again with the OM
Kerberos principal, obtains a new UGI (UgiB), and sets the static loginUserRef
field of UserGroupInformation to UgiB, replacing UgiA, the UGI created when
OzoneManager first started. UgiA has already been passed into all OM RPC
servers and is what they use to communicate with remote peers.
Below is Client.java from the Hadoop common module. When Kerberos
authentication fails, the client tries to relogin from the Kerberos keytab
only if shouldAuthenticateOverKrb returns true. In this case it returns false,
because loginUser (UgiB) does not equal currentUser (UgiA). So once UgiA
expires, no relogin happens to renew it.
{code:java}
private synchronized boolean shouldAuthenticateOverKrb() throws IOException {
  UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
  UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
  UserGroupInformation realUser = currentUser.getRealUser();
  if (authMethod == AuthMethod.KERBEROS && loginUser != null &&
      // Make sure user logged in using Kerberos either keytab or TGT
      loginUser.hasKerberosCredentials() &&
      // relogin only in case it is the login user (e.g. JT)
      // or superuser (like oozie).
      (loginUser.equals(currentUser) || loginUser.equals(realUser))) {
    return true;
  }
  return false;
}
{code}
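To make the failure mode concrete, here is a small self-contained model of the
login-user replacement and the equality check. It does not use Hadoop classes:
LoginUserModel, Ugi, loginFromKeytab and shouldRelogin are hypothetical names,
the principal string is made up, and the model assumes that two logins of the
same principal produce objects that are not equal to each other, as the report
describes for UgiA and UgiB.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the static login user; not the Hadoop implementation. */
final class LoginUserModel {

    /** Stand-in for a UGI. Assumption: equality is object identity. */
    static final class Ugi {
        final String principal;

        Ugi(String principal) {
            this.principal = principal;
        }
        // equals() is deliberately not overridden, so each login is distinct.
    }

    // Models the static loginUserRef field mentioned in the report.
    static final AtomicReference<Ugi> LOGIN_USER_REF = new AtomicReference<>();

    /** Models a keytab login that overwrites the process-wide login user. */
    static Ugi loginFromKeytab(String principal) {
        Ugi ugi = new Ugi(principal);
        LOGIN_USER_REF.set(ugi);
        return ugi;
    }

    /** Models only the loginUser.equals(currentUser) part of the check. */
    static boolean shouldRelogin(Ugi currentUser) {
        Ugi loginUser = LOGIN_USER_REF.get();
        return loginUser != null && loginUser.equals(currentUser);
    }

    public static void main(String[] args) {
        // First login at startup: UgiA is the login user and is handed to RPC.
        Ugi ugiA = loginFromKeytab("om/host@EXAMPLE.COM");
        System.out.println("UgiA before second login: " + shouldRelogin(ugiA));

        // Second login (as done by the Ranger client in the report): UgiB
        // replaces UgiA as the login user, although the principal is the same.
        Ugi ugiB = loginFromKeytab("om/host@EXAMPLE.COM");
        System.out.println("UgiA after second login: " + shouldRelogin(ugiA));
        System.out.println("UgiB after second login: " + shouldRelogin(ugiB));
    }
}
```

In this model the check passes for UgiA only until the second login happens.
Afterwards it passes for UgiB alone, so connections still running as UgiA are
never eligible for relogin.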