[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17463589#comment-17463589 ]

Surendra Singh Lilhore edited comment on HADOOP-17996 at 12/22/21, 5:40 AM:

[~Sushma_28] and [~prabhujoseph], it looks like this patch is trying to handle two scenarios:
# Set the last login time after re-login in *UserGroupInformation#unprotectedRelogin()*.
# Handle re-login in the Server when the client and server run in the same JVM and a client re-login attempt fails, which also impacts the server.

#1 is not required: a configuration is already available if you want to reduce the time.

#2 is a different scenario, and I tried reproducing it by adding some extra code to the namenode. I added a new thread that logs out 2 minutes after the namenode starts and logs in again after waiting another 2 minutes.
{code:java}
// Test-only thread added to the namenode: log out, wait, then log in again.
new Thread() {
  @Override
  public void run() {
    try {
      LOG.info("Logout from UGI");
      Thread.sleep(2 * 60 * 1000); // wait 2 minutes after namenode start
      UserGroupInformation.getLoginUser().getLogin().logout();
      LOG.info("Waiting for 2 min");
      Thread.sleep(2 * 60 * 1000); // stay logged out for 2 minutes
      LOG.info("Login again");
      UserGroupInformation.getLoginUser().getLogin().login();
      LOG.info("Relogin success..");
    } catch (LoginException | IOException | InterruptedException e) {
      LOG.error("Failed log out thread ", e);
    }
  }
}.start();
{code}
During those 2 minutes the namenode could not handle any client operation and kept printing the exception below.
{code:java}
Auth failed for x.x.x.x:42199:null (GSS initiate failed) with true cause: (GSS initiate failed)
Auth failed for x.x.x.x:42199:null (GSS initiate failed) with true cause: (GSS initiate failed)
Auth failed for x.x.x.x:42199:null (GSS initiate failed) with true cause: (GSS initiate failed)
{code}
I suggest raising a new Jira to handle server-side re-login and closing this one as Invalid.

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
> --
>
> Key: HADOOP-17996
> URL: https://issues.apache.org/jira/browse/HADOOP-17996
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 3.3.1
> Reporter: Prabhu Joseph
> Assignee: Ravuri Sushma sree
> Priority: Major
> Attachments: HADOOP-17996.001.patch
>
>
> UserGroupInformation#unprotectedRelogin sets the last login time before logging in. IPC#Client does reloginFromKeytab when there is a connection reset failure from AD, which logs out, sets the last login time to now, and then tries to log in. The login also fails because it cannot connect to AD. Re-attempts then do not happen because the kerberosMinSecondsBeforeRelogin check fails. All Client and Server operations fail with *GSS initiate failed*.
> {code}
> 2021-10-31 09:50:53,546 WARN ha.EditLogTailer - Unable to trigger a roll of the active NN
> java.util.concurrent.ExecutionException:
>
[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448161#comment-17448161 ]

Surendra Singh Lilhore edited comment on HADOOP-17996 at 11/23/21, 5:52 PM:

>> Yes it can be workaround by setting re-login attempt time to a lower value. Every user has to modify this value after facing this issue. Instead this patch improves that by reattempting if a previous login failed.

This is not a workaround. This property was added to avoid load on the KDC server. If you feel your clusters do not put much load on the KDC, change the default value to 0. Changing it to 0 is the same as your patch.

>> This Jira is an improvement. Do you see any problem/impact with this patch.

Yes, it will impact the KDC server where the KDC is shared by multiple clusters. All the processes will start re-login immediately and the load will increase.

>> Don't we immediately login into our laptop if the previous login failed?

That is a single-user scenario, not a distributed system. :)

was (Author: surendrasingh):

>> Yes it can be workaround by setting re-login attempt time to a lower value. Every user has to modify this value after facing this issue. Instead this patch improves that by reattempting if a previous login failed.

This is not a workaround. This property was added to avoid load on the KDC server. If you feel your clusters do not put much load on the KDC, change the default value to 0. Changing it to 0 is the same as your patch.

>> This Jira is an improvement. Do you see any problem/impact with this patch.

Yes, it will impact the KDC server where is shared by multiple clusters. All the processes will start re-login immediately and the load will increase.

>> Don't we immediately login into our laptop if the previous login failed?

That is a single-user scenario, not a distributed system. :)

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
> --
>
> Key: HADOOP-17996
> URL: https://issues.apache.org/jira/browse/HADOOP-17996
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 3.3.1
> Reporter: Prabhu Joseph
> Assignee: Ravuri Sushma sree
> Priority: Major
> Attachments: HADOOP-17996.001.patch
>
>
> UserGroupInformation#unprotectedRelogin sets the last login time before logging in. IPC#Client does reloginFromKeytab when there is a connection reset failure from AD, which logs out, sets the last login time to now, and then tries to log in. The login also fails because it cannot connect to AD. Re-attempts then do not happen because the kerberosMinSecondsBeforeRelogin check fails. All Client and Server operations fail with *GSS initiate failed*.
> {code}
> 2021-10-31 09:50:53,546 WARN ha.EditLogTailer - Unable to trigger a roll of the active NN
> java.util.concurrent.ExecutionException: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort namenode1/1.2.3.4:0. Failed on local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: nn/nameno...@example.com javax.security.auth.login.LoginException: Connection reset
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:382)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:441)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:410)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1712)
>   at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> Caused by: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort namenode1/1.2.3.4:0. Failed on local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: nn/nameno...@example.com javax.security.auth.login.LoginException: Connection reset
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at
[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448083#comment-17448083 ]

Prabhu Joseph edited comment on HADOOP-17996 at 11/23/21, 3:49 PM:

[~surendralilhore] The issue in the existing code is that if a re-login fails for some reason, further re-login attempts are skipped for the next configured re-login attempt interval. Yes, it can be worked around by setting the re-login attempt time to a lower value, but every user has to modify this value after hitting the issue. This patch instead improves the behavior by re-attempting when the previous login failed. Don't we immediately log in to our laptop again if the previous login failed? Do we wait for a configured re-login attempt time after every login failure? If so, what is the use of waiting for that period if you are sure you have the correct credentials?

>> One question here, even after 60s second login was not successful ? Is this going in unnecessary loop ?

It will be successful if AD is available. But for 60s the HDFS service is unavailable: all IPC Server and Client operations fail with *GSS initiate failed*.

This Jira is an improvement. Do you see any problem or impact with this patch?

was (Author: prabhu joseph):

[~surendralilhore] The issue in the existing code is that if a re-login fails for some reason, further re-login attempts are skipped for the next configured re-login attempt interval. Yes, it can be worked around by setting the re-login attempt time to a lower value, but every user has to modify this value after hitting the issue. This patch instead improves the behavior by re-attempting when the previous login failed. Don't we immediately log in to our laptop again if the previous login failed? Do we wait for a configured re-login attempt time after every login failure? If so, what is the use of waiting for that period?

>> One question here, even after 60s second login was not successful ? Is this going in unnecessary loop ?

It will be successful if AD is available. But for 60s the HDFS service is unavailable: all IPC Server and Client operations fail with *GSS initiate failed*.

This Jira is an improvement. Do you see any problem or impact with this patch?

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
> --
>
> Key: HADOOP-17996
> URL: https://issues.apache.org/jira/browse/HADOOP-17996
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 3.3.1
> Reporter: Prabhu Joseph
> Assignee: Ravuri Sushma sree
> Priority: Major
> Attachments: HADOOP-17996.001.patch
>
>
> UserGroupInformation#unprotectedRelogin sets the last login time before logging in. IPC#Client does reloginFromKeytab when there is a connection reset failure from AD, which logs out, sets the last login time to now, and then tries to log in. The login also fails because it cannot connect to AD. Re-attempts then do not happen because the kerberosMinSecondsBeforeRelogin check fails. All Client and Server operations fail with *GSS initiate failed*.
> {code}
> 2021-10-31 09:50:53,546 WARN ha.EditLogTailer - Unable to trigger a roll of the active NN
> java.util.concurrent.ExecutionException: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort namenode1/1.2.3.4:0. Failed on local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: nn/nameno...@example.com javax.security.auth.login.LoginException: Connection reset
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:382)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:441)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:410)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1712)
>   at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> Caused by: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort
[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447931#comment-17447931 ]

Surendra Singh Lilhore edited comment on HADOOP-17996 at 11/23/21, 11:18 AM:

[~Sushma_28], the last login time is not the time of the last successful login; it only indicates when a login was attempted. So I don't think setting it after login makes sense. HADOOP-7930 allows you to change the re-login attempt time if you need to; by default it is 60 sec.

One question here: even after 60s, was the login still not successful? Is this going into an unnecessary loop?

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
> --
>
> Key: HADOOP-17996
> URL: https://issues.apache.org/jira/browse/HADOOP-17996
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Affects Versions: 3.3.1
> Reporter: Prabhu Joseph
> Assignee: Ravuri Sushma sree
> Priority: Major
> Attachments: HADOOP-17996.001.patch
>
>
> UserGroupInformation#unprotectedRelogin sets the last login time before logging in. IPC#Client does reloginFromKeytab when there is a connection reset failure from AD, which logs out, sets the last login time to now, and then tries to log in. The login also fails because it cannot connect to AD. Re-attempts then do not happen because the kerberosMinSecondsBeforeRelogin check fails. All Client and Server operations fail with *GSS initiate failed*.
> {code}
> 2021-10-31 09:50:53,546 WARN ha.EditLogTailer - Unable to trigger a roll of the active NN
> java.util.concurrent.ExecutionException: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort namenode1/1.2.3.4:0. Failed on local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: nn/nameno...@example.com javax.security.auth.login.LoginException: Connection reset
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:382)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:441)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:410)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1712)
>   at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:480)
>   at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> Caused by: org.apache.hadoop.security.KerberosAuthException: DestHost:destPort namenode0:8020 , LocalHost:localPort namenode1/1.2.3.4:0. Failed on local exception: org.apache.hadoop.security.KerberosAuthException: Login failure for user: nn/nameno...@example.com javax.security.auth.login.LoginException: Connection reset
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1443)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1353)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>   at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
>   at
[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444021#comment-17444021 ]

Prabhu Joseph edited comment on HADOOP-17996 at 11/15/21, 6:18 PM:

Thanks [~brahmareddy] for reviewing the patch.
{quote}this was just to track the re-login attempt so that so many retries can be avoided.?
{quote}
There are two issues the patch addresses.

1. When IPC#Client fails during {{saslConnect}}, it does a re-login from {{handleSaslConnectionFailure}}. The re-login sets the last login time to the current time irrespective of the login status, followed by logout and then login. When the login fails for some reason, such as an intermittent issue connecting to AD, all subsequent Client and Server operations fail with GSS Initiate Failed for the next configured {{kerberosMinSecondsBeforeRelogin}} (60 seconds).
{code:java}
// try re-login
if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}
{code}
This issue is addressed by setting the last login time to the current time only after the login succeeds.

2. Currently the re-login happens only from IPC#Client during {{handleSaslConnectionFailure()}}. We have observed cases where the Client has logged out and failed to log back in, leading to all IPC#Server operations failing in {{processSaslMessage}} with the error below.
{code:java}
2021-11-02 13:28:08,750 WARN ipc.Server - Auth failed for 10.25.35.45:37849:null (GSS initiate failed) with true cause: (GSS initiate failed)
2021-11-02 13:28:08,767 WARN ipc.Server - Auth failed for 10.25.35.46:35919:null (GSS initiate failed) with true cause: (GSS initiate failed)
{code}
This patch adds re-login from the Server side as well, on any authentication failure.
{quote}Configuring kerberosMinSecondsBeforeRelogin with low value will not work here if it's needed.?
{quote}
This will work around the first issue.
{quote}After this fix , on failure it will continuously retry..?
{quote}
IPC#Client does a re-login on connection failure. This patch adds the same at the IPC#Server side. Retries are based on the retry mechanism of IPC#Client and IPC#Server. A real Kerberos login will happen on every retry from IPC#Client and IPC#Server until the login succeeds.

was (Author: prabhu joseph):

Thanks [~brahmareddy] for reviewing the patch.
{quote}this was just to track the re-login attempt so that so many retries can be avoided.?
{quote}
There are two issues the patch tries to address.

1. When IPC#Client fails during {{saslConnect}}, it does a re-login from {{handleSaslConnectionFailure}}. The re-login sets the last login time to the current time irrespective of the login status, followed by logout and then login. When the login fails for some reason, such as an intermittent issue connecting to AD, all subsequent Client and Server operations fail with GSS Initiate Failed for the next configured {{kerberosMinSecondsBeforeRelogin}} (60 seconds).
{code:java}
// try re-login
if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}
{code}
This issue is addressed by setting the last login time to the current time only after the login succeeds.

2. Currently the re-login happens only from IPC#Client during {{handleSaslConnectionFailure()}}. We have observed cases where the Client has logged out and failed to log back in, leading to all IPC#Server operations failing in {{processSaslMessage}} with the error below.
{code:java}
2021-11-02 13:28:08,750 WARN ipc.Server - Auth failed for 10.25.35.45:37849:null (GSS initiate failed) with true cause: (GSS initiate failed)
2021-11-02 13:28:08,767 WARN ipc.Server - Auth failed for 10.25.35.46:35919:null (GSS initiate failed) with true cause: (GSS initiate failed)
{code}
This patch adds re-login from the Server side as well, on any authentication failure.

bq. Configuring kerberosMinSecondsBeforeRelogin with low value will not work here if it's needed.?

This will work around the first issue.

bq. After this fix , on failure it will continuously retry..?

IPC#Client does a re-login on connection failure. This patch adds the same at the IPC#Server side. Retries are based on the retry mechanism of IPC#Client and IPC#Server. A real Kerberos login will happen on every retry from IPC#Client and IPC#Server until the login succeeds.

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
>
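
The first issue described in the comment above comes down to where the timestamp is written relative to the login call. The following is a minimal, self-contained sketch of that ordering, not Hadoop code: the class `ReloginThrottleSketch`, its methods, and the 1-second retry cadence are hypothetical, and only the 60-second minimum interval is taken from the thread. It models a caller that retries once per second while the KDC/AD is unreachable for the first few seconds.

```java
public class ReloginThrottleSketch {
    // Stand-in for the 60 s kerberosMinSecondsBeforeRelogin value quoted in the thread.
    static final long MIN_MS_BEFORE_RELOGIN = 60_000L;

    private final boolean stampBeforeLogin; // true: stamp on attempt; false: stamp on success only
    private long lastLoginMs = 0L;          // 0 means no login has been recorded yet
    private int kdcCalls = 0;               // number of real login attempts sent to the KDC

    public ReloginThrottleSketch(boolean stampBeforeLogin) {
        this.stampBeforeLogin = stampBeforeLogin;
    }

    /** Returns true only if a real login was attempted and it succeeded. */
    public boolean relogin(long nowMs, boolean kdcReachable) {
        if (lastLoginMs != 0L && nowMs - lastLoginMs < MIN_MS_BEFORE_RELOGIN) {
            return false; // throttled: too soon after the recorded login time
        }
        if (stampBeforeLogin) {
            lastLoginMs = nowMs; // recorded even if the login below fails
        }
        kdcCalls++;
        if (!kdcReachable) {
            return false; // login failed, e.g. connection reset
        }
        if (!stampBeforeLogin) {
            lastLoginMs = nowMs; // recorded only after a successful login
        }
        return true;
    }

    public int kdcCalls() {
        return kdcCalls;
    }

    /** Retries every stepMs starting at t = stepMs; returns the time of the first success, or -1. */
    public static long firstSuccessMs(boolean stampBeforeLogin, long kdcUpAtMs,
                                      long stepMs, long horizonMs) {
        ReloginThrottleSketch ugi = new ReloginThrottleSketch(stampBeforeLogin);
        for (long t = stepMs; t <= horizonMs; t += stepMs) {
            if (ugi.relogin(t, t >= kdcUpAtMs)) {
                return t;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // KDC unreachable until t = 5 s; caller retries once per second.
        System.out.println(firstSuccessMs(true, 5_000L, 1_000L, 120_000L));  // 61000
        System.out.println(firstSuccessMs(false, 5_000L, 1_000L, 120_000L)); // 5000
    }
}
```

In this model, stamping before the login leaves the caller locked out until the 60-second window expires even though the KDC came back after 5 seconds, while stamping after success recovers at the first retry that can reach the KDC. The price is that every failed retry in the second variant is a real KDC call, which is the load concern raised elsewhere in this thread.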
[jira] [Comment Edited] (HADOOP-17996) UserGroupInformation#unprotectedRelogin sets the last login time before logging in
[ https://issues.apache.org/jira/browse/HADOOP-17996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444021#comment-17444021 ]

Prabhu Joseph edited comment on HADOOP-17996 at 11/15/21, 6:17 PM:

Thanks [~brahmareddy] for reviewing the patch.
{quote}this was just to track the re-login attempt so that so many retries can be avoided.?
{quote}
There are two issues the patch tries to address.

1. When IPC#Client fails during {{saslConnect}}, it does a re-login from {{handleSaslConnectionFailure}}. The re-login sets the last login time to the current time irrespective of the login status, followed by logout and then login. When the login fails for some reason, such as an intermittent issue connecting to AD, all subsequent Client and Server operations fail with GSS Initiate Failed for the next configured {{kerberosMinSecondsBeforeRelogin}} (60 seconds).
{code:java}
// try re-login
if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}
{code}
This issue is addressed by setting the last login time to the current time only after the login succeeds.

2. Currently the re-login happens only from IPC#Client during {{handleSaslConnectionFailure()}}. We have observed cases where the Client has logged out and failed to log back in, leading to all IPC#Server operations failing in {{processSaslMessage}} with the error below.
{code:java}
2021-11-02 13:28:08,750 WARN ipc.Server - Auth failed for 10.25.35.45:37849:null (GSS initiate failed) with true cause: (GSS initiate failed)
2021-11-02 13:28:08,767 WARN ipc.Server - Auth failed for 10.25.35.46:35919:null (GSS initiate failed) with true cause: (GSS initiate failed)
{code}
This patch adds re-login from the Server side as well, on any authentication failure.

bq. Configuring kerberosMinSecondsBeforeRelogin with low value will not work here if it's needed.?

This will work around the first issue.

bq. After this fix , on failure it will continuously retry..?

IPC#Client does a re-login on connection failure. This patch adds the same at the IPC#Server side. Retries are based on the retry mechanism of IPC#Client and IPC#Server. A real Kerberos login will happen on every retry from IPC#Client and IPC#Server until the login succeeds.

was (Author: prabhu joseph):

Thanks [~brahmareddy] for reviewing the patch.
{quote}this was just to track the re-login attempt so that so many retries can be avoided.?
{quote}
There are two issues the patch tries to address.

1. When IPC#Client fails during {{saslConnect}}, it does a re-login from {{handleSaslConnectionFailure}}. The re-login sets the last login time to the current time irrespective of the login status, followed by logout and then login. When the login fails for some reason, such as an intermittent issue connecting to AD, all subsequent Client and Server operations fail with GSS Initiate Failed for the next configured {{kerberosMinSecondsBeforeRelogin}} (60 seconds).
{code:java}
// try re-login
if (UserGroupInformation.isLoginKeytabBased()) {
  UserGroupInformation.getLoginUser().reloginFromKeytab();
} else if (UserGroupInformation.isLoginTicketBased()) {
  UserGroupInformation.getLoginUser().reloginFromTicketCache();
}
{code}
This issue is addressed by setting the last login time to the current time only after the login succeeds.

2. Currently the re-login happens only from IPC#Client during {{handleSaslConnectionFailure()}}. We have observed cases where the Client has logged out and failed to log back in, leading to all IPC#Server operations failing in {{processSaslMessage}} with the error below.
{code:java}
2021-11-02 13:28:08,750 WARN ipc.Server - Auth failed for 10.25.35.45:37849:null (GSS initiate failed) with true cause: (GSS initiate failed)
2021-11-02 13:28:08,767 WARN ipc.Server - Auth failed for 10.25.35.46:35919:null (GSS initiate failed) with true cause: (GSS initiate failed)
{code}
This patch adds re-login from the Server side as well, on any authentication failure.

bq. Configuring kerberosMinSecondsBeforeRelogin with low value will not work here if it's needed.?

This will work around the first issue.
{quote}
{quote}After this fix , on failure it will continuously retry..?
{quote}
IPC#Client does a re-login on connection failure. This patch adds the same at the IPC#Server side. Retries are based on the retry mechanism of IPC#Client and IPC#Server. A real Kerberos login will happen on every retry from IPC#Client and IPC#Server until the login succeeds.

> UserGroupInformation#unprotectedRelogin sets the last login time before logging in
>
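
The last answer above says a real Kerberos login happens on every retry until the login succeeds, and an earlier comment in this thread worries about the resulting load on a shared KDC. The sketch below is a rough, hypothetical model of that trade-off, not a measurement: `KdcLoadSketch`, the process count, the outage length, and the retry cadence are all assumptions chosen for illustration. It counts how many login requests reach the KDC during an outage when each process throttles by attempt time with a given minimum interval.

```java
public class KdcLoadSketch {
    /**
     * Counts login requests sent to the KDC during an outage.
     * Each process retries every retryEveryMs and skips a retry when the
     * previous attempt was less than minMsBeforeRelogin ago.
     */
    public static long kdcRequests(int processes, long outageMs,
                                   long retryEveryMs, long minMsBeforeRelogin) {
        long total = 0L;
        for (int p = 0; p < processes; p++) {
            long lastAttemptMs = -1L; // -1 means no attempt has been made yet
            for (long t = 0L; t < outageMs; t += retryEveryMs) {
                boolean throttled = lastAttemptMs >= 0L
                        && t - lastAttemptMs < minMsBeforeRelogin;
                if (!throttled) {
                    total++;           // a real login request reaches the KDC
                    lastAttemptMs = t; // attempt time is recorded regardless of outcome
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical: 1000 processes, 60 s outage, one retry per second each.
        System.out.println(kdcRequests(1000, 60_000L, 1_000L, 60_000L)); // 1000
        System.out.println(kdcRequests(1000, 60_000L, 1_000L, 0L));      // 60000
    }
}
```

Under these assumed numbers, a 60-second minimum interval limits each process to one request during the outage, while an interval of 0 (or retrying after every failure) lets every retry through. Whether that extra load matters depends on how many clusters share the KDC, which is the point of disagreement in the thread.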