[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x

2022-05-24 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-16165:

Target Version/s: 2.10.3  (was: 2.10.2)

> Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x
> --
>
> Key: HDFS-16165
> URL: https://issues.apache.org/jira/browse/HDFS-16165
> Project: Hadoop HDFS
>  Issue Type: Wish
> Environment: Can be reproduced in docker HDFS environment with 
> Kerberos 
> https://github.com/vdesabou/kafka-docker-playground/blob/93a93de293ad2f9bb22afb244f2d8729a178296e/connect/connect-hdfs2-sink/hdfs2-sink-ha-kerberos-repro-gss-exception.sh
>Reporter: Daniel Osvath
>Priority: Major
>  Labels: Confluent
>
> *Problem Description*
> For more than a year, Apache Kafka Connect users have been running into a 
> Kerberos renewal issue that causes our HDFS2 connectors to fail. 
> We have been able to consistently reproduce the issue under high load with 40 
> connectors (threads) that use the library. When we try an alternate 
> workaround that uses the Kerberos keytab on the system, the connector operates 
> without issues.
> We identified the root cause as a race condition bug in the Hadoop 2.x 
> library that causes the ticket renewal to fail with the error below: 
> {code:java}
> Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>  at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> {code}
> We reached this conclusion once we tried the same environment 
> (40 connectors) with Hadoop 3.x and our HDFS3 connectors, which operated 
> without renewal issues. The fact that the synchronization issue has been 
> fixed in the newer Hadoop 3.x releases further confirmed our hypothesis 
> about the root cause.
> There are many changes in the Hadoop 3.x 
> [UserGroupInformation.java|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java]
>  related to UGI synchronization, made as part of 
> https://issues.apache.org/jira/browse/HADOOP-9747. Those changes suggest 
> that race conditions were occurring in the older versions, i.e. Hadoop 2.x, 
> which would explain why we can reproduce the problem with HDFS2.
> For example (among others):
> {code:java}
>   private void relogin(HadoopLoginContext login, boolean ignoreLastLoginTime)
>   throws IOException {
> // ensure the relogin is atomic to avoid leaving credentials in an
> // inconsistent state.  prevents other ugi instances, SASL, and SPNEGO
> // from accessing or altering credentials during the relogin.
> synchronized(login.getSubjectLock()) {
>   // another racing thread may have beat us to the relogin.
>   if (login == getLogin()) {
> unprotectedRelogin(login, ignoreLastLoginTime);
>   }
> }
>   }
> {code}
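> For readers outside the Hadoop codebase, the double-checked pattern quoted above can be sketched in isolation. All names below are hypothetical stand-ins, not the real UserGroupInformation API: racing threads re-check the login state under the lock so that only one of them actually performs the relogin.

```java
// Minimal sketch (hypothetical names) of the double-checked relogin pattern:
// each caller records the login "generation" it observed, then re-checks it
// under the lock, so a thread that lost the race skips the redundant relogin.
public class ReloginSketch {
    private final Object subjectLock = new Object();
    private long loginGeneration = 0;

    // Returns true only for the thread that actually performed the relogin;
    // a racing caller holding a stale generation returns false and does nothing.
    public boolean relogin(long observedGeneration) {
        synchronized (subjectLock) {
            // Another racing thread may have beaten us to the relogin.
            if (observedGeneration != loginGeneration) {
                return false;
            }
            loginGeneration++; // stands in for the actual credential refresh
            return true;
        }
    }

    public long currentGeneration() {
        synchronized (subjectLock) {
            return loginGeneration;
        }
    }
}
```

> Without the re-check inside the synchronized block, two threads that both observed an expired ticket would each perform a relogin, and the second could clobber the credentials the first had just installed.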
> None of those changes were backported to Hadoop 2.x (our HDFS2 connector uses 
> 2.10.1), on which several CDH distributions are based. 
> *Request*
> We would like to ask for the synchronization fix to be backported to Hadoop 
> 2.x so that our users can operate without issues. 
> *Impact*
> The older Hadoop 2.x version is used by our HDFS connector, which is used in 
> production by our community. Currently, the issue causes our HDFS connector 
> to fail, as it is unable to recover and renew the ticket at a later point. 
> Having the backported fix would allow our users to operate without issues 
> that require manual intervention every week (or every few days in some cases). 
> The only workaround available to the community is to run a command or 
> restart their workers. 
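> For reference, the manual workaround amounts to refreshing the ticket cache by hand. The principal and keytab path below are illustrative only, not taken from the report:

```shell
# Illustrative only: principal and keytab path are hypothetical.
# Refresh the ticket cache from a keytab so the worker can authenticate again.
kinit -kt /etc/security/keytabs/connect.keytab connect/worker1@EXAMPLE.COM
# Verify the renewed ticket.
klist
```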



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x

2021-08-29 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-16165:
-
Target Version/s: 2.10.2



[jira] [Updated] (HDFS-16165) Backport the Hadoop 3.x Kerberos synchronization fix to Hadoop 2.x

2021-08-12 Thread Daniel Osvath (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Osvath updated HDFS-16165:
-
Labels: Confluent  (was: )
