[ 
https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengbing Liu updated HDFS-7798:
--------------------------------
    Description: 
We have observed occasional checkpointing failures on our real cluster: the 
standby NameNode was unable to upload the image to the active NameNode.

After some digging, the root cause appears to be a shared 
{{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is 
designed as a use-once instance and is not stateless: it has attributes such 
as {{HttpURLConnection}} and {{URL}}. When multiple threads call 
{{URLConnectionFactory#openConnection(...)}}, they race on the shared 
authenticator, and the image upload fails.
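
To illustrate the hazard, here is a minimal sketch. {{StatefulAuthenticator}} 
and {{SharedAuthenticatorDemo}} are hypothetical stand-ins, not Hadoop classes, 
and a {{URI}} field stands in for the {{URL}} and {{HttpURLConnection}} 
attributes. The interleaving is replayed on a single thread so that the outcome 
is deterministic; with real threads the same overwrite would happen only 
occasionally.

```java
import java.net.URI;

/** Hypothetical use-once authenticator that keeps per-request state in a field. */
class StatefulAuthenticator {
    private URI target; // stand-in for the URL / HttpURLConnection attributes

    /** First half of an authentication exchange: remember the target. */
    void begin(URI target) {
        this.target = target;
    }

    /** Second half: act on whatever target is currently stored. */
    URI finish() {
        return target;
    }
}

class SharedAuthenticatorDemo {
    /** Replays the order A.begin, B.begin, A.finish on one shared instance. */
    static URI interleave(StatefulAuthenticator shared, URI a, URI b) {
        shared.begin(a);        // caller A starts its exchange
        shared.begin(b);        // caller B overwrites A's state
        return shared.finish(); // caller A finishes against B's target
    }

    public static void main(String[] args) {
        URI a = URI.create("http://active.example.com/upload-a");
        URI b = URI.create("http://active.example.com/upload-b");
        URI seen = interleave(new StatefulAuthenticator(), a, b);
        System.out.println("caller A finished against: " + seen);
    }
}
```

In this sketch caller A ends up holding caller B's target, which is the kind of 
mix-up that could plausibly make an upload fail.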

Therefore, as a first step that does not break the current API, I propose we 
create a new {{KerberosAuthenticator}} instance for each connection, so that 
checkpointing works. Afterwards, we may consider making the {{Authenticator}} 
design and implementation stateless, as {{ConnectionConfigurator}} already is.
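
A rough sketch of the per-connection idea follows. All names in it 
({{SketchAuthenticator}}, {{UseOnceAuthenticator}}, {{PerConnectionFactory}}) 
are hypothetical and only illustrate the shape of the change; this is not the 
actual patch and not the real {{URLConnectionFactory}} API. The assumption is 
that the factory can hold a supplier of authenticators instead of one shared 
instance.

```java
import java.util.function.Supplier;

/** Hypothetical authenticator interface; a stand-in, not Hadoop's Authenticator. */
interface SketchAuthenticator {
    String authenticate(String target);
}

/** Use-once implementation that keeps per-request state in a field. */
class UseOnceAuthenticator implements SketchAuthenticator {
    private String target;

    @Override
    public String authenticate(String target) {
        this.target = target; // safe only because the instance is never shared
        return "authenticated:" + this.target;
    }
}

/** Hypothetical factory that asks for a fresh authenticator on every call. */
class PerConnectionFactory {
    private final Supplier<SketchAuthenticator> authenticators;

    PerConnectionFactory(Supplier<SketchAuthenticator> authenticators) {
        this.authenticators = authenticators;
    }

    /** Returns a new instance each time, so no state is shared between callers. */
    SketchAuthenticator newAuthenticator() {
        return authenticators.get();
    }

    /** Stand-in for openConnection(...): authenticates with a private instance. */
    String openConnection(String target) {
        return newAuthenticator().authenticate(target);
    }
}

class PerConnectionDemo {
    public static void main(String[] args) {
        PerConnectionFactory factory = new PerConnectionFactory(UseOnceAuthenticator::new);
        System.out.println(factory.openConnection("upload-a"));
        System.out.println(factory.openConnection("upload-b"));
    }
}
```

The public method signature stays the same in this sketch; only the lifetime of 
the authenticator changes, which is why the approach should not break callers.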

  was:
We have observed in our real cluster occasionally checkpointing failure. The 
standby NameNode was not able to upload image to the active NameNode.

After some digging, the root cause appears to be a shared 
{{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is 
designed as a use-once instance, and is not stateless. It has attributes such 
as {{HttpURLConnection}} and {{URL}}. When multiple threads are calling 
{{URLConnectionFactory#openConnection(...)}}, the shared authenticator is going 
to have race condition, resulting in a failed image uploading.

Therefore for the first step, without breaking the current API, I propose we 
create a new {{KerberosAuthenticator}} instance for each connection, to make 
checkpointing work. We may consider making {{Authenticator}} design and 
implementation stateless afterwards, as {{ConnectionConfigurator}} does.


> Checkpointing failure caused by shared KerberosAuthenticator
> ------------------------------------------------------------
>
>                 Key: HDFS-7798
>                 URL: https://issues.apache.org/jira/browse/HDFS-7798
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: security
>            Reporter: Chengbing Liu
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
