[
https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chengbing Liu updated HDFS-7798:
--------------------------------
Attachment: HDFS-7798.01.patch
> Checkpointing failure caused by shared KerberosAuthenticator
> ------------------------------------------------------------
>
> Key: HDFS-7798
> URL: https://issues.apache.org/jira/browse/HDFS-7798
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: security
> Reporter: Chengbing Liu
> Priority: Critical
> Attachments: HDFS-7798.01.patch
>
>
> We have observed occasional checkpointing failures in our real cluster. The
> standby NameNode was not able to upload the image to the active NameNode.
> After some digging, the root cause appears to be a shared
> {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is
> designed as a use-once instance and is not stateless: it has attributes such
> as {{HttpURLConnection}} and {{URL}}. When multiple threads call
> {{URLConnectionFactory#openConnection(...)}} concurrently, they race on the
> shared authenticator's state, which results in a failed image upload.
> Therefore, as a first step that does not break the current API, I propose we
> create a new {{KerberosAuthenticator}} instance for each connection so that
> checkpointing works. Afterwards, we may consider making the {{Authenticator}}
> design and implementation stateless, as {{ConnectionConfigurator}} is.
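
The proposal above can be illustrated with a minimal, self-contained sketch.
It does not use the Hadoop classes: StatefulAuthenticator and
SketchConnectionFactory are hypothetical stand-ins, and the host name is a
placeholder. It only shows the idea of giving each connection its own
authenticator, so that per-request state is never shared between threads; it
is not the content of the attached patch.

```java
import java.net.URI;
import java.net.URL;
import java.util.function.Supplier;

public class PerConnectionAuthenticatorSketch {

    /** Hypothetical stand-in for a stateful, use-once authenticator. */
    static final class StatefulAuthenticator {
        // Per-request state: if one instance were shared, a second thread
        // could overwrite this field between the write and the read below.
        private URL url;

        String authenticate(URL target) {
            this.url = target;                        // remember the request
            // ... a real authentication handshake would happen here ...
            return "token-for-" + this.url.getHost(); // read the state back
        }
    }

    /** Hypothetical factory that never reuses an authenticator. */
    static final class SketchConnectionFactory {
        private final Supplier<StatefulAuthenticator> authenticators;

        SketchConnectionFactory(Supplier<StatefulAuthenticator> authenticators) {
            this.authenticators = authenticators;
        }

        String openConnection(URL target) {
            // A fresh instance per call keeps the mutable state thread-local.
            StatefulAuthenticator auth = authenticators.get();
            return auth.authenticate(target);
        }
    }

    public static void main(String[] args) throws Exception {
        SketchConnectionFactory factory =
                new SketchConnectionFactory(StatefulAuthenticator::new);
        URL target = URI.create("http://nn1.example.com/upload").toURL();
        System.out.println(factory.openConnection(target));
    }
}
```

Passing a Supplier rather than an instance keeps the factory's public method
unchanged while making it impossible to share one authenticator by accident.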