Chengbing Liu created HDFS-7798:
-----------------------------------
Summary: Checkpointing failure caused by shared KerberosAuthenticator
Key: HDFS-7798
URL: https://issues.apache.org/jira/browse/HDFS-7798
Project: Hadoop HDFS
Issue Type: Bug
Components: security
Reporter: Chengbing Liu
Priority: Critical
We have occasionally observed checkpointing failures in our real cluster: the
standby NameNode was not able to upload the image to the active NameNode.
After some digging, the root cause appears to be a shared
{{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is
designed as a use-once instance and is not stateless: it has attributes such
as {{HttpURLConnection}} and {{URL}}. When multiple threads call
{{URLConnectionFactory#openConnection(...)}} concurrently, they race on the
shared authenticator's state, and the image upload can fail.
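To make the hazard concrete, here is a minimal sketch that replays the
problematic interleaving deterministically. It does not use the real Hadoop
classes: {{StatefulAuthenticator}}, its {{begin}}/{{finish}} methods, and the
example host names are hypothetical stand-ins, assumed only to mimic an
authenticator that keeps the target URL in a field between steps.

```java
import java.net.URI;

public class SharedAuthenticatorHazard {

    /** Hypothetical stand-in for a use-once authenticator (not the Hadoop class). */
    static final class StatefulAuthenticator {
        private URI url; // per-request state kept in an instance field

        /** First step: remember which URL is being authenticated. */
        void begin(URI url) {
            this.url = url;
        }

        /** Later step: reads the field back, so it sees whatever was stored last. */
        URI finish() {
            return url;
        }
    }

    /** Replays two callers interleaving on one shared instance. */
    static URI targetSeenByFirstCaller(URI first, URI second) {
        StatefulAuthenticator shared = new StatefulAuthenticator();
        shared.begin(first);    // caller 1 starts
        shared.begin(second);   // caller 2 starts before caller 1 has finished
        return shared.finish(); // caller 1 finishes against caller 2's URL
    }

    public static void main(String[] args) {
        // Placeholder URLs, for illustration only.
        URI a = URI.create("http://nn1.example.com/upload");
        URI b = URI.create("http://nn2.example.com/upload");
        System.out.println(targetSeenByFirstCaller(a, b)); // prints http://nn2.example.com/upload
    }
}
```

In a real multi-threaded run the interleaving is nondeterministic, which would
be consistent with the failure showing up only occasionally.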
Therefore, as a first step that does not break the current API, I propose
that we create a new {{KerberosAuthenticator}} instance for each connection so
that checkpointing works. Afterwards, we may consider making the
{{Authenticator}} design and implementation stateless, the way
{{ConnectionConfigurator}} is.
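A rough sketch of the per-connection idea, under stated assumptions: the
{{Authenticator}} interface, {{StatefulAuthenticator}}, and
{{ConnectionFactory}} below are simplified, hypothetical types, not the Hadoop
API, and the URLs are placeholders. The only point is that the factory asks a
{{Supplier}} for a fresh authenticator on every call, so no mutable state is
shared between threads.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class PerConnectionAuthenticator {

    /** Hypothetical, simplified authenticator contract. */
    interface Authenticator {
        String authenticate(String url);
    }

    /** Keeps per-request state in a field, so one instance must serve one call. */
    static final class StatefulAuthenticator implements Authenticator {
        private String url;

        @Override
        public String authenticate(String url) {
            this.url = url;
            Thread.yield(); // give other threads a chance to run between the steps
            return "authenticated:" + this.url;
        }
    }

    /** Hypothetical factory that creates a new authenticator for every connection. */
    static final class ConnectionFactory {
        private final Supplier<Authenticator> newAuthenticator;

        ConnectionFactory(Supplier<Authenticator> newAuthenticator) {
            this.newAuthenticator = newAuthenticator;
        }

        String openConnection(String url) {
            // Fresh instance per call: its fields are visible to this thread only.
            return newAuthenticator.get().authenticate(url);
        }
    }

    /** Opens one connection per thread and checks each call saw its own URL. */
    static boolean allCallsSeeOwnUrl(int threads) {
        ConnectionFactory factory = new ConnectionFactory(StatefulAuthenticator::new);
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            String url = "http://nn.example.com/upload?id=" + i; // placeholder URL
            workers[i] = new Thread(() -> {
                if (!factory.openConnection(url).equals("authenticated:" + url)) {
                    ok.set(false);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok.get();
    }

    public static void main(String[] args) {
        System.out.println(allCallsSeeOwnUrl(8));
    }
}
```

Passing a {{Supplier}} rather than an instance keeps the caller-facing shape
of the factory unchanged, in the same spirit as a fix that does not break the
current API.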
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)