[
https://issues.apache.org/jira/browse/HDFS-5850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885974#comment-13885974
]
Daryn Sharp commented on HDFS-5850:
-----------------------------------
I'm not sure this issue affects 2.x. In 0.23, the client pre-constructs the
Kerberos service principal and caches it in the ConnectionId. All subsequent
connections use a cached Connection, which in turn reuses the cached principal
in the ConnectionId. Thus, if the principal is misconstructed, the client will
never recover.
RPCv9 in 2.x should recover. The client no longer pre-constructs and caches
the principal. Instead, it verifies the principal advertised by the server. If
a transient DNS resolution failure occurs, the _HOST substitution in the
service principal key will indeed yield a principal with an IP. The client
will reject the advertised principal because it doesn't match (IP vs.
hostname). However, subsequent connections will attempt to re-verify the
advertised principal, which involves a new DNS lookup. The client should
recover when DNS recovers.
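To make the difference concrete, here is a minimal sketch of the two
behaviors. It is a toy model, not Hadoop source: the class names, the
FlakyResolver stub, the example principal and the IP address are all
hypothetical, and the real clients do considerably more. It only assumes that
a failed reverse lookup falls back to the IP string, as described above.

```java
/**
 * Hypothetical model, not Hadoop source. Contrasts a client that caches the
 * substituted principal with one that re-derives it on every connection.
 */
final class PrincipalRecoverySketch {

    /** Stand-in for reverse DNS: falls back to the IP string on failure. */
    static final class FlakyResolver {
        private final String hostname;
        boolean healthy = true;

        FlakyResolver(String hostname) {
            this.hostname = hostname;
        }

        String resolve(String ip) {
            return healthy ? hostname : ip;
        }
    }

    /** Replaces the _HOST token in the configured principal pattern. */
    static String substituteHost(String pattern, String host) {
        return pattern.replace("_HOST", host);
    }

    /** 0.23-style: the principal is built once and reused for good. */
    static final class CachingClient {
        private final String cachedPrincipal;

        CachingClient(String pattern, String ip, FlakyResolver dns) {
            this.cachedPrincipal = substituteHost(pattern, dns.resolve(ip));
        }

        boolean connect(String serverPrincipal) {
            return cachedPrincipal.equals(serverPrincipal);
        }
    }

    /** 2.x-style: the advertised principal is checked with a fresh lookup. */
    static final class VerifyingClient {
        private final String pattern;
        private final String ip;
        private final FlakyResolver dns;

        VerifyingClient(String pattern, String ip, FlakyResolver dns) {
            this.pattern = pattern;
            this.ip = ip;
            this.dns = dns;
        }

        boolean connect(String advertisedPrincipal) {
            String expected = substituteHost(pattern, dns.resolve(ip));
            return expected.equals(advertisedPrincipal);
        }
    }

    public static void main(String[] args) {
        String pattern = "nn/_HOST@EXAMPLE.COM";            // hypothetical config
        String server = "nn/nn1.example.com@EXAMPLE.COM";   // hypothetical SPN
        FlakyResolver dns = new FlakyResolver("nn1.example.com");

        dns.healthy = false;                                 // transient DNS outage
        CachingClient old = new CachingClient(pattern, "10.0.0.1", dns);
        VerifyingClient v9 = new VerifyingClient(pattern, "10.0.0.1", dns);
        System.out.println(old.connect(server));             // false
        System.out.println(v9.connect(server));              // false

        dns.healthy = true;                                  // DNS recovers
        System.out.println(old.connect(server));             // false: stuck on IP
        System.out.println(v9.connect(server));              // true: recovered
    }
}
```

In this model the caching client stays broken after DNS comes back because
the lookup happened exactly once, while the verifying client recovers on its
next connection attempt.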
> DNS Issues during TrashEmptier initialization can silently leave it
> non-functional
> ----------------------------------------------------------------------------------
>
> Key: HDFS-5850
> URL: https://issues.apache.org/jira/browse/HDFS-5850
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.4.0
> Reporter: Kihwal Lee
> Priority: Critical
>
> [~knoguchi] recently noticed that the trash directories of a restarted
> cluster were not cleaned up. It turned out that it was caused by a transient
> DNS problem during initialization.
> The TrashEmptier thread in the namenode is actually a FileSystem client
> running in a loop, which makes RPC calls to itself in order to list, rename
> and delete trash files. In a secure setup, the client needs to create the
> right service principal name (SPN) for the namenode to make an RPC
> connection. If there is a DNS issue at that moment, the SPN ends up with the
> IP address, not the FQDN. Since the KDC does not recognize this SPN,
> TrashEmptier does not work from that point on. I verified that the
> TrashEmptier thread was requesting a service ticket from the KDC for the SPN
> containing the IP address.