[
https://issues.apache.org/jira/browse/HDFS-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860369#comment-13860369
]
Vincent Sheffer commented on HDFS-5677:
---------------------------------------
I've found a good place, and a condition to check, in *DFSUtil* for adding a
warning message. The check would run at startup, which is helpful if someone is
monitoring the logs at that point.
For concreteness, here is the relevant fragment from my hdfs-site.xml file:
{code}
...
<property>
  <name>dfs.nameservice.id</name>
  <value>myCluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.myCluster</name>
  <value>vince-1,vince2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myCluster.vince-1</name>
  <value>vince-1:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.myCluster.vince-1</name>
  <value>vince-1:8022</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myCluster.vince-2</name>
  <value>vince-2:8020</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.myCluster.vince-2</name>
  <value>vince-2:8022</value>
</property>
...
{code}
The relevant portion of the DFSUtil code is:
{code}
private static Map<String, InetSocketAddress> getAddressesForNameserviceId(
    Configuration conf, String nsId, String defaultValue,
    String[] keys) {
  Collection<String> nnIds = getNameNodeIds(conf, nsId);
  Map<String, InetSocketAddress> ret = Maps.newHashMap();
  for (String nnId : emptyAsSingletonNull(nnIds)) {
    String suffix = concatSuffixes(nsId, nnId);
    String address = getConfValue(defaultValue, suffix, conf, keys);
    if (address != null) {
      InetSocketAddress isa = NetUtils.createSocketAddr(address);
      ret.put(nnId, isa);
    }
  }
  return ret;
}
{code}
For my node with the missing hyphen (vince2), the resulting map entry for the
InetSocketAddress of vince2 will be *myCluster:8020*, which never resolves. And
even though I do have valid properties for vince-2, they are ignored because of
the typo.
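To illustrate the unresolved-address behavior in isolation, here is a minimal,
standalone sketch using plain java.net rather than Hadoop's NetUtils (it assumes
*myCluster* is not an actual resolvable hostname on the machine running it):

```java
import java.net.InetSocketAddress;

public class UnresolvedDemo {
    public static void main(String[] args) {
        // "myCluster" is a nameservice ID, not a real host, so name
        // resolution fails and the address stays unresolved forever.
        InetSocketAddress bad = new InetSocketAddress("myCluster", 8020);
        System.out.println("myCluster:8020 unresolved = " + bad.isUnresolved());

        // A real, resolvable host (loopback) resolves immediately.
        InetSocketAddress ok = new InetSocketAddress("localhost", 8020);
        System.out.println("localhost:8020 unresolved = " + ok.isUnresolved());
    }
}
```

An address like this is exactly what the DataNode then retries against
indefinitely, producing the recurring "Problem connecting to server" warning.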
My question is why *myCluster:8020* is passed as the default value to
*getConfValue* at all (null might be better here), when it can never be a valid
hostname in this case and will never resolve. My hunch is that this code works
fine in the non-HA case, which may make changing it a bit tricky. If, on the
other hand, this code path isn't taken in the non-HA case, then it may be
fairly easy to provide better configuration validation. I'm new to Hadoop
development, so I don't have a good sense of what sort of hornet's nest I may
be kicking by trying to make the configuration validation absolutely
bulletproof.
The bottom line for now: I have a simple patch that will, at least, log the
problem with the unresolved entry at startup.
That message is a minor improvement, in that somewhere in the logs there will
be information useful to someone troubleshooting. If the problem doesn't
manifest until the primary NN goes down, however, this fix won't be as useful,
since the more informative message might be buried in the log file.
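As a sketch of the kind of check the patch adds (standalone Java with a
hypothetical helper name; the real patch would live in *DFSUtil* and use
Hadoop's logging), one could scan the address map after it is built and warn
about any entry that never resolved:

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AddressCheckDemo {
    // Hypothetical helper mirroring the proposed DFSUtil check: collect the
    // namenode IDs whose configured address could not be resolved, so the
    // caller can log a startup warning for each one.
    static List<String> findUnresolved(Map<String, InetSocketAddress> addrs) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, InetSocketAddress> e : addrs.entrySet()) {
            if (e.getValue().isUnresolved()) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, InetSocketAddress> addrs = new LinkedHashMap<>();
        // vince-1 is configured correctly (loopback stands in for a real host).
        addrs.put("vince-1", new InetSocketAddress("localhost", 8020));
        // vince2 (the typo) fell back to the default "myCluster:8020",
        // which is a nameservice ID, not a hostname, and never resolves.
        addrs.put("vince2", new InetSocketAddress("myCluster", 8020));

        for (String nnId : findUnresolved(addrs)) {
            System.err.println("WARN: namenode " + nnId
                + " has an unresolvable address " + addrs.get(nnId)
                + "; check dfs.ha.namenodes.* for a typo");
        }
    }
}
```

This only detects the symptom (an unresolved address); cross-checking the IDs
in *dfs.ha.namenodes.myCluster* against the configured rpc-address keys would
catch the typo more directly.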
A slightly better fix may be to tweak the ongoing message (the one shown in the
original description of this Jira, which recurs) to better reflect the
condition being reported and to point the engineer at the likely culprit in the
configuration.
> Need error checking for HA cluster configuration
> ------------------------------------------------
>
> Key: HDFS-5677
> URL: https://issues.apache.org/jira/browse/HDFS-5677
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, ha
> Affects Versions: 2.0.6-alpha
> Environment: centos6.5, oracle jdk6 45,
> Reporter: Vincent Sheffer
> Assignee: Vincent Sheffer
> Priority: Minor
>
> If a node is declared in *dfs.ha.namenodes.myCluster* but is _not_ later
> defined in a subsequent *dfs.namenode.servicerpc-address.myCluster.nodename* or
> *dfs.namenode.rpc-address.myCluster.XXX* property, no error or warning
> message is provided to indicate that.
> The only indication of a problem is a log message like the following:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to
> server: myCluster:8020
> {code}
> Another way to look at this is that no error or warning is provided when a
> servicerpc-address/rpc-address property is defined for a node without a
> corresponding node declared in *dfs.ha.namenodes.myCluster*.
> This arose when I had a typo in the *dfs.ha.namenodes.myCluster* property for
> one of my node names. It would be very helpful to have at least a warning
> message on startup if there is a configuration problem like this.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)