[
https://issues.apache.org/jira/browse/HDFS-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247283#comment-14247283
]
Lei (Eddy) Xu commented on HDFS-6440:
-------------------------------------
Hey [~jesse_yates], thanks for your answers!
I have a few further questions regarding the patch:
1. I did not see where {{isPrimaryCheckPointer}} is set to {{false}}.
{code:title=StandbyCheckpointer.java}
private boolean isPrimaryCheckPointer = true;
...
if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
  this.isPrimaryCheckPointer = true;
  //avoid getting the rest of the results - we don't care since we had a successful upload
  break;
}
{code}
I guess the default value of {{isPrimaryCheckPointer}} might be a typo and
should be {{false}}. Moreover, is there a case where the SNN switches from
being the primary checkpointer back to a non-primary checkpointer?
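To make the question concrete, here is a minimal sketch (not the patch's actual code; the helper name is hypothetical) of how the flag could be recomputed after each round of uploads, so the SNN can also fall back to being a non-primary checkpointer:
{code}
// Hypothetical helper, not from the patch: recompute the flag from the outcome
// of the current round of uploads instead of only ever setting it to true.
private void updatePrimaryCheckpointerFlag(
    List<Future<TransferFsImage.TransferResult>> uploads)
    throws InterruptedException, ExecutionException {
  boolean sawSuccess = false;
  for (Future<TransferFsImage.TransferResult> upload : uploads) {
    if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
      sawSuccess = true;
      break; // one successful upload is enough
    }
  }
  // Demote ourselves when no upload succeeded in this round.
  this.isPrimaryCheckPointer = sawSuccess;
}
{code}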
2. Is the following condition correct? I think only {{sendRequest}} is needed.
{code:title=StandbyCheckpointer.java}
if (needCheckpoint && sendRequest) {
{code}
Also, in the old code:
{code}
} else if (secsSinceLast >= checkpointConf.getPeriod()) {
  LOG.info("Triggering checkpoint because it has been " +
      secsSinceLast + " seconds since the last checkpoint, which " +
      "exceeds the configured interval " + checkpointConf.getPeriod());
  needCheckpoint = true;
}
{code}
Does it imply that if {{secsSinceLast >= checkpointConf.getPeriod()}} is
{{true}}, then {{secsSinceLast >= checkpointConf.getQuietPeriod()}} is always
{{true}} for the default {{quiet multiplier}} value? If that is the case, are
these conditions duplicated?
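Here is the reasoning behind that question as a small sketch; the assumption that {{getQuietPeriod()}} is simply {{getPeriod()}} scaled by a fixed multiplier is mine, I did not verify it in the patch:
{code}
// Sketch only: if the quiet period is simply the period scaled by a fixed
// multiplier, then for any secsSinceLast one of the two checks implies the
// other (which direction depends on whether the multiplier is >= 1 or <= 1),
// so keeping both branches looks redundant.
long period = checkpointConf.getPeriod();
long quietPeriod = checkpointConf.getQuietPeriod();
boolean periodExceeded = secsSinceLast >= period;
boolean quietPeriodExceeded = secsSinceLast >= quietPeriod;
{code}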
It looks like it might be easier to let the ANN calculate the above
conditions, as it has the actual system-wide knowledge of the last upload and
the last txid. That could be a nice optimization later.
3. When the fsimage is uploaded, are {{SC_CONFLICT}} and {{SC_EXPECTATION_FAILED}}
not handled on the SNN side in the current patch? Do you plan to handle them in
a follow-up patch?
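For what it is worth, a rough sketch of what SNN-side handling could look like (this would sit inside the existing {{try}} block around {{upload.get()}} in the upload loop; everything other than {{SUCCESS}} is my assumption about how those HTTP codes might be surfaced, not the patch's API):
{code}
// Sketch only: react to non-success results instead of silently ignoring them.
TransferFsImage.TransferResult result = upload.get();
if (result == TransferFsImage.TransferResult.SUCCESS) {
  this.isPrimaryCheckPointer = true;
  break;
} else {
  // e.g. SC_CONFLICT (another checkpointer already uploaded an image) or
  // SC_EXPECTATION_FAILED (the image was rejected as too old): log and move
  // on to the next NN rather than treating it as a fatal error.
  LOG.warn("Image upload was rejected by the remote NameNode: " + result);
}
{code}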
4. Could you make {{EditLogTailer#maxRetries}} {{private final}}? Do we need
to enforce an acceptable value range for {{maxRetries}}? For instance, in the
following code, it would not try every NN when {{nextNN = nns.size() - 1}} and
{{maxRetries = 1}}:
{code}
// if we have reached the max loop count, quit by returning null
if (nextNN / nns.size() >= maxRetries) {
  return null;
}
{code}
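To illustrate the concern with concrete numbers (assuming I read the retry loop correctly), and to sketch one possible way of enforcing a range when the value is read from configuration (the key and default names below are placeholders, not from the patch):
{code}
// Worked example of the concern: with nns.size() == 3, maxRetries == 1 and
// nextNN already at 2, only NN 2 is attempted before nextNN / nns.size()
// reaches 1 >= maxRetries and the loop returns null, so NN 0 and NN 1 are
// never tried.
//
// A hypothetical lower-bound guard when reading the setting:
int maxRetries = conf.getInt(MAX_RETRIES_KEY, MAX_RETRIES_DEFAULT);
if (maxRetries < 1) {
  LOG.warn("Invalid value " + maxRetries + " for " + MAX_RETRIES_KEY
      + "; falling back to default " + MAX_RETRIES_DEFAULT);
  maxRetries = MAX_RETRIES_DEFAULT;
}
{code}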
5. There are a few formatting-only changes, e.g., in {{doCheckpointing()}}.
Could you remove them to reduce the size of the patch?
Also, the following code is indented incorrectly:
{code}
int i = 0;
for (; i < uploads.size(); i++) {
  Future<TransferFsImage.TransferResult> upload = uploads.get(i);
  try {
    // TODO should there be some smarts here about retries nodes that are not the active NN?
    if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
      this.isPrimaryCheckPointer = true;
      //avoid getting the rest of the results - we don't care since we had a successful upload
      break;
    }
  } catch (ExecutionException e) {
    ioe = new IOException("Exception during image upload: " + e.getMessage(),
        e.getCause());
    break;
  } catch (InterruptedException e) {
    ie = null;
    break;
  }
}
{code}
Other parts LGTM. Thanks again for working on this, [~jesse_yates]!
> Support more than 2 NameNodes
> -----------------------------
>
> Key: HDFS-6440
> URL: https://issues.apache.org/jira/browse/HDFS-6440
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: auto-failover, ha, namenode
> Affects Versions: 2.4.0
> Reporter: Jesse Yates
> Assignee: Jesse Yates
> Attachments: Multiple-Standby-NameNodes_V1.pdf,
> hdfs-6440-cdh-4.5-full.patch, hdfs-multiple-snn-trunk-v0.patch
>
>
> Most of the work is already done to support more than 2 NameNodes (one
> active, one standby). This would be the last bit to support running multiple
> _standby_ NameNodes; one of the standbys should be available for fail-over.
> Mostly, this is a matter of updating how we parse configurations, some
> complexity around managing the checkpointing, and updating a whole lot of
> tests.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)