[ 
https://issues.apache.org/jira/browse/HDFS-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038870#comment-13038870
 ] 

Eli Collins commented on HDFS-1984:
-----------------------------------

+1  Nice tests.  Feel free to address the following in another change.

Can't these two threads in the test race? I imagine they never would in practice.
{noformat}
checkpointThread.start();
// Wait for the first checkpointer to get to where it should save its image.
delayer.waitForCall();
{noformat}
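
For what it's worth, if the delayer is backed by a latch, the ordering of the two threads should not matter: the checkpointer's arrival is recorded even when the test thread has not started waiting yet. A minimal sketch of that idea, assuming a hypothetical CallLatch helper (this is not the delayer class the test actually uses, whose implementation is not shown here):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not the delayer used in the test. */
class CallLatch {
  private final CountDownLatch called = new CountDownLatch(1);
  private final CountDownLatch released = new CountDownLatch(1);

  /** Invoked by the worker thread at the point where it should pause. */
  void onCall() throws InterruptedException {
    called.countDown();  // record arrival, even if nobody is waiting yet
    released.await();    // block until the test thread releases us
  }

  /** Invoked by the test thread; returns true once the worker has arrived. */
  boolean waitForCall(long timeout, TimeUnit unit) throws InterruptedException {
    return called.await(timeout, unit);
  }

  /** Lets the paused worker continue. */
  void proceed() {
    released.countDown();
  }
}
```

Because a CountDownLatch stays open once counted down, waitForCall() returns immediately if the worker got there first, and blocks otherwise.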

It should be rare for an image to have no MD5 file, i.e. it only happens when 
the image comes from a previous version. Would it therefore make sense to warn 
in places like setVerificationHeaders when the MD5 file is not present? Would 
it make sense to establish the invariant that an MD5 is required?
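
Roughly what I have in mind for the warning, as a sketch only. The Md5Check class, the ".md5" sidecar suffix, and the use of java.util.logging are all assumptions made for illustration, not what the branch does:

```java
import java.io.File;
import java.util.logging.Logger;

/** Hypothetical helper; names and the ".md5" suffix are assumptions. */
class Md5Check {
  private static final Logger LOG = Logger.getLogger(Md5Check.class.getName());
  static final String MD5_SUFFIX = ".md5";

  /** Returns the assumed location of the MD5 sidecar for an image file. */
  static File md5FileFor(File image) {
    return new File(image.getPath() + MD5_SUFFIX);
  }

  /** Returns true if the sidecar exists; logs a warning otherwise. */
  static boolean hasMd5(File image) {
    File md5 = md5FileFor(image);
    if (!md5.exists()) {
      LOG.warning("No MD5 file found for image " + image
          + " (expected " + md5 + "); is this image from a previous version?");
      return false;
    }
    return true;
  }
}
```

If the invariant were made strict, the warning branch would throw instead of returning false.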

Not your change, but it would be less error-prone if ErrorSimulation used e.g. 
an enum constant such as CORRUPT_IMG_XFER instead of "4".
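
Something along these lines, as a sketch. SimulatedFault and FaultInjector are hypothetical names and do not reflect the existing error simulation code; only CORRUPT_IMG_XFER comes from the suggestion above:

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical enum replacing numeric fault indices such as "4". */
enum SimulatedFault {
  CORRUPT_IMG_XFER
}

/** Hypothetical injector keyed by enum constant rather than by index. */
class FaultInjector {
  private final Set<SimulatedFault> active = EnumSet.noneOf(SimulatedFault.class);

  /** Turns on the given simulated fault. */
  synchronized void set(SimulatedFault fault) {
    active.add(fault);
  }

  /** Turns off the given simulated fault. */
  synchronized void clear(SimulatedFault fault) {
    active.remove(fault);
  }

  /** Returns true if the given simulated fault is currently on. */
  synchronized boolean isSet(SimulatedFault fault) {
    return active.contains(fault);
  }
}
```

A call site then reads injector.isSet(SimulatedFault.CORRUPT_IMG_XFER), and a mistyped constant fails at compile time rather than silently selecting the wrong fault.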

> HDFS-1073: Enable multiple checkpointers to run simultaneously
> --------------------------------------------------------------
>
>                 Key: HDFS-1984
>                 URL: https://issues.apache.org/jira/browse/HDFS-1984
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: name-node
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: hdfs-1984.txt
>
>
> One of the motivations of HDFS-1073 is that it decouples the checkpoint 
> process so that multiple checkpoints could be taken at the same time and not 
> interfere with each other.
> Currently on the 1073 branch this doesn't quite work right, since we have 
> some state and validation in FSImage that's tied to a single fsimage_N -- 
> thus if two 2NNs perform a checkpoint at different transaction IDs, only one 
> will succeed.
> As a stress test, we can run two 2NNs each configured with the 
> fs.checkpoint.interval set to "0" which causes them to continuously 
> checkpoint as fast as they can.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
