[jira] [Commented] (HDFS-7121) For JournalNode operations that must succeed on all nodes, attempt to undo the operation on all nodes if it fails on one node.

Chris Nauroth (JIRA) Mon, 22 Sep 2014 11:22:19 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143565#comment-14143565
 ]


Chris Nauroth commented on HDFS-7121:
-------------------------------------

I don't have a specific design in mind yet, so brainstorming comments are 
welcome.  Possible ideas so far are:
# If the {{QuorumJournalManager}} client gets an exception from any node, then
send a corresponding undo message to the nodes that already completed the
operation successfully.  This would be best effort only, because a badly timed
network failure could prevent delivery of the undo message, leaving that
JournalNode in an inconsistent state.
# Do a full-fledged multi-phase commit.  The operations involved run only
rarely, as "offline" events like software upgrade and rollback, so I don't
expect the usual scalability criticisms of multi-phase commit protocols to be
a problem here.
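
To make the first idea concrete, here is a rough sketch of the best-effort
undo flow.  Everything in it is hypothetical: {{UndoableNode}},
{{PartialFailureException}}, {{FakeNode}} and {{BestEffortUndo}} are invented
for illustration and are not existing Hadoop classes, and the real client
talks to JournalNodes in parallel rather than one at a time as shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a JournalNode RPC proxy; not the real Hadoop API. */
interface UndoableNode {
  String name();
  void apply() throws Exception;  // e.g. a doPreUpgrade-style step
  void undo() throws Exception;   // hypothetical inverse of apply()
}

/** Thrown when the operation failed on a node; lists nodes whose undo also failed. */
class PartialFailureException extends Exception {
  final List<String> notUndone;

  PartialFailureException(String message, Throwable cause, List<String> notUndone) {
    super(message, cause);
    this.notUndone = notUndone;
  }
}

/** In-memory node used only to exercise the sketch. */
class FakeNode implements UndoableNode {
  private final String name;
  private final boolean failOnApply;
  boolean applied = false;

  FakeNode(String name, boolean failOnApply) {
    this.name = name;
    this.failOnApply = failOnApply;
  }

  @Override public String name() { return name; }

  @Override public void apply() throws Exception {
    if (failOnApply) {
      throw new Exception("simulated failure on " + name);
    }
    applied = true;
  }

  @Override public void undo() { applied = false; }
}

class BestEffortUndo {
  /**
   * Applies the operation to each node in turn. On the first failure, sends
   * undo to every node that had already succeeded, then throws.
   */
  static void applyOnAll(List<? extends UndoableNode> nodes)
      throws PartialFailureException {
    List<UndoableNode> succeeded = new ArrayList<>();
    for (UndoableNode node : nodes) {
      try {
        node.apply();
        succeeded.add(node);
      } catch (Exception applyFailure) {
        List<String> notUndone = new ArrayList<>();
        for (UndoableNode done : succeeded) {
          try {
            done.undo();
          } catch (Exception undoFailure) {
            // Best effort only: this node stays inconsistent and still
            // needs manual recovery.
            notUndone.add(done.name());
          }
        }
        throw new PartialFailureException(
            "operation failed on " + node.name(), applyFailure, notUndone);
      }
    }
  }

  public static void main(String[] args) {
    FakeNode jn1 = new FakeNode("jn1", false);
    FakeNode jn2 = new FakeNode("jn2", false);
    FakeNode jn3 = new FakeNode("jn3", true);  // simulated failure
    try {
      applyOnAll(Arrays.asList(jn1, jn2, jn3));
    } catch (PartialFailureException e) {
      System.out.println(e.getMessage() + "; not undone: " + e.notUndone);
    }
  }
}
```

The {{notUndone}} list is where the weakness of this approach shows up: any
node named there did not receive or process the undo message, so it remains in
the inconsistent state described above.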

> For JournalNode operations that must succeed on all nodes, attempt to undo 
> the operation on all nodes if it fails on one node.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7121
>                 URL: https://issues.apache.org/jira/browse/HDFS-7121
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: journal-node
>            Reporter: Chris Nauroth
>
> Several JournalNode operations are not satisfied by a quorum.  They must 
> succeed on every JournalNode in the cluster.  If the operation succeeds on 
> some nodes but fails on others, this may leave the nodes in an inconsistent
> state and require operators to perform manual recovery steps.  For
> example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then 
> the operator will need to correct the problem on the failed node and also 
> manually restore the previous.tmp directory to current on the 2 successful 
> nodes before reattempting the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
