[ 
https://issues.apache.org/jira/browse/HDFS-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143565#comment-14143565
 ] 

Chris Nauroth commented on HDFS-7121:
-------------------------------------

I don't have a specific design in mind yet, so brainstorming comments are 
welcome.  Possible ideas so far are:
# If the {{QuorumJournalManager}} client gets an exception on any node, then 
send a corresponding undo message to the nodes that previously completed the 
operation successfully.  This would be best effort only, because a well-timed 
network failure could prevent delivery of the undo message, and that 
JournalNode still would be left in an inconsistent state.
# Do a full-fledged multi-phase commit.  The operations involved are executed 
only rarely, as "offline" events like software upgrade and rollback, so I don't 
expect the usual scalability criticisms of multi-phase commit protocols to be 
a problem here.
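
To make the first idea concrete, here is a rough sketch of best-effort undo.  It is only an illustration, not a proposed patch: {{UndoableNode}}, {{FakeNode}} and {{applyOnAll}} are hypothetical names that do not exist in Hadoop, and the real {{QuorumJournalManager}} talks to the JournalNodes in parallel over RPC rather than in a sequential loop.  The point is the last step: if the undo message itself fails, the node is only reported as stranded, so an operator still has to fix it by hand.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a JournalNode; NOT the real Hadoop interface.
interface UndoableNode {
  void apply() throws IOException;  // e.g. a pre-upgrade step
  void undo() throws IOException;   // reverses a completed apply()
}

// In-memory fake node, used only to exercise the sketch.
class FakeNode implements UndoableNode {
  final boolean failApply;
  final boolean failUndo;
  boolean applied = false;

  FakeNode(boolean failApply, boolean failUndo) {
    this.failApply = failApply;
    this.failUndo = failUndo;
  }

  @Override
  public void apply() throws IOException {
    if (failApply) {
      throw new IOException("apply failed");
    }
    applied = true;
  }

  @Override
  public void undo() throws IOException {
    if (failUndo) {
      throw new IOException("undo message not delivered");
    }
    applied = false;
  }
}

class BestEffortUndoSketch {
  /**
   * Tries apply() on every node.  Returns true only if all nodes succeeded.
   * On any failure, sends undo() to the nodes that had succeeded; nodes whose
   * undo also fails are added to 'stranded' (they remain inconsistent).
   */
  static boolean applyOnAll(List<? extends UndoableNode> nodes,
                            List<UndoableNode> stranded) {
    List<UndoableNode> succeeded = new ArrayList<>();
    boolean anyFailed = false;
    for (UndoableNode node : nodes) {
      try {
        node.apply();
        succeeded.add(node);
      } catch (IOException e) {
        anyFailed = true;
      }
    }
    if (!anyFailed) {
      return true;
    }
    for (UndoableNode node : succeeded) {
      try {
        node.undo();
      } catch (IOException e) {
        // Best effort only: nothing more the client can do for this node.
        stranded.add(node);
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FakeNode jn1 = new FakeNode(false, false); // applies, undo works
    FakeNode jn2 = new FakeNode(false, true);  // applies, undo is lost
    FakeNode jn3 = new FakeNode(true, false);  // apply fails
    List<UndoableNode> stranded = new ArrayList<>();
    boolean ok = applyOnAll(Arrays.asList(jn1, jn2, jn3), stranded);
    System.out.println("succeeded on all nodes: " + ok);
    System.out.println("nodes needing manual recovery: " + stranded.size());
  }
}
```

In the {{main}} example, the second node models the "well-timed network failure" case: its apply succeeds but its undo is lost, so it ends up in the stranded list.  A multi-phase commit would aim to shrink that window, at the cost of more round trips.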

> For JournalNode operations that must succeed on all nodes, attempt to undo 
> the operation on all nodes if it fails on one node.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7121
>                 URL: https://issues.apache.org/jira/browse/HDFS-7121
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: journal-node
>            Reporter: Chris Nauroth
>
> Several JournalNode operations are not satisfied by a quorum.  They must 
> succeed on every JournalNode in the cluster.  If the operation succeeds on 
> some nodes but fails on others, this may leave the nodes in an 
> inconsistent state and require operators to perform manual recovery steps.  For 
> example, if {{doPreUpgrade}} succeeds on 2 nodes and fails on 1 node, then 
> the operator will need to correct the problem on the failed node and also 
> manually restore the previous.tmp directory to current on the 2 successful 
> nodes before reattempting the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
