[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471762#comment-13471762 ]

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. the QJM is not replaceable by the local disk journal if the QJM is not 
available, because the local disk journal and QJM will not be consistent

In a shared-edits setup, the shared edits are always synced ahead of the local 
disk journals. This means that anything that's committed locally will also be 
in the QJM. If the QJM fails to sync, then the NN aborts (since the shared 
edits are marked as "required"). So, they're not "consistent" but you can 
always take finalized edits from a JN and copy them to a local drive. In the 
case of some disaster, you can also freely intermingle the files - having that 
flexibility without having to hex-edit out a header seems prudent.
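
To illustrate that ordering guarantee, here is a minimal sketch (the class and method names are mine, not the real HDFS ones) of the "required journal" rule described above:

{code:java}
import java.io.IOException;

interface Journal {
    void sync(byte[] edits) throws IOException;
}

class OrderedJournalSet {
    private final Journal shared; // required, e.g. backed by a quorum of JNs
    private final Journal local;  // optional, e.g. a local disk directory

    OrderedJournalSet(Journal shared, Journal local) {
        this.shared = shared;
        this.local = local;
    }

    void sync(byte[] edits) {
        try {
            shared.sync(edits);  // shared edits are always synced first
        } catch (IOException e) {
            // required journal failed: abort the NN rather than let the
            // local journal get ahead of the QJM
            throw new IllegalStateException("Required journal failed", e);
        }
        try {
            local.sync(edits);   // so the local disk can only ever lag the QJM
        } catch (IOException e) {
            // optional journal: mark it failed and keep running
        }
    }
}
{code}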

More importantly, IMO, you can continue to run tools like the 
OfflineEditsViewer against edits stored on a JN.

bq. We may have to revise the journal abstraction a little to deal with the 
above situation (independent of storing the epoch in the first entry), since a 
QJM+localDisk journal is useful.

This is in fact the way in which we've been doing all of our HA testing (and 
now some customer deploys). We use local FileJournalManagers on each NN, and 
the QuorumJournalManager as shared edits. Per above, this works fine and 
doesn't have any "lost edit" issues.
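
For reference, a sketch of that deployment (host names and paths are placeholders; the config keys are the standard HDFS HA ones):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class HaEditsConfig {
    public static Configuration sketch() {
        Configuration conf = new Configuration();
        // local FileJournalManager on each NN
        conf.set("dfs.namenode.edits.dir", "/data/1/dfs/nn");
        // QuorumJournalManager as the required shared edits
        conf.set("dfs.namenode.shared.edits.dir",
            "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        return conf;
    }
}
{code}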

Can you be specific about the consistency issue you're foreseeing here?

bq. Todd, the change is small and I am trying to help you here.

Maybe I'm misunderstanding the change. More below on this...

bq. Recall in HDFS-1073 you did not want to use transaction ids or name the log 
files using transaction id range and you argued against this for quite a while. 
As I predicted, txids have become a cornerstone of HA and managing journals

To be clear, the 1073 design was always using transaction IDs, it was just a 
matter of the file naming that we argued about. But I don't think it's 
productive to argue about the past :)

bq. You have argued that prepare-recovery, using the epoch number from previous 
newEpoch, is like multi-paxos - not sure if multi-paxos is warranted here.

Can you explain what you mean by "not sure if multi-paxos is warranted here?" I 
just meant that, similar to multi-paxos, you can use an earlier promise to 
verify all future messages against that earlier epoch ID. Otherwise each 
operation would require its own new epoch, and that's clearly not what we want.
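
Here's a simplified sketch of that epoch check (hypothetical names; persistence and RPC plumbing elided): the JN records the promise made in newEpoch() once, and every later call from the writer is validated against it rather than requiring a new round per operation.

{code:java}
import java.io.IOException;

class JournalNodeState {
    private long promisedEpoch = 0;

    synchronized long newEpoch(long epoch) throws IOException {
        if (epoch <= promisedEpoch) {
            throw new IOException("epoch " + epoch
                + " <= promised epoch " + promisedEpoch);
        }
        promisedEpoch = epoch;        // a real JN persists this before replying
        return highestSegmentTxId();  // see the scenario discussed below
    }

    synchronized void journal(long epoch, byte[] records) throws IOException {
        checkRequest(epoch);          // every write re-verified against the promise
        // ... append records to the current segment ...
    }

    private void checkRequest(long epoch) throws IOException {
        if (epoch < promisedEpoch) {
            throw new IOException("stale writer: epoch " + epoch
                + " < promised epoch " + promisedEpoch);
        }
    }

    private long highestSegmentTxId() {
        return 0; // placeholder
    }
}
{code}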

bq. The response of newEpoch() is the highest txId, while the response to 
prepareRecovery is the state of the highest segment, and optionally additional 
info if there was a previous recovery in progress

I think there is some confusion here. The response to newEpoch is the highest 
segment txid, but the highest segment txid may not match up across all of the 
JNs. On a 3-JN setup, you may have three different responses to NewEpoch. For 
example, the following scenario:

1. NN writing segment starting at txid 1
2. JN1 crashes after txid 50
3. NN successfully rolls, starts txid 101
4. NN successfully finalizes segment 101-200. Starts segment 201
5. NN sends txid 201 to JN2 and JN3, but JN2 crashes before receiving it.
6. Everyone shuts down and restarts.

The current state is then:
JN1: highestSegmentTxId = 1, since it had crashed during that segment
JN2: highestSegmentTxId = 101, since it never got any transactions for segment 
201
JN3: highestSegmentTxId = 201, since it got one transaction for segment 201

Then, depending on which JNs are available during recovery, the 
prepareRecovery() call targets a different segment, and thus we'd need 
different responses. It's not really simple to piggy-back the segment info on 
newEpoch, because at that point we don't yet know which segment is the one to 
be recovered (it may be a segment that is only available on one live node).
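
In code, the coordinator can only pick the segment to recover after collecting a quorum of newEpoch responses. A sketch (hypothetical names), using the scenario above:

{code:java}
import java.util.Collection;
import java.util.Collections;

class RecoveryCoordinator {
    // highestSegmentTxIds holds one response per JN that answered newEpoch().
    // With JN1+JN2 live this returns 101; with any quorum including JN3 it
    // returns 201 -- so prepareRecovery() targets a different segment.
    long segmentToRecover(Collection<Long> highestSegmentTxIds) {
        return Collections.max(highestSegmentTxIds);
    }
}
{code}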


Am I misunderstanding the concrete change that you're proposing? Maybe you can 
post a patch?

bq. In step 3b you state that recovery metadata is created and then deleted in 
step 4. Isn't the updated journal file sufficient? In Paxos, the protocol has 
essentially completed when phase 2 finishes and a quorum of Journals have 
learned the new value. From what I understand, even in ZAB the journal is 
updated at that stage and no separate metadata is persisted.

The updated journal file isn't sufficient because it doesn't record whether it 
was an accepted recovery proposal or merely left over from the last write. You 
need to ensure the property that, if the recovery coordinator thinks a value is 
accepted, then no different recovery will be accepted in the future (otherwise 
you risk having two different finalized lengths for the same log segment). To 
do so, you need to wait until a quorum of nodes has finalized the segment 
before you know that any future recovery can rely on the finalization state 
alone.
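
A sketch of the separate recovery record argued for above (file layout and names are made up; a real implementation would also fsync the file and its directory):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class AcceptedRecoveryStore {
    private final Path paxosDir;

    AcceptedRecoveryStore(Path paxosDir) {
        this.paxosDir = paxosDir;
    }

    // acceptRecovery(): durably record which proposal was accepted for this
    // segment before acking, so it can't be confused with leftover writes.
    void accept(long segmentTxId, long acceptedInEpoch, long endTxId)
            throws IOException {
        Path f = paxosDir.resolve(Long.toString(segmentTxId));
        String record = acceptedInEpoch + " " + endTxId;
        Files.write(f, record.getBytes(StandardCharsets.UTF_8));
    }

    // prepareRecovery(): report any previously accepted proposal to the
    // coordinator so no conflicting recovery can be chosen later.
    String previouslyAccepted(long segmentTxId) throws IOException {
        Path f = paxosDir.resolve(Long.toString(segmentTxId));
        return Files.exists(f)
            ? new String(Files.readAllBytes(f), StandardCharsets.UTF_8)
            : null;
    }
}
{code}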

I don't know enough about the details of the ZAB implementation to understand 
why they can get away without this, if in fact they can. My guess is that it's 
because the transaction IDs themselves have the epoch number as their high 
order bits, and hence you can't ever confuse the first txn of epoch N+1 with 
the last transaction of epoch N.
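
If that guess is right, the ZAB-style txid layout looks something like this (my sketch, not ZooKeeper's actual code):

{code:java}
class Zxid {
    // epoch in the high 32 bits, per-epoch counter in the low 32 bits,
    // so txn 1 of epoch N+1 can never collide with any txn of epoch N
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epoch(long zxid)   { return zxid >>> 32; }
    static long counter(long zxid) { return zxid & 0xFFFFFFFFL; }
}
{code}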

bq. The final step (finalize-segment or ZAB's commit) is really to let all the 
JNs know that the new writer is the leader and that they can publish the data 
to other readers (the standby in our case).

Agreed. At this point we delete the metadata for recovery, but we don't 
necessarily have to. It's just a convenient place to do the cleanup.
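
Continuing the AcceptedRecoveryStore sketch above (still hypothetical names), finalization is just a convenient point to drop the record:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SegmentFinalizer {
    private final Path paxosDir;

    SegmentFinalizer(Path paxosDir) {
        this.paxosDir = paxosDir;
    }

    void finalizeSegment(long segmentTxId, long endTxId) throws IOException {
        // 1. mark the segment finalized (e.g. rename the in-progress file)
        // 2. then delete the recovery metadata; once a quorum is finalized
        //    it is no longer needed, though keeping it would also be safe
        Files.deleteIfExists(paxosDir.resolve(Long.toString(segmentTxId)));
    }
}
{code}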
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077-test-merge.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.pdf, qjournal-design.pdf, 
> qjournal-design.pdf, qjournal-design.tex, qjournal-design.tex
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.
