[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399757#comment-13399757
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
bq. if the JournalNode that already received a prepare RPC with a higher newEpoch
number (possible?) can inform the writer (proposer), the writer can exit earlier
in step 2 "Choosing a recovery".
Yes, it would reject such RPCs. Once it receives newEpoch(N), it won't accept
any RPCs from any writer with a lower epoch number (the normal Paxos promise
guarantee).
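For illustration, here is a minimal sketch of that promise check. The class and
method names are hypothetical, not the actual JournalNode code:
{code:java}
import java.io.IOException;

// Sketch only: tracks the highest epoch this JournalNode has promised to,
// and rejects any RPC from a writer carrying a stale (lower) epoch.
class EpochState {
  private long promisedEpoch = 0;

  // Handles newEpoch(N): refuse anything not strictly newer than the current promise.
  synchronized void newEpoch(long epoch) throws IOException {
    if (epoch <= promisedEpoch) {
      throw new IOException("Epoch " + epoch + " is not newer than promised epoch "
          + promisedEpoch);
    }
    promisedEpoch = epoch;
  }

  // Called on every subsequent writer RPC (journal, startLogSegment, accept, ...).
  synchronized void checkWriterEpoch(long writerEpoch) throws IOException {
    if (writerEpoch < promisedEpoch) {
      throw new IOException("Rejecting RPC from writer with stale epoch " + writerEpoch
          + "; promised epoch is " + promisedEpoch);
    }
  }
}
{code}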
bq. In step 3 "Accept RPC", I assume the URL that the writer sends to all the
JNs is the URL of one JN which responded in step 2. If that JN becomes
inaccessible immediately, so that other JNs can't sync themselves by
downloading the finalized segment from it, could the recovery process get
stuck?
More or less, yes. A potential improvement would be to send a list of URLs
here, including every JN that already has the correct segment in sync.
IMO, recovery needs to be tolerant of missing nodes, but I don't think we need
it to be tolerant of nodes crashing in the middle of the recovery process --
it's OK for the NN to bail out of startup in that case, so long as it doesn't
leave the system in an unrecoverable state. The "retry" would be done by trying
again to start the NN. I'll try to address this in the next rev of the
document. It would be future work to add a retry loop around the recovery
process so that it tolerates such crashes.
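If we did add that retry loop, it might look roughly like the sketch below. This
is only an illustration of the idea; the method names are made up, and the
recovery call stands in for the whole prepare/accept/finalize exchange with a
quorum of JNs:
{code:java}
import java.io.IOException;

// Hypothetical outline of a retry loop around segment recovery.
abstract class RecoveryRunner {
  // Assumed helper: runs one full recovery attempt against a quorum of JournalNodes.
  abstract void runSegmentRecovery() throws IOException;

  void recoverWithRetries(int maxAttempts) throws IOException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        runSegmentRecovery();
        return;  // recovery finished; NN startup can proceed
      } catch (IOException e) {
        if (attempt >= maxAttempts) {
          throw e;  // give up; the NN bails out of startup, as it does today
        }
        Thread.sleep(1000L * attempt);  // back off and retry with a fresh quorum
      }
    }
  }
}
{code}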
My thinking behind treating mid-recovery crashes as a non-issue is that log
segments tend to be short -- in an HA setup we roll every 2 minutes. So even in
a heavily loaded cluster, segments tend to be quite small. I just looked on a
100-node QA cluster here running HA and a constant workload of terasort,
gridmix, etc, and the largest edit log segment is 3.1MB. So the recovery process
should complete in far less than 1 second, given that the transfer time for such
a segment would be <~50ms. Hence the chances of a crash during this timeframe
are vanishingly small.
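A quick back-of-the-envelope check of that figure (the throughput number below
is an assumption, roughly what a well-behaved gigabit link sustains, not a
measurement):
{code:java}
// Rough estimate of the transfer time for the largest observed segment.
public class SegmentTransferEstimate {
  public static void main(String[] args) {
    double segmentMB = 3.1;     // largest edit log segment seen on the QA cluster
    double mbPerSec  = 100.0;   // assumed effective network throughput
    double transferMs = segmentMB / mbPerSec * 1000.0;
    System.out.printf("Estimated transfer time: %.0f ms%n", transferMs);  // ~31 ms
  }
}
{code}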
bq. If it could be stuck in step 3, an alternate way to sync lagging JNs is to
let them contact the other JNs in the quorum to download the finalized segments.
Given the requirement "All loggers must finalize the segment to the same length
and contents", all finalized segments with the same name should be identical
across JNs. Therefore, a lagging JN can download the segment from any other JN
that has the file.
Yep, per the above I think we can extend this to a list of all JNs which are
known to be up to date with the sync txid. But I think that's a good future
enhancement rather than a requirement for right now. Does my reasoning make
sense?
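To make the idea concrete, a lagging JN given such a list could simply try each
candidate in turn. This is just a sketch under the assumptions above; the names
and the download helper are illustrative, not existing HDFS APIs:
{code:java}
import java.io.IOException;
import java.net.URL;
import java.util.List;

// Hypothetical fallback: try every JN known to hold the finalized segment
// until one download succeeds. Any copy works, since finalized segments
// with the same name are byte-identical across JNs.
abstract class SegmentSyncer {
  // Assumed helper: fetch the finalized segment [firstTxId, lastTxId] from one JN.
  abstract void downloadFinalizedSegment(URL jn, long firstTxId, long lastTxId)
      throws IOException;

  void syncSegment(List<URL> candidateJns, long firstTxId, long lastTxId)
      throws IOException {
    IOException lastFailure = null;
    for (URL jn : candidateJns) {
      try {
        downloadFinalizedSegment(jn, firstTxId, lastTxId);
        return;  // got a good copy
      } catch (IOException e) {
        lastFailure = e;  // this JN is unreachable; fall through to the next one
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("no candidate JournalNodes provided");
  }
}
{code}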
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-3077-partial.txt, qjournal-design.pdf,
> qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.