[
https://issues.apache.org/jira/browse/HDFS-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Thomas updated HDFS-6777:
-------------------------------
Attachment: 6777-design.pdf
> Supporting consistent edit log reads when in-progress edit log segments are
> included
> ------------------------------------------------------------------------------------
>
> Key: HDFS-6777
> URL: https://issues.apache.org/jira/browse/HDFS-6777
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: qjm
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: 6777-design.pdf, HDFS-6777.patch
>
>
> For inotify, we want to be able to read transactions from in-progress edit
> log segments so we can serve transactions to listeners soon after they are
> committed.
> Inotify clients ask the active NN for transactions, and the NN opens
> EditLogInputStreams that pull transactions from its various edit repositories
> (local directories, individual JournalNodes, etc.). The goal is to send back
> only successfully sync’ed transactions. What constitutes a successful sync
> varies among the edit repositories. In the case of the JournalNodes, a
> successful sync requires a write to a quorum of the JournalNodes. Properly
> sync’ed transactions are always applied to the namesystem, and unsync’ed
> transactions are never applied. So it is natural that listeners are only
> interested in sync’ed transactions.
> In particular, the NN reads from EditLogInputStreams that it obtains from
> FSEditLog.selectInputStreams (inProgressOk = true) and sends transactions
> back to the client. Some infrastructural changes are required to prevent any
> unsync’ed transactions from reaching the client, and we tackle those in this
> JIRA.
> The current implementation of selectInputStreams (inProgressOk = true,
> fromTxId = X) in FSEditLog essentially asks all available edit log
> repositories for all in-progress or finalized log segments with transactions
> after and including X, and combines log segments starting at the same
> transaction ID in a RedundantEditLogInputStream.
> The first issues with this is that the RedundantEditLogInputStream may
> combine finalized and in-progress segments starting at the same txid, and may
> serve unsync’ed transactions from the in-progress segments. So we simply
> discard in-progress segments if we have any finalized segments starting at
> the same txid -- the modifcation is to
> JournalSet.chainAndMakeRedundantStreams.
> The second change we need to make is for the case where we only have
> in-progress segments starting at a particular txid. In this case, we know we
> are at the log segment the NN currently has open, since the various journals
> ensure that if there are finalized segments available for a particular
> starting txid, we are able to see them (e.g. reads from QJM require a quorum
> of JournalNodes to respond, and a finalized segment must be present on at
> least one JN in the quorum). Our goal in this case is to return only the
> in-progress segments being written by the current NN. This happens trivially
> in the local edits directory case and the NFS shared edits case, since there
> is only a single “replica” and if the NN is writing to it, it is fully
> up-to-date. So we focus on QJM here. The key is that JournalNodes maintain a
> lastWriterEpoch, which increases monotonically as new writers arrive. So we
> keep track of the lastWriterEpochs of the segments we receive from the JNs
> and discard any segments with lastWriterEpochs less than the maximum one
> seen. We know that when a new writer contacts the JournalNodes, it finalizes
> all sync’ed transactions, so in-progress segments with out-of-date
> lastWriterEpochs may contain unsync’ed transactions and any sync’ed
> transactions they contain will also be contained in finalized segments on a
> quorum on JNs. So we can safely discard them. The segments with the largest
> lastWriterEpochs may also contain unsync’ed transactions from a previous
> writer if the current NN has not yet sync’ed any transactions, but in this
> case we will make sure to read only from finalized segments (logic not in
> this JIRA). These segments may also contain unsync’ed transactions from the
> current NN (e.g. which we read in the window where the NN has written to less
> than a quorum of JNs but has not yet crashed due to failure to reach a quorum
> on write), but we only return to the client transactions with txids less than
> or equal to last txid the NN has sync’ed, a value it already stores in a
> variable (logic not in this JIRA). The main modification for this change is
> to QuorumJournalManager.selectInputStreams.
--
This message was sent by Atlassian JIRA
(v6.2#6252)