[
https://issues.apache.org/jira/browse/HDFS-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Thomas updated HDFS-6777:
-------------------------------
Description: For inotify, we want to be able to read transactions from
in-progress edit log segments so we can serve transactions to listeners soon
after they are committed. This JIRA works toward ensuring that we do not send
unsync'ed transactions back to the client by 1) discarding in-progress segments
if we have a finalized segment starting at the same transaction ID and 2) if
there are no finalized segments at the same transaction ID, using only the
in-progress segments with the largest seen lastWriterEpoch. See the design
document for more background and details. (was: For inotify, we want to be
able to read transactions from in-progress edit log segments so we can serve
transactions to listeners soon after they are committed.
Inotify clients ask the active NN for transactions, and the NN opens
EditLogInputStreams that pull transactions from its various edit repositories
(local directories, individual JournalNodes, etc.). The goal is to send back
only successfully sync’ed transactions. What constitutes a successful sync
varies among the edit repositories. In the case of the JournalNodes, a
successful sync requires a write to a quorum of the JournalNodes. Properly
sync’ed transactions are always applied to the namesystem, and unsync’ed
transactions are never applied. So it is natural that listeners are only
interested in sync’ed transactions.
In particular, the NN reads from EditLogInputStreams that it obtains from
FSEditLog.selectInputStreams (inProgressOk = true) and sends transactions back
to the client. Some infrastructural changes are required to prevent any
unsync’ed transactions from reaching the client, and we tackle those in this
JIRA.
The current implementation of selectInputStreams (inProgressOk = true, fromTxId
= X) in FSEditLog essentially asks all available edit log repositories for all
in-progress or finalized log segments with transactions after and including X,
and combines log segments starting at the same transaction ID in a
RedundantEditLogInputStream.
The first issue with this is that the RedundantEditLogInputStream may combine
finalized and in-progress segments starting at the same txid, and may serve
unsync’ed transactions from the in-progress segments. So we simply discard
in-progress segments if we have any finalized segments starting at the same
txid -- the modification is to JournalSet.chainAndMakeRedundantStreams.
The second change we need to make is for the case where we only have
in-progress segments starting at a particular txid. In this case, we know we
are at the log segment the NN currently has open, since the various journals
ensure that if there are finalized segments available for a particular starting
txid, we are able to see them (e.g. reads from QJM require a quorum of
JournalNodes to respond, and a finalized segment must be present on at least
one JN in the quorum). Our goal in this case is to return only the in-progress
segments being written by the current NN. This happens trivially in the local
edits directory case and the NFS shared edits case, since there is only a
single “replica” and if the NN is writing to it, it is fully up-to-date. So we
focus on QJM here. The key is that JournalNodes maintain a lastWriterEpoch,
which increases monotonically as new writers arrive. So we keep track of the
lastWriterEpochs of the segments we receive from the JNs and discard any
segments with lastWriterEpochs less than the maximum one seen. We know that
when a new writer contacts the JournalNodes, it finalizes all sync’ed
transactions, so in-progress segments with out-of-date lastWriterEpochs may
contain unsync’ed transactions and any sync’ed transactions they contain will
also be contained in finalized segments on a quorum of JNs. So we can safely
discard them. The segments with the largest lastWriterEpochs may also contain
unsync’ed transactions from a previous writer if the current NN has not yet
sync’ed any transactions, but in this case we will make sure to read only from
finalized segments (logic not in this JIRA). These segments may also contain
unsync’ed transactions from the current NN (e.g. ones we read in the window
where the NN has written to fewer than a quorum of JNs but has not yet crashed
due to failure to reach a quorum on write), but we only return to the client
transactions with txids less than or equal to the last txid the NN has sync’ed,
a value it already stores in a variable (logic not in this JIRA). The main
modification for this change is to QuorumJournalManager.selectInputStreams.)
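
As a rough illustration of the two selection rules described above, here is a
minimal Java sketch. It is not the attached patch: the Segment class, its
fields, and select() are hypothetical stand-ins rather than Hadoop APIs, and it
assumes the largest lastWriterEpoch is computed per starting txid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentSelectionSketch {
    /** Hypothetical stand-in for an edit log segment; not a Hadoop class. */
    static final class Segment {
        final long startTxId;
        final boolean inProgress;
        final long lastWriterEpoch;

        Segment(long startTxId, boolean inProgress, long lastWriterEpoch) {
            this.startTxId = startTxId;
            this.inProgress = inProgress;
            this.lastWriterEpoch = lastWriterEpoch;
        }
    }

    /** Applies both rules to each group of segments sharing a start txid. */
    static List<Segment> select(List<Segment> all) {
        // Group segments by starting txid, in ascending txid order.
        Map<Long, List<Segment>> byStart = new TreeMap<>();
        for (Segment s : all) {
            byStart.computeIfAbsent(s.startTxId, k -> new ArrayList<>()).add(s);
        }

        List<Segment> result = new ArrayList<>();
        for (List<Segment> group : byStart.values()) {
            boolean anyFinalized = false;
            long maxEpoch = Long.MIN_VALUE;
            for (Segment s : group) {
                if (!s.inProgress) {
                    anyFinalized = true;
                } else {
                    maxEpoch = Math.max(maxEpoch, s.lastWriterEpoch);
                }
            }
            for (Segment s : group) {
                if (anyFinalized) {
                    // Rule 1: a finalized segment exists, so drop in-progress ones.
                    if (!s.inProgress) {
                        result.add(s);
                    }
                } else if (s.lastWriterEpoch == maxEpoch) {
                    // Rule 2: only in-progress segments; keep the newest writer's.
                    result.add(s);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> kept = select(Arrays.asList(
                new Segment(1, false, 1),    // finalized
                new Segment(1, true, 1),     // in-progress copy, same start txid
                new Segment(101, true, 1),   // stale writer epoch
                new Segment(101, true, 2))); // current writer epoch
        for (Segment s : kept) {
            System.out.println(s.startTxId + " inProgress=" + s.inProgress
                    + " epoch=" + s.lastWriterEpoch);
        }
    }
}
```

The sketch deliberately leaves out the cap on the last sync'ed txid, which the
description notes is handled outside this JIRA.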
> Supporting consistent edit log reads when in-progress edit log segments are
> included
> ------------------------------------------------------------------------------------
>
> Key: HDFS-6777
> URL: https://issues.apache.org/jira/browse/HDFS-6777
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: qjm
> Reporter: James Thomas
> Assignee: James Thomas
> Attachments: 6777-design.pdf, HDFS-6777.patch
>
>
> For inotify, we want to be able to read transactions from in-progress edit
> log segments so we can serve transactions to listeners soon after they are
> committed. This JIRA works toward ensuring that we do not send unsync'ed
> transactions back to the client by 1) discarding in-progress segments if we
> have a finalized segment starting at the same transaction ID and 2) if there
> are no finalized segments at the same transaction ID, using only the
> in-progress segments with the largest seen lastWriterEpoch. See the design
> document for more background and details.