[ 
https://issues.apache.org/jira/browse/HDFS-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Thomas updated HDFS-6777:
-------------------------------

    Description: 
For inotify, we want to be able to read transactions from in-progress edit log 
segments so we can serve transactions to listeners soon after they are 
committed.

Inotify clients ask the active NN for transactions, and the NN opens 
EditLogInputStreams that pull transactions from its various edit repositories 
(local directories, individual JournalNodes, etc.). The goal is to send back 
only successfully sync’ed transactions. What constitutes a successful sync 
varies among the edit repositories. In the case of the JournalNodes, a 
successful sync requires a write to a quorum of the JournalNodes. Properly 
sync’ed transactions are always applied to the namesystem, and unsync’ed 
transactions are never applied. So it is natural that listeners are only 
interested in sync’ed transactions.

In particular, the NN reads from EditLogInputStreams that it obtains from 
FSEditLog.selectInputStreams (inProgressOk = true) and sends transactions back 
to the client. Some infrastructural changes are required to prevent any 
unsync’ed transactions from reaching the client, and we tackle those in this 
JIRA.

The current implementation of selectInputStreams (inProgressOk = true, fromTxId 
= X) in FSEditLog essentially asks all available edit log repositories for all 
in-progress or finalized log segments with transactions after and including X, 
and combines log segments starting at the same transaction ID in a 
RedundantEditLogInputStream. 

The first issues with this is that the RedundantEditLogInputStream may combine 
finalized and in-progress segments starting at the same txid, and may serve 
unsync’ed transactions from the in-progress segments. So we simply discard 
in-progress segments if we have any finalized segments starting at the same 
txid -- the modifcation is to JournalSet.chainAndMakeRedundantStreams.

The second change we need to make is for the case where we only have 
in-progress segments starting at a particular txid. In this case, we know we 
are at the log segment the NN currently has open, since the various journals 
ensure that if there are finalized segments available for a particular starting 
txid, we are able to see them (e.g. reads from QJM require a quorum of 
JournalNodes to respond, and a finalized segment must be present on at least 
one JN in the quorum). Our goal in this case is to return only the in-progress 
segments being written by the current NN. This happens trivially in the local 
edits directory case and the NFS shared edits case, since there is only a 
single “replica” and if the NN is writing to it, it is fully up-to-date. So we 
focus on QJM here. The key is that JournalNodes maintain a lastWriterEpoch, 
which increases monotonically as new writers arrive. So we keep track of the 
lastWriterEpochs of the segments we receive from the JNs and discard any 
segments with lastWriterEpochs less than the maximum one seen. We know that 
when a new writer contacts the JournalNodes, it finalizes all sync’ed 
transactions, so in-progress segments with out-of-date lastWriterEpochs may 
contain unsync’ed transactions and any sync’ed transactions they contain will 
also be contained in finalized segments on a quorum on JNs. So we can safely 
discard them. The segments with the largest lastWriterEpochs may also contain 
unsync’ed transactions from a previous writer if the current NN has not yet 
sync’ed any transactions, but in this case we will make sure to read only from 
finalized segments (logic not in this JIRA). These segments may also contain 
unsync’ed transactions from the current NN (e.g. which we read in the window 
where the NN has written to less than a quorum of JNs but has not yet crashed 
due to failure to reach a quorum on write), but we only return to the client 
transactions with txids less than or equal to last txid the NN has sync’ed, a 
value it already stores in a variable (logic not in this JIRA). The main 
modification for this change is to QuorumJournalManager.selectInputStreams.

  was:
For inotify, we want to be able to read transactions from in-progress edit log 
segments so we can serve transactions to listeners soon after they are 
committed.

Inotify clients ask the active NN for transactions, and the NN opens 
EditLogInputStreams that pull transactions from its various edit repositories 
(local directories, individual JournalNodes, etc.). The goal is to send back 
only successfully sync’ed transactions. What constitutes a successful sync 
varies among the edit repositories. In the case of the JournalNodes, a 
successful sync requires a write to a quorum of the JournalNodes. Properly 
sync’ed transactions are always applied to the namesystem, and unsync’ed 
transactions are never applied. So it is natural that listeners are only 
interested in sync’ed transactions.

To achieve this goal, the NN returns to clients only transactions up to the 
last txid it has sync'ed.
Some infrastructural changes are required for this policy to prevent any 
unsync’ed transactions from reaching the client, and we tackle those in this 
JIRA.

The first issue is that the current implementation of selectInputStreams 
(inProgressOk = true, fromTxId = X) in FSEditLog (which the NN uses to read 
transactions) essentially asks all available edit log repositories for all 
in-progress or finalized log segments with transactions after and including X, 
and combines log segments starting at the same transaction ID in a 
RedundantEditLogInputStream. The problem here is that the 
RedundantEditLogInputStream may combine finalized and in-progress segments 
starting at the same txid, and may serve unsync’ed transactions from the 
in-progress segments. So we simply discard in-progress segments if we have any 
finalized segments starting at the same txid -- the modifcation is to 
JournalSet.chainAndMakeRedundantStreams.

The second change we need to make is for the case where we only have 
in-progress segments starting at a particular txid. In this case, we know we 
are at the log segment the NN currently has open, since the various journals 
ensure that if there are finalized segments available for a particular starting 
txid, we are able to see them (e.g. reads from QJM require a quorum of 
JournalNodes to respond, and a finalized segment must be present on at least 
one JN in the quorum). Our goal in this case is to return only the in-progress 
segments being written by the current NN. This happens trivially in the local 
edits directory case and the NFS shared edits case, since there is only a 
single “replica” and if the NN is writing to it, it is fully up-to-date. So we 
focus on QJM here. The key is that JournalNodes maintain a lastWriterEpoch, 
which increases monotonically as new writers arrive. So we keep track of the 
lastWriterEpochs of the segments we receive from the JNs and discard any 
segments with lastWriterEpochs less than the maximum one seen. We know that 
when a new writer contacts the JournalNodes, it finalizes all sync’ed 
transactions, so in-progress segments with out-of-date lastWriterEpochs may 
contain unsync’ed transactions and any sync’ed transactions they contain will 
also be contained in finalized segments on a quorum on JNs. So we can safely 
discard them. The segments with the largest lastWriterEpochs may also contain 
unsync’ed transactions from a previous writer if the current NN has not yet 
sync’ed any transactions, but in this case we will make sure to read only from 
finalized segments. The main modification for this change is to 
QuorumJournalManager.selectInputStreams.


> Supporting consistent edit log reads when in-progress edit log segments are 
> included
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-6777
>                 URL: https://issues.apache.org/jira/browse/HDFS-6777
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: qjm
>            Reporter: James Thomas
>            Assignee: James Thomas
>
> For inotify, we want to be able to read transactions from in-progress edit 
> log segments so we can serve transactions to listeners soon after they are 
> committed.
> Inotify clients ask the active NN for transactions, and the NN opens 
> EditLogInputStreams that pull transactions from its various edit repositories 
> (local directories, individual JournalNodes, etc.). The goal is to send back 
> only successfully sync’ed transactions. What constitutes a successful sync 
> varies among the edit repositories. In the case of the JournalNodes, a 
> successful sync requires a write to a quorum of the JournalNodes. Properly 
> sync’ed transactions are always applied to the namesystem, and unsync’ed 
> transactions are never applied. So it is natural that listeners are only 
> interested in sync’ed transactions.
> In particular, the NN reads from EditLogInputStreams that it obtains from 
> FSEditLog.selectInputStreams (inProgressOk = true) and sends transactions 
> back to the client. Some infrastructural changes are required to prevent any 
> unsync’ed transactions from reaching the client, and we tackle those in this 
> JIRA.
> The current implementation of selectInputStreams (inProgressOk = true, 
> fromTxId = X) in FSEditLog essentially asks all available edit log 
> repositories for all in-progress or finalized log segments with transactions 
> after and including X, and combines log segments starting at the same 
> transaction ID in a RedundantEditLogInputStream. 
> The first issues with this is that the RedundantEditLogInputStream may 
> combine finalized and in-progress segments starting at the same txid, and may 
> serve unsync’ed transactions from the in-progress segments. So we simply 
> discard in-progress segments if we have any finalized segments starting at 
> the same txid -- the modifcation is to 
> JournalSet.chainAndMakeRedundantStreams.
> The second change we need to make is for the case where we only have 
> in-progress segments starting at a particular txid. In this case, we know we 
> are at the log segment the NN currently has open, since the various journals 
> ensure that if there are finalized segments available for a particular 
> starting txid, we are able to see them (e.g. reads from QJM require a quorum 
> of JournalNodes to respond, and a finalized segment must be present on at 
> least one JN in the quorum). Our goal in this case is to return only the 
> in-progress segments being written by the current NN. This happens trivially 
> in the local edits directory case and the NFS shared edits case, since there 
> is only a single “replica” and if the NN is writing to it, it is fully 
> up-to-date. So we focus on QJM here. The key is that JournalNodes maintain a 
> lastWriterEpoch, which increases monotonically as new writers arrive. So we 
> keep track of the lastWriterEpochs of the segments we receive from the JNs 
> and discard any segments with lastWriterEpochs less than the maximum one 
> seen. We know that when a new writer contacts the JournalNodes, it finalizes 
> all sync’ed transactions, so in-progress segments with out-of-date 
> lastWriterEpochs may contain unsync’ed transactions and any sync’ed 
> transactions they contain will also be contained in finalized segments on a 
> quorum on JNs. So we can safely discard them. The segments with the largest 
> lastWriterEpochs may also contain unsync’ed transactions from a previous 
> writer if the current NN has not yet sync’ed any transactions, but in this 
> case we will make sure to read only from finalized segments (logic not in 
> this JIRA). These segments may also contain unsync’ed transactions from the 
> current NN (e.g. which we read in the window where the NN has written to less 
> than a quorum of JNs but has not yet crashed due to failure to reach a quorum 
> on write), but we only return to the client transactions with txids less than 
> or equal to last txid the NN has sync’ed, a value it already stores in a 
> variable (logic not in this JIRA). The main modification for this change is 
> to QuorumJournalManager.selectInputStreams.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to