[ 
https://issues.apache.org/jira/browse/HDFS-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Thomas updated HDFS-6777:
-------------------------------

    Attachment: 6777-design.pdf

> Supporting consistent edit log reads when in-progress edit log segments are 
> included
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-6777
>                 URL: https://issues.apache.org/jira/browse/HDFS-6777
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: qjm
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: 6777-design.pdf, HDFS-6777.patch
>
>
> For inotify, we want to be able to read transactions from in-progress edit 
> log segments so we can serve transactions to listeners soon after they are 
> committed.
> Inotify clients ask the active NN for transactions, and the NN opens 
> EditLogInputStreams that pull transactions from its various edit repositories 
> (local directories, individual JournalNodes, etc.). The goal is to send back 
> only successfully sync’ed transactions. What constitutes a successful sync 
> varies among the edit repositories. In the case of the JournalNodes, a 
> successful sync requires a write to a quorum of the JournalNodes. Properly 
> sync’ed transactions are always applied to the namesystem, and unsync’ed 
> transactions are never applied. So it is natural that listeners are only 
> interested in sync’ed transactions.
> In particular, the NN reads from EditLogInputStreams that it obtains from 
> FSEditLog.selectInputStreams (inProgressOk = true) and sends transactions 
> back to the client. Some infrastructural changes are required to prevent any 
> unsync’ed transactions from reaching the client, and we tackle those in this 
> JIRA.
> The current implementation of selectInputStreams (inProgressOk = true, 
> fromTxId = X) in FSEditLog essentially asks all available edit log 
> repositories for all in-progress or finalized log segments with transactions 
> after and including X, and combines log segments starting at the same 
> transaction ID in a RedundantEditLogInputStream. 
> The first issues with this is that the RedundantEditLogInputStream may 
> combine finalized and in-progress segments starting at the same txid, and may 
> serve unsync’ed transactions from the in-progress segments. So we simply 
> discard in-progress segments if we have any finalized segments starting at 
> the same txid -- the modifcation is to 
> JournalSet.chainAndMakeRedundantStreams.
> The second change we need to make is for the case where we only have 
> in-progress segments starting at a particular txid. In this case, we know we 
> are at the log segment the NN currently has open, since the various journals 
> ensure that if there are finalized segments available for a particular 
> starting txid, we are able to see them (e.g. reads from QJM require a quorum 
> of JournalNodes to respond, and a finalized segment must be present on at 
> least one JN in the quorum). Our goal in this case is to return only the 
> in-progress segments being written by the current NN. This happens trivially 
> in the local edits directory case and the NFS shared edits case, since there 
> is only a single “replica” and if the NN is writing to it, it is fully 
> up-to-date. So we focus on QJM here. The key is that JournalNodes maintain a 
> lastWriterEpoch, which increases monotonically as new writers arrive. So we 
> keep track of the lastWriterEpochs of the segments we receive from the JNs 
> and discard any segments with lastWriterEpochs less than the maximum one 
> seen. We know that when a new writer contacts the JournalNodes, it finalizes 
> all sync’ed transactions, so in-progress segments with out-of-date 
> lastWriterEpochs may contain unsync’ed transactions and any sync’ed 
> transactions they contain will also be contained in finalized segments on a 
> quorum on JNs. So we can safely discard them. The segments with the largest 
> lastWriterEpochs may also contain unsync’ed transactions from a previous 
> writer if the current NN has not yet sync’ed any transactions, but in this 
> case we will make sure to read only from finalized segments (logic not in 
> this JIRA). These segments may also contain unsync’ed transactions from the 
> current NN (e.g. which we read in the window where the NN has written to less 
> than a quorum of JNs but has not yet crashed due to failure to reach a quorum 
> on write), but we only return to the client transactions with txids less than 
> or equal to last txid the NN has sync’ed, a value it already stores in a 
> variable (logic not in this JIRA). The main modification for this change is 
> to QuorumJournalManager.selectInputStreams.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to