[ https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176986#comment-13176986 ]
Eli Collins commented on HDFS-2709: ----------------------------------- Approach looks great. * This change handles errors reading an edit from the log (the common case) but not when there's a failure to apply an edit (eg if there was a bug, or a silent corruption somehow went unnoticed). While loadEdits won't ignore (will throw) this exception it does get propagated up to the catch of Throwable in EditLogTailer#run so we effectively retry endlessly in this case. Need to replace the TODO(HA) comment there with code to shutdown the SBN. Feel free to punt to another jira. * How about adding a test that uses multiple shared edits dirs, and shows that a failure to read from one of them will cause the tailer to not catch up, can file a jira for a future change that is OK with faulty shared dirs as long as one is working. * In FileJournalManager#getNumberOfTransactions, not that the we loosen the check to elf.containsTxId(fromTxid) isn't the last else case dead code? * I think we can remove the "TODO(HA): Should this happen when called by the tailer?" comment in loadEdits right since we always create new streams when we select them? * Would it be simpler in LimitedEditLogAnswer#answer to spy on each stream and stub readOp rather than introduce LimitedEditLogInputStream? * How about introducing DFSHATestUtil and put waitForStandbyToCatchUp and CouldNotCatchUpException there? Seems like the methods you pointed out in the HDFS-2692 review could go there as well). * Nit: "IOException e", s/e/ioe/ * testFailuretoReadEdits needs a javadoc * waitForStandbyToCatchUp needs a javadoc indicating it waits for NN_LAG_TIMEOUT then throws CouldNotCatchUp > HA: Appropriately handle error conditions in EditLogTailer > ---------------------------------------------------------- > > Key: HDFS-2709 > URL: https://issues.apache.org/jira/browse/HDFS-2709 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ha, name-node > Affects Versions: HA branch (HDFS-1623) > Reporter: Todd Lipcon > Assignee: Aaron T. Myers > Priority: Critical > Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch > > > Currently if the edit log tailer experiences an error replaying edits in the > middle of a file, it will go back to retrying from the beginning of the file > on the next tailing iteration. This is incorrect since many of the edits will > have already been replayed, and not all edits are idempotent. > Instead, we either need to (a) support reading from the middle of a finalized > file (ie skip those edits already applied), or (b) abort the standby if it > hits an error while tailing. If "a" isn't simple, let's do "b" for now and > come back to 'a' later since this is a rare circumstance and better to abort > than be incorrect. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira