[jira] [Commented] (HDFS-2709) HA: Appropriately handle error conditions in EditLogTailer

Eli Collins (Commented) (JIRA) Wed, 28 Dec 2011 20:42:59 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176986#comment-13176986
 ]


Eli Collins commented on HDFS-2709:
-----------------------------------

Approach looks great.

* This change handles errors reading an edit from the log (the common case) but 
not failures to apply an edit (eg if there was a bug, or a silent corruption 
somehow went unnoticed). loadEdits doesn't swallow this exception (it throws), 
but it gets propagated up to the catch of Throwable in EditLogTailer#run, so we 
effectively retry endlessly in this case. Need to replace the TODO(HA) comment 
there with code to shut down the SBN. Feel free to punt to another jira.
* How about adding a test that uses multiple shared edits dirs and shows that 
a failure to read from one of them will cause the tailer to not catch up? We 
can file a jira for a future change that tolerates faulty shared dirs as long 
as one is working.
* In FileJournalManager#getNumberOfTransactions, now that we loosen the check 
to elf.containsTxId(fromTxid), isn't the last else case dead code?
* I think we can remove the "TODO(HA): Should this happen when called by the 
tailer?" comment in loadEdits, right? We always create new streams when we 
select them.
* Would it be simpler in LimitedEditLogAnswer#answer to spy on each stream and 
stub readOp rather than introduce LimitedEditLogInputStream?
* How about introducing DFSHATestUtil and putting waitForStandbyToCatchUp and 
CouldNotCatchUpException there? Seems like the methods you pointed out in the 
HDFS-2692 review could go there as well.
* Nit: "IOException e", s/e/ioe/
* testFailuretoReadEdits needs a javadoc
* waitForStandbyToCatchUp needs a javadoc indicating it waits for 
NN_LAG_TIMEOUT then throws CouldNotCatchUpException
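
To make the first point concrete, here is a rough sketch of separating the two 
failure modes in a single tailing iteration. Every name in it (TailerSketch, 
EditSource, EditApplier, Outcome, tailOnce) is hypothetical and not taken from 
the patch or from EditLogTailer. It also assumes that a retry resumes after the 
last applied transaction, and that the caller reacts to ABORT by shutting down 
the SBN through whatever path is appropriate.

```java
import java.io.IOException;

public class TailerSketch {
  /** Hypothetical source of edits; returns null at the end of the log. */
  interface EditSource {
    String readOp() throws IOException;
  }

  /** Hypothetical sink that applies one edit to the in-memory state. */
  interface EditApplier {
    void apply(String op);
  }

  /** What the caller should do after one tailing iteration. */
  enum Outcome { CAUGHT_UP, RETRY_LATER, ABORT }

  /**
   * Runs one tailing iteration. A failure to read an edit is treated as
   * retryable; any unexpected runtime failure (for example while applying
   * an edit) is treated as fatal instead of being retried forever.
   */
  static Outcome tailOnce(EditSource in, EditApplier applier) {
    try {
      String op;
      while ((op = in.readOp()) != null) {
        applier.apply(op);
      }
      return Outcome.CAUGHT_UP;
    } catch (IOException ioe) {
      // Could not read an edit: try again on the next iteration.
      return Outcome.RETRY_LATER;
    } catch (RuntimeException e) {
      // Could not apply an edit: state may be inconsistent, so give up.
      return Outcome.ABORT;
    }
  }
}
```

The point of returning an Outcome rather than catching Throwable in the loop 
is that the decision to stop is explicit and easy to test.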
                
> HA: Appropriately handle error conditions in EditLogTailer
> ----------------------------------------------------------
>
>                 Key: HDFS-2709
>                 URL: https://issues.apache.org/jira/browse/HDFS-2709
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Aaron T. Myers
>            Priority: Critical
>         Attachments: HDFS-2709-HDFS-1623.patch, HDFS-2709-HDFS-1623.patch
>
>
> Currently if the edit log tailer experiences an error replaying edits in the 
> middle of a file, it will go back to retrying from the beginning of the file 
> on the next tailing iteration. This is incorrect since many of the edits will 
> have already been replayed, and not all edits are idempotent.
> Instead, we either need to (a) support reading from the middle of a finalized 
> file (ie skip those edits already applied), or (b) abort the standby if it 
> hits an error while tailing. If "a" isn't simple, let's do "b" for now and 
> come back to "a" later since this is a rare circumstance and it is better to abort 
> than be incorrect.
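
Option (a) in the description above amounts to remembering the last applied 
transaction ID and skipping anything at or below it when a file is read again 
from the start. A minimal sketch of that idea follows, assuming edits carry 
monotonically increasing transaction IDs. The class ResumeSketch, its Edit 
type, and the counter standing in for namespace state are all hypothetical and 
are not HDFS code.

```java
import java.util.List;

public class ResumeSketch {
  /** Hypothetical edit: a transaction ID plus an amount to add to a counter. */
  static final class Edit {
    final long txId;
    final long delta;

    Edit(long txId, long delta) {
      this.txId = txId;
      this.delta = delta;
    }
  }

  private long lastAppliedTxId = 0;
  // Stand-in for namespace state. Adding a delta is not idempotent, so
  // replaying an edit twice would corrupt the total.
  private long counter = 0;

  /** Replays a segment from its beginning, skipping edits already applied. */
  void replay(List<Edit> segment) {
    for (Edit e : segment) {
      if (e.txId <= lastAppliedTxId) {
        continue; // already applied during an earlier, interrupted pass
      }
      counter += e.delta;
      lastAppliedTxId = e.txId;
    }
  }

  long counter() {
    return counter;
  }

  long lastAppliedTxId() {
    return lastAppliedTxId;
  }
}
```

Calling replay on the first part of a file and then again on the whole file 
leaves the counter equal to the sum of the deltas, whereas a naive replay from 
the beginning would count the early edits twice.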

