[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305952#comment-15305952
 ] 

Flavio Junqueira commented on ZOOKEEPER-2434:
---------------------------------------------

I realized that as well as I was looking at ZOOKEEPER-1549. I think that logic 
needs to change a bit to deal with the issue in that other jira.

As is, the exception will bring the server back into leader election. If it 
keeps getting an exception, then the server will be spinning and will make no 
useful progress. Depending on what is causing it to throw an exception, it 
might be better to exit (e.g., there is some issue with the file system), but 
in some cases it will eventually heal itself by pulling a snapshot from the 
leader. 

Perhaps we should differentiate in {{FileTxnLog.truncate}} between an 
IOException from the constructor of {{FileTxnIterator}} and the input stream 
being null. The input stream being null is like we couldn't find a file to 
truncate (e.g., ZOOKEEPER-1549), so this is a case for returning false. 

On weather to exit or keep going, I'm personally not a big fan of exits in the 
middle of the code like that, so if we are to shut down the server, perhaps we 
should looking into doing it differently. Bringing the broken server back to 
leader election has the downside that it will keep making the leader do work to 
get it to follow and in the case it isn't recoverable, we will have that broken 
server annoying the leader. From an ops point of view, the solution in either 
case is to inspect the error and determine whether there is something wrong 
with the server data, and one solution is to wipe out that server data.

> Error handling issues in truncation code
> ----------------------------------------
>
>                 Key: ZOOKEEPER-2434
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2434
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Ed Rowe
>            Priority: Minor
>
> # If FileTxnLog.truncate() is unable to delete a log file, it calls 
> LOG.warn() but otherwise does nothing. I think this should be a fatal error 
> not a logged warning. Otherwise those log files are going be be read later 
> when the DB is reloaded and data that should have been removed will still be 
> present. 
> # Learner.syncWithLeader() expects ZKDatabase.truncateLog() to return false 
> on failure, and if this occurs it calls System.exit(). However, this will 
> never happen because ZKDatabase.truncateLog() never returns false - instead 
> an exception is thrown on failure. ZKDatabase.truncateLog() calls 
> FileTxnSnapLog.truncateLog() which calls FileTxnLog.truncate(), each of which 
> is documented to return false on failure but none of which ever does in 
> practice. TruncateTest.testTruncationNullLog() clearly expects an exception 
> on error in ZKDatabase.truncateLog() so there are conflicting expectations in 
> the codebase. It appears that if Learner.syncWithLeader() encounters an 
> exception, System.exit() will _not_ be called and instead we land in the main 
> run loop where we'll start the whole thing again. So there are two things to 
> deal with: a) whether we want to do system.exit or go back to the main run 
> loop if truncation fails, and b) sort out the return false vs. throw 
> exception discrepancy and make it consistent (and change the docs as needed).
>    
> I'm happy to propose a patch, but I'd need people with more experience in the 
> codebase to weigh in on the questions above.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to