[
https://issues.apache.org/jira/browse/ZOOKEEPER-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15305952#comment-15305952
]
Flavio Junqueira commented on ZOOKEEPER-2434:
---------------------------------------------
I realized that as well as I was looking at ZOOKEEPER-1549. I think that logic
needs to change a bit to deal with the issue in that other jira.
As is, the exception will bring the server back into leader election. If it
keeps getting an exception, then the server will be spinning and will make no
useful progress. Depending on what is causing it to throw an exception, it
might be better to exit (e.g., there is some issue with the file system), but
in some cases it will eventually heal itself by pulling a snapshot from the
leader.
Perhaps we should differentiate in {{FileTxnLog.truncate}} between an
IOException from the constructor of {{FileTxnIterator}} and the input stream
being null. The input stream being null is like we couldn't find a file to
truncate (e.g., ZOOKEEPER-1549), so this is a case for returning false.
On weather to exit or keep going, I'm personally not a big fan of exits in the
middle of the code like that, so if we are to shut down the server, perhaps we
should looking into doing it differently. Bringing the broken server back to
leader election has the downside that it will keep making the leader do work to
get it to follow and in the case it isn't recoverable, we will have that broken
server annoying the leader. From an ops point of view, the solution in either
case is to inspect the error and determine whether there is something wrong
with the server data, and one solution is to wipe out that server data.
> Error handling issues in truncation code
> ----------------------------------------
>
> Key: ZOOKEEPER-2434
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2434
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Reporter: Ed Rowe
> Priority: Minor
>
> # If FileTxnLog.truncate() is unable to delete a log file, it calls
> LOG.warn() but otherwise does nothing. I think this should be a fatal error
> not a logged warning. Otherwise those log files are going be be read later
> when the DB is reloaded and data that should have been removed will still be
> present.
> # Learner.syncWithLeader() expects ZKDatabase.truncateLog() to return false
> on failure, and if this occurs it calls System.exit(). However, this will
> never happen because ZKDatabase.truncateLog() never returns false - instead
> an exception is thrown on failure. ZKDatabase.truncateLog() calls
> FileTxnSnapLog.truncateLog() which calls FileTxnLog.truncate(), each of which
> is documented to return false on failure but none of which ever does in
> practice. TruncateTest.testTruncationNullLog() clearly expects an exception
> on error in ZKDatabase.truncateLog() so there are conflicting expectations in
> the codebase. It appears that if Learner.syncWithLeader() encounters an
> exception, System.exit() will _not_ be called and instead we land in the main
> run loop where we'll start the whole thing again. So there are two things to
> deal with: a) whether we want to do system.exit or go back to the main run
> loop if truncation fails, and b) sort out the return false vs. throw
> exception discrepancy and make it consistent (and change the docs as needed).
>
> I'm happy to propose a patch, but I'd need people with more experience in the
> codebase to weigh in on the questions above.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)