> On May 24, 2013, 12:19 a.m., fpj wrote:
> > /src/java/main/org/apache/zookeeper/server/TxnLogProposalIterator.java, 
> > line 67
> > <https://reviews.apache.org/r/11231/diff/1/?file=293955#file293955line67>
> >
> >     Ideally we propagate the exception up, but it is incompatible with the 
> > signature of the method unless we throw NoSuchElementException. What if we 
> > return null in the case of an exception?
> 
> Thawan Kooburat wrote:
>     IOException can be thrown from  
>     
>     hasNext = itr.next();
>     
>     which for prefetching the next element. 
>     
>     I think we only need to set hasNext to false when we see exception, so 
> the client won't try to call next() again
>
> 
> Raul Gutierrez Segales wrote:
>     Record#serialize() can also throw IOException. Are callers of this method 
> expected to deal with a Proposal that isn't properly initialized (i.e.: won't 
> have the QuorumPacket member won't be initialized so might have wrong zxid, 
> type, etc)?
> 
> Thawan Kooburat wrote:
>     itr.next() is the only place that actually perform IO-related operation. 
>     
>     Record#serialize() can thrown IO exception if the output stream is 
> writing to network or disk. However, in this case, we are writing it to 
> memory because we are using BinaryOutputArchive/ByteArrayOutputStream.  I 
> looked at the code, these implementation cannot throw IOException but they 
> have to inherit the method signature of output stream.
> 
> fpj wrote:
>     My concern here is that we simply log the error and keep making progress 
> regularly, that's why I suggested we could try to propagate an exception up, 
> but it is not completely straightforward because of the reasons I stated. If 
> a server acks a txn but later cannot read it, then we might end up in an 
> inconsistent state.
> 
> Thawan Kooburat wrote:
>     From synchronization perspective, if the leader encounter a possible hole 
> or corrupted txnlog, it will have to abort by clear the txn queue which is 
> going to be sent to the learner. Then, it can fall back to sending a snapshot.
>     
>     The txnlog facility need to be able to guarantee there properties but i 
> don't think we have that yet.  I think we only do per-entry checksum and 
> throws IOexception because of disk read, checksum or deserialization failure.
> 
> fpj wrote:
>     It is a bit worse than what you're saying. If we miss txns because the 
> log is corrupt, then we might end up in a situation in which we don't have a 
> majority of servers containing a given txn for an operation that has been 
> replied. In such a case, we may end up starting an epoch missing some txns 
> that we have replied to. 
>     
>     If you think that the problem is deeper (and it might be) and we need to 
> do more, then perhaps we need to discuss it in another jira. From your 
> response, you prefer not to propagate an exception up the call chain.
> 
> Thawan Kooburat wrote:
>     This depends on how we decide to handle the error.  With the current 
> code, it will be easier but slower to verify the integrity of the entire 
> txnlog file during db.getProposalsFromTxnLog() call,  so the leader won't 
> enter try to use txnlog at all and fallback to use snapshot. If we decide to 
> load txnlog on-demand, we need a way to propagate the exception and abort the 
> DIFF attempt.
>     
>     Right now, I am not sure if we have a way to verify the integrity of 
> txnlog file.  For example, we should have a notion that a file is already 
> closed and it should contains txn [1..10].  If we get IOException before 
> reading txn 10, its means that txnlog is corrupted. If a file is not close, 
> it means that this is the latest file so failing to reading txn 10 may a 
> result of a partial write, so it is ok to ignore.
>     
>     I think there are 2 main problems that can cause txnlog to fail integrity 
> check needed by this feature: the file is partially corrupt or the operator 
> didn't copied all the txnlog file. We may have to have a separate JIRA to 
> deal with these problem.
> 
> fpj wrote:
>     Ok, let's wrap up this jira with the changes we agreed upon and discuss 
> txn log integrity separately. Could you create a jira for this discussion, 
> please?

Ok,  we can discuss about this in ZOOKEEPER-1710


- Thawan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11231/#review20941
-----------------------------------------------------------


On May 20, 2013, 5:40 a.m., Thawan Kooburat wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11231/
> -----------------------------------------------------------
> 
> (Updated May 20, 2013, 5:40 a.m.)
> 
> 
> Review request for zookeeper.
> 
> 
> Description
> -------
> 
> ZOOKEEPER-1413: Use on-disk transaction log for learner sync up
> 
> - Use txnlog for learner synchronization if learner fall too far behind
> - Refactoring LearnerHandler to deal with different cases of handling learner 
> synchronization  
> 
> 
> This addresses bug https://issues.apache.org/jira/browse/ZOOKEEPER-1413.
>     
> https://issues.apache.org/jira/browse/https://issues.apache.org/jira/browse/ZOOKEEPER-1413
> 
> 
> Diffs
> -----
> 
>   /src/java/main/org/apache/zookeeper/server/TxnLogProposalIterator.java 
> PRE-CREATION 
>   /src/java/main/org/apache/zookeeper/server/ZKDatabase.java 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java 
> 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java 
> 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/TxnLog.java 1483440 
>   /src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java 
> 1483440 
>   /src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java 
> PRE-CREATION 
>   /src/java/test/org/apache/zookeeper/test/FollowerResyncConcurrencyTest.java 
> 1483440 
>   /src/java/test/org/apache/zookeeper/test/GetProposalFromTxnTest.java 
> PRE-CREATION 
>   /src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java 1483440 
> 
> Diff: https://reviews.apache.org/r/11231/diff/
> 
> 
> Testing
> -------
> 
> - unit tests
> - ran in prod for more than half a year
> 
> 
> Thanks,
> 
> Thawan Kooburat
> 
>

Reply via email to