Re: Review Request: ZOOKEEPER-1413: Use on-disk transaction log for learner sync up

fpj Thu, 30 May 2013 07:01:02 -0700


> On May 24, 2013, 12:19 a.m., fpj wrote:
> > /src/java/main/org/apache/zookeeper/server/TxnLogProposalIterator.java, 
> > line 67
> > <https://reviews.apache.org/r/11231/diff/1/?file=293955#file293955line67>
> >
> >     Ideally we propagate the exception up, but it is incompatible with the 
> > signature of the method unless we throw NoSuchElementException. What if we 
> > return null in the case of an exception?
> 
> Thawan Kooburat wrote:
>     IOException can be thrown from  
>     
>     hasNext = itr.next();
>     
>     which for prefetching the next element. 
>     
>     I think we only need to set hasNext to false when we see exception, so 
> the client won't try to call next() again
>
> 
> Raul Gutierrez Segales wrote:
>     Record#serialize() can also throw IOException. Are callers of this method 
> expected to deal with a Proposal that isn't properly initialized (i.e.: won't 
> have the QuorumPacket member won't be initialized so might have wrong zxid, 
> type, etc)?
> 
> Thawan Kooburat wrote:
>     itr.next() is the only place that actually perform IO-related operation. 
>     
>     Record#serialize() can thrown IO exception if the output stream is 
> writing to network or disk. However, in this case, we are writing it to 
> memory because we are using BinaryOutputArchive/ByteArrayOutputStream.  I 
> looked at the code, these implementation cannot throw IOException but they 
> have to inherit the method signature of output stream.
> 
> fpj wrote:
>     My concern here is that we simply log the error and keep making progress 
> regularly, that's why I suggested we could try to propagate an exception up, 
> but it is not completely straightforward because of the reasons I stated. If 
> a server acks a txn but later cannot read it, then we might end up in an 
> inconsistent state.
> 
> Thawan Kooburat wrote:
>     From synchronization perspective, if the leader encounter a possible hole 
> or corrupted txnlog, it will have to abort by clear the txn queue which is 
> going to be sent to the learner. Then, it can fall back to sending a snapshot.
>     
>     The txnlog facility need to be able to guarantee there properties but i 
> don't think we have that yet.  I think we only do per-entry checksum and 
> throws IOexception because of disk read, checksum or deserialization failure.


It is a bit worse than what you're saying. If we miss txns because the log is 
corrupt, then we might end up in a situation in which we don't have a majority 
of servers containing a given txn for an operation that has been replied. In 
such a case, we may end up starting an epoch missing some txns that we have 
replied to. 

If you think that the problem is deeper (and it might be) and we need to do 
more, then perhaps we need to discuss it in another jira. From your response, 
you prefer not to propagate an exception up the call chain.


> On May 24, 2013, 12:19 a.m., fpj wrote:
> > /src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java, line 
> > 861
> > <https://reviews.apache.org/r/11231/diff/1/?file=293960#file293960line861>
> >
> >     I'm confused about this case. The only way I can think of right now of 
> > falling into this case is if the follower has tried to join an epoch that 
> > has never become established. 
> >     
> >     Why does the learner crash?
> 
> Thawan Kooburat wrote:
>     This is also a case when quorum is formed and it has never see any 
> request.  Then, we restart the quorum.
> 
> fpj wrote:
>     When you say that a quorum is formed, which point of the protocol you're 
> referring to? When a quorum has acknowledged NEWLEADER or some earlier point?
> 
> Thawan Kooburat wrote:
>     I think the case that prompt me to add NewEpochZxid-related stuff is 
> below .   Here the leader has seen the follower transaction T(epoch, zxid)   
>     
>     1. T(1, 4)
>     2. T(2, 0)  - NEWLEADER
>     3. T(3, 0)  - NEWLEADER
>     4. T(3, 1)
>     
>     If a follower joined a quorum at epoch 2 and die. Its peerLastZxid can be 
> T(2,0) if it get a snapshot from the leader, or it gets NEWLEADER packet via 
> outstanding proposals.
>     
>     When the same follower try to join the quorum again, the leader will try 
> to ask the follower to TRUNC to T(1,4) which is not possible because there is 
> no txnlog to truncate and the learner will crash. 
>     
>     The fix is to send a DIFF starting at T(1,4) instead.  However, I think I 
> added a block that handle (Cannot send TRUNC to peer ...) just for safety 
> based on the observation that TRUNC can only work if it is within the same 
> epoch.

Right now it sounds correct to me, but I'll think some more about this one. 


> On May 24, 2013, 12:19 a.m., fpj wrote:
> > /src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java, line 
> > 866
> > <https://reviews.apache.org/r/11231/diff/1/?file=293960#file293960line866>
> >
> >     This seems to be more than a warn, right? It should be either error or 
> > fatal.
> 
> Thawan Kooburat wrote:
>     Should be WARN or INFO because we just send SNAP whenever we have a case 
> that we cannot handle. It is correct but not optimized.
> 
> fpj wrote:
>     You're right, but the logic doesn't seem completely straightforward to 
> me. Is it the case that we end up sending a snapshot because of the special 
> case handled by this if block? 
>     
>             if (needOpPacket && !needSnap) {
>                 // This should never happen, but we should fall back to 
> sending
>                 // snapshot just in case.
>     
>                 LOG.error("Unhandled scenario for peer sid: " +  getSid() +
>                          " fall back to use snapshot");
>     
>                 needSnap = true;
>             }
>     
>     If this is right, why do we need special handling instead of just 
> capturing the fact we need a snapshot from the return value of 
> queuedCommittedProposal?
>
> 
> Thawan Kooburat wrote:
>     This block just act as a safety net to send SNAP if there is case that we 
> forgot to handle (in addition to this one).  When I thought about this 
> carefully, I am not sure if I ever see this log line (Cannot send TRUNC to 
> peer sid...) in our prod or not.

I was actually asking how we end up sending a snapshot. If I get it right, we 
need to set needSnap and the only place I could spot that sets needSnap after 
hitting the case I originally commented on (the comment about being error or 
fatal). Is the block I copied above the one that causes us to send a snapshot 
in the case we hit the scenario of LearnerHandler:750? 


- fpj


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11231/#review20941
-----------------------------------------------------------


On May 20, 2013, 5:40 a.m., Thawan Kooburat wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11231/
> -----------------------------------------------------------
> 
> (Updated May 20, 2013, 5:40 a.m.)
> 
> 
> Review request for zookeeper.
> 
> 
> Description
> -------
> 
> ZOOKEEPER-1413: Use on-disk transaction log for learner sync up
> 
> - Use txnlog for learner synchronization if learner fall too far behind
> - Refactoring LearnerHandler to deal with different cases of handling learner 
> synchronization  
> 
> 
> This addresses bug https://issues.apache.org/jira/browse/ZOOKEEPER-1413.
>     
> https://issues.apache.org/jira/browse/https://issues.apache.org/jira/browse/ZOOKEEPER-1413
> 
> 
> Diffs
> -----
> 
>   /src/java/main/org/apache/zookeeper/server/TxnLogProposalIterator.java 
> PRE-CREATION 
>   /src/java/main/org/apache/zookeeper/server/ZKDatabase.java 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/FileTxnLog.java 
> 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java 
> 1483440 
>   /src/java/main/org/apache/zookeeper/server/persistence/TxnLog.java 1483440 
>   /src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java 
> 1483440 
>   /src/java/test/org/apache/zookeeper/server/quorum/LearnerHandlerTest.java 
> PRE-CREATION 
>   /src/java/test/org/apache/zookeeper/test/FollowerResyncConcurrencyTest.java 
> 1483440 
>   /src/java/test/org/apache/zookeeper/test/GetProposalFromTxnTest.java 
> PRE-CREATION 
>   /src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java 1483440 
> 
> Diff: https://reviews.apache.org/r/11231/diff/
> 
> 
> Testing
> -------
> 
> - unit tests
> - ran in prod for more than half a year
> 
> 
> Thanks,
> 
> Thawan Kooburat
> 
>

Re: Review Request: ZOOKEEPER-1413: Use on-disk transaction log for learner sync up

Reply via email to