[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-4394:
--------------------------------------
    Labels: pull-request-available  (was: )

> Learner.syncWithLeader got NullPointerException
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-4394
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.7.0
>         Environment: ZooKeeper 3.7.0
>            Reporter: Liu Haifeng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A ZooKeeper follower node hit a NullPointerException during 
> syncWithLeader.
> Logs indicate that the follower received a NEWLEADER packet between a 
> PROPOSAL packet and its corresponding COMMIT packet. The NEWLEADER packet 
> triggers packetsNotCommitted.clear(), yet the COMMIT handler still calls 
> packetsNotCommitted.peekFirst() to fetch the earlier PROPOSAL packet, gets 
> null from the now-empty queue, and the if-statement that follows throws the NPE.
> {code:java}
> case Leader.COMMIT:
> case Leader.COMMITANDACTIVATE:
>     pif = packetsNotCommitted.peekFirst();
>     if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() == Leader.COMMITANDACTIVATE) {
>         // ...
>     }{code}
> After looking into the Leader side, I found:
>  # LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog 
> (PROPOSAL/COMMIT pairs);
>  # Leader.startForwarding queues toBeApplied packets (PROPOSAL/COMMIT pairs);
>  # Leader.startForwarding queues outstandingProposals packets (PROPOSAL only);
>  # LearnerHandler.run sends the NEWLEADER message.
> It seems that if outstandingProposals is not empty at that moment, the 
> follower can receive PROPOSAL, NEWLEADER, and COMMIT packets in that order.
> The follower will retry from LOOKING and is expected to succeed 
> eventually; however, under heavy load this may take too many retries. 
> Furthermore, in my case the follower has to sync data from the leader's disk 
> and starts over again after the NPE (prior sync not flushed?), which may harm 
> the leader.
> I don't know whether this is by design, but considering performance, can 
> we at least avoid wasting network/disk IO?
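
The packet ordering described above can be reproduced in isolation. The following is a minimal sketch, not ZooKeeper code and not a proposed fix: SyncOrderSketch, Packet, Type, and replay are hypothetical names, and the "guarded" branch that skips an orphaned COMMIT is only an assumption used to show where a null check would sit. It relies on one documented fact, that Deque.peekFirst() returns null on an empty deque.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-alone model; none of these types are ZooKeeper classes.
class SyncOrderSketch {
    enum Type { PROPOSAL, NEWLEADER, COMMIT }

    // Minimal stand-in for a quorum packet: a type and a zxid.
    static final class Packet {
        final Type type;
        final long zxid;

        Packet(Type type, long zxid) {
            this.type = type;
            this.zxid = zxid;
        }
    }

    /**
     * Replays packets in the order given and returns the zxids that were committed.
     * With guarded == false, a COMMIT that arrives after the queue was cleared
     * dereferences null, which mirrors the reported NPE.
     */
    static List<Long> replay(List<Packet> packets, boolean guarded) {
        Deque<Packet> packetsNotCommitted = new ArrayDeque<>();
        List<Long> committed = new ArrayList<>();
        for (Packet qp : packets) {
            switch (qp.type) {
                case PROPOSAL:
                    packetsNotCommitted.add(qp);
                    break;
                case NEWLEADER:
                    // The report says NEWLEADER leads to packetsNotCommitted.clear().
                    packetsNotCommitted.clear();
                    break;
                case COMMIT:
                    Packet pif = packetsNotCommitted.peekFirst(); // null when the deque is empty
                    if (guarded && pif == null) {
                        break; // assumption for illustration: ignore the orphaned COMMIT
                    }
                    if (pif.zxid == qp.zxid) { // throws NPE here when pif is null
                        committed.add(packetsNotCommitted.removeFirst().zxid);
                    }
                    break;
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        // The order from the report: PROPOSAL, then NEWLEADER, then the matching COMMIT.
        List<Packet> reported = Arrays.asList(
                new Packet(Type.PROPOSAL, 1L),
                new Packet(Type.NEWLEADER, 0L),
                new Packet(Type.COMMIT, 1L));
        try {
            replay(reported, false);
        } catch (NullPointerException e) {
            System.out.println("unguarded replay hit an NPE");
        }
        System.out.println("guarded replay committed: " + replay(reported, true));
    }
}
```

Whether dropping the COMMIT is acceptable for the real follower is a separate question; the sketch only shows that the reported ordering is enough to empty the queue before the COMMIT handler reads from it.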



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
