[
https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ZOOKEEPER-4394:
--------------------------------------
Labels: pull-request-available (was: )
> Learner.syncWithLeader got NullPointerException
> -----------------------------------------------
>
> Key: ZOOKEEPER-4394
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.7.0
> Environment: ZooKeeper 3.7.0
> Reporter: Liu Haifeng
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A ZooKeeper follower node encountered a NullPointerException during
> syncWithLeader.
> Logs indicate that the follower received a NEWLEADER packet between a
> PROPOSAL packet and its corresponding COMMIT packet. The NEWLEADER packet
> leads to packetsNotCommitted.clear(), yet the COMMIT packet still calls
> packetsNotCommitted.peekFirst() to get the earlier PROPOSAL packet. That
> call returns null on the now-empty queue, and the if-statement that follows
> dereferences it and raises the NPE.
> {code:java}
> case Leader.COMMIT:
> case Leader.COMMITANDACTIVATE:
>     pif = packetsNotCommitted.peekFirst();
>     if (pif.hdr.getZxid() == qp.getZxid()
>             && qp.getType() == Leader.COMMITANDACTIVATE) {
>         // ...
>     }
> {code}
> After looking into the leader side, I found:
> # LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog
> (PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues toBeApplied packets (PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues outstandingProposals packets (PROPOSAL only);
> # LearnerHandler.run sends the NEWLEADER message.
> It seems that if outstandingProposals is not empty at that moment, the
> follower can receive PROPOSAL/NEWLEADER/COMMIT packets in that order.
> The follower will retry from LOOKING and is expected to succeed
> eventually; however, under heavy load it may take too many retries.
> Furthermore, in my case the follower has to sync data from the leader's disk
> and starts over again after the NPE (prior sync not flushed?), which may
> harm the leader.
> I don't know whether this is by design, but considering the performance, can
> we at least avoid wasting network/disk IO?
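
For illustration, the packet ordering described in the report can be replayed against a plain java.util.Deque. This is only a simplified sketch, not ZooKeeper code: the class SyncOrderSketch, the Type enum, the Packet class and the replay method are hypothetical names, and the null check is one possible guard under the assumption that a COMMIT without a pending PROPOSAL can be detected and counted instead of dereferenced. It is not the fix proposed in the pull request.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical, simplified model of the follower-side queue from the report.
public class SyncOrderSketch {

    // Stand-ins for the packet types named in the report.
    public enum Type { PROPOSAL, COMMIT, NEWLEADER }

    // Simplified packet: only a type and a zxid.
    public static final class Packet {
        final Type type;
        final long zxid;

        public Packet(Type type, long zxid) {
            this.type = type;
            this.zxid = zxid;
        }
    }

    // Replays the packets in order and returns how many COMMITs arrived
    // while no PROPOSAL was pending (the situation that caused the NPE).
    public static int replay(List<Packet> packets) {
        Deque<Packet> packetsNotCommitted = new ArrayDeque<>();
        int commitsWithoutProposal = 0;
        for (Packet p : packets) {
            switch (p.type) {
                case PROPOSAL:
                    packetsNotCommitted.add(p);
                    break;
                case NEWLEADER:
                    // Per the report, NEWLEADER leads to clear().
                    packetsNotCommitted.clear();
                    break;
                case COMMIT:
                    // peekFirst() returns null on an empty deque.
                    Packet pif = packetsNotCommitted.peekFirst();
                    if (pif == null) {
                        // Assumed guard: count instead of dereferencing null.
                        commitsWithoutProposal++;
                    } else if (pif.zxid == p.zxid) {
                        packetsNotCommitted.removeFirst();
                    }
                    break;
            }
        }
        return commitsWithoutProposal;
    }

    // Order from the report: PROPOSAL, NEWLEADER, COMMIT.
    public static List<Packet> reportedOrder() {
        return Arrays.asList(
                new Packet(Type.PROPOSAL, 1L),
                new Packet(Type.NEWLEADER, 0L),
                new Packet(Type.COMMIT, 1L));
    }

    // Order in which the COMMIT still finds its PROPOSAL.
    public static List<Packet> pairedOrder() {
        return Arrays.asList(
                new Packet(Type.PROPOSAL, 1L),
                new Packet(Type.COMMIT, 1L),
                new Packet(Type.NEWLEADER, 0L));
    }

    public static void main(String[] args) {
        System.out.println("reported order: " + replay(reportedOrder()));
        System.out.println("paired order:   " + replay(pairedOrder()));
    }
}
```

In this sketch the reported order yields one COMMIT without a pending PROPOSAL, while the paired order yields none. Without the null check, the reported order would fail at the zxid comparison in the same way as the snippet quoted above.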
--
This message was sent by Atlassian Jira
(v8.20.10#820010)