Re: Review Request 19566: QPID-5642: Message sequence numbers (enabled with qpid.msg_sequence) should be persistent across broker restarts

Pavel Moravec Wed, 02 Apr 2014 02:20:18 -0700


> On April 1, 2014, 9:17 p.m., Alan Conway wrote:
> > Sorry, catching up. I think this patch is abusing the bootsequence which 
> > really has nothing at all to do with exchange sequence numbers. It would 
> > work on a standalone (but still be an abuse IMO) broker but it doesn't work 
> > for HA. The base for exchage sequence numbers has to be synchronized across 
> > a HA cluster, whereas the number of times each broker has rebooted *cannot* 
> > be synchronized without breaking its existing meaning as a reboot count. 
> > I'm also a bit wary of the division of sequence number into 17 bits for 
> > "restart" count 47 bits for "message" count. What are the grounds for that 
> > division? Is it safe to assume no overflows?
> > 
> > I see the goal of avoiding persisting the sequence number on every message, 
> > but how about persisting it (per exchange) on some suitably large number of 
> > messages? E.g. persist an initial next-sequence of 64k for each exchange, 
> > with the sequence number starting at 1. Once the sequence number gets 
> > within 32k of next-sequence we increase next-sequence of by 64k etc. On 
> > reboot or on promotion of the primary we jump each exchange sequence number 
> > ahead to the exchange's next-sequence number.
> > 
> > That lets us take smaller jumps on each restart/failover (16 bits vs. 47 
> > bits) and avoids using a single counter for every exchange which makes 
> > overflow less likely. The next-sequence number for each exchange would need 
> > to be persisted, replicated on joining (both can be done by putting it back 
> > in the args) and replicated via a management event when it is increased on 
> > the primary. Jumping the next-sequence number ahead when we still have 32k 
> > messages left to go means that we have time to store & replicate before we 
> > pass the new limit. The next-sequence doesn't have to be precisely 
> > synchronized in a cluster provided it is always a good margin bigger than 
> > the highest sequence number used by that exchange on the primary so far. 
> > During promotion (before allowing backups to join) we would need to jump 
> > each exchange sequence ahead to the next-sequence for the exchange and 
> > increase next-sequence by 64k to make sure the new primary is using higher 
> > sequence numbers than the old.
> > 
> > Are we sure nobody assumes that exchange sequence numbers are sequential? 
> > The above is fine for duplicate detection but are we sure nobody does 
> > missed message detection by checking for gaps in the exchange sequence 
> > numbers? That would complicate everything quite a bit...
> > 
> > 
> >


This seems to be sound to me. To sum it up:
- exchanges with sequencing to have new argument qpid.next_sequenceNo, by 
default set to 64k for new exchanges, or to the value read from store during 
startup (or being updated from primary broker)
  - if the argument is missing in stored args, it will be automatically added 
there (to support upgrades)
- when incrementing sequenceNo in preRoute, if args(qpid.next_sequenceNo) - 
sequenceNo == 32k, then execute 3 steps:
  - increment qpid.next_sequenceNo by 64k
  - update store bdb for the exchange
  - if clustered broker and (primary or being promoted), notify backups about 
the increment of qpid.next_sequenceNo
- when joining a cluster, active broker will update the joiner one 
automatically (via exchange arguments) - no need to change
- when promoting a node to active:
  - set sequenceNo = args(qpid.next_sequenceNo)
  - execute the same 3 steps from "when incrementing sequenceNo"

Two questions:
1) Can a node being promoted already update backups? Or is there a way of 
postponing the update? (see latest point above)
2) What if - after a new promotion - the broker sends few messages with 
sequenceNo say 128k but dies before it really writes to the disk 
args(qpid.next_sequenceNo)=(128+64)k ? Can it even happen?


Sequence number will then jump ahead by at most 96k whenever a standalone 
broker is restarted or new promotion in cluster happens (or cold start of 
cluster happens). That means, there can be 93824992236885 such "jump events" 
without sequence number overflow (or 93824671436104 "jump events" for steadily 
message flow of 1M messages per second for 10 years). More than fair enough.


Comment to "Are we sure nobody assumes that exchange sequence numbers are 
sequential?" - Consumers get messages from queues and no queue can be 
guaranteed to get _all_ messages routed by the exchange (until all know the 
exchange is fanout or binding key is empty or so - but I think nobody can build 
on this premise). So jumps in exchange sequence numbers can be always meant as 
"the skipped messages were not meant for this queue".


- Pavel


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19566/#review39200
-----------------------------------------------------------


On April 1, 2014, 10:45 a.m., Pavel Moravec wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19566/
> -----------------------------------------------------------
> 
> (Updated April 1, 2014, 10:45 a.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Kim van der Riet.
> 
> 
> Bugs: QPID-5642
>     https://issues.apache.org/jira/browse/QPID-5642
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Elegant (but not performance optimal) way of patch:
> 
> 1) In Exchange::PreRoute::PreRoute, update exchange in store (whole entry).
> 2) The update method is dealed very similarly like 
> MessageStoreImpl::create(db_ptr db, IdSequence& seq, const 
> qpid::broker::Persistable& p) method, i.e. calls BufferValue that calls 
> Exchange::encode.
> 
> Here the code can be unified by merging MessageStoreImpl::update 
> intoMessageStoreImpl::create method where the code almost duplicates.
> 
> However I do not see the patch as performance efficient, as with every 
> message preroute, new qpid::framing::Buffer is filled in Exchange::encode 
> method, data are copied from it to char* BufferValue::data and even then they 
> are really written to the BDB. While in fact we just update the only one 
> number in the Buffer.
> 
> I tried to come up with less performance invasive approach (for those 
> interested, see 
> https://bugzilla.redhat.com/attachment.cgi?id=877576&action=diff - if you 
> dont have access there, let me write), that keeps qpid::framing::Buffer for 
> every durable exchange with sequencing enabled, but it returned (among 
> others) into the need of changing the way store encodes/decodes Exchange 
> instance (change in Exchange::encode / decode methods). What would make the 
> broker backward incompatible.
> 
> Is the performance penalty (due to Exchange::encode method called for every 
> message preroute) acceptable?
> 
> Is it worth merging MessageStoreImpl::update intoMessageStoreImpl::create 
> method?
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/Exchange.cpp 1582719 
> 
> Diff: https://reviews.apache.org/r/19566/diff/
> 
> 
> Testing
> -------
> 
> - reproducer from JIRA verified
> - automated tests passed (except for those known to fail due to QPID-5641 
> (valgrind & legacystore)
> 
> 
> Thanks,
> 
> Pavel Moravec
> 
>

Re: Review Request 19566: QPID-5642: Message sequence numbers (enabled with qpid.msg_sequence) should be persistent across broker restarts

Reply via email to