On Thu, Sep 7, 2017 at 3:18 PM Rajan Dhabalia <rdhaba...@apache.org> wrote:

> > and leave normal queue consumption out of this mechanism, (to reduce
> the ZK writes)
> >> To be precise, these are BookKeeper writes that would be happening
> anyway
>
> Just to clarify: Broker also stores ack-holes in ZK along with BK
> (Cursor-ledger). But Broker only writes it to ZK when broker unloads the
> topic gracefully and deletes the cursor-ledger.
>

That's correct. The cursor state (along with info about messages deleted
individually) is snapshotted into the ZK z-node. That doesn't increase the
rate of ZK writes, just the size of each write when it happens (and with
an upper bound).

> > We could also do that once you reach the max number of "holes" the
> delivery stops
> The only problem I see in restricting based on ack-holes metrics is
> "Ack-hole doesn't follow any pattern and it might not be in sequence".
> *For example:*
> If we have that max-number of ack-hole is = 1K
> and if consumer acks alternate consumed message then there will be 1K
> ack-holes built with in 2K consumed messages, and broker will stop the
> message-delivery.
>

I don't think that acknowledging every other message for an extended
amount of time should be considered a "valid" use case. There are multiple
reasons to acknowledge out of order, but we cannot keep an arbitrarily big
state indefinitely.

> Consumer should not suffer in this usecase where consumer is blocked after
> consuming only 2K messages.
>

There are not many options here, since I think we all agree that we
shouldn't store more than N "holes" in any case. Currently we are not
storing more than 1K "holes" and we just keep the rest of them in memory
only.

So the options would be:

1. Continue as today: if you have more than 1K "holes" (10K should be a
   better default), you can have duplicates after a broker restart. The
   cursor will not know of the "holes" after the first 1K (the resulting
   number of duplicates might be big or small).
2. Stop delivery after 1K holes.
3. Have alerts that get triggered at 2/3 of the max number of "holes" so
   that the user can be notified of an abnormal acknowledgment pattern.

Stopping delivery based on read position vs mark-delete position doesn't
really address this problem.

My personal preference would be to do 1 & 3: same behavior as today, with
clear actionable metrics getting reported.

--
Matteo Merli
<mme...@apache.org>
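
To make the arithmetic in this thread concrete, here is a minimal Python sketch of options 1 and 3 combined. It is only a toy model under stated assumptions, not the broker's managed-cursor code: the class `AckHoleTracker`, its method names, and the way a "hole" is counted (one per run of unacknowledged messages that sits below some acknowledged message) are all hypothetical and chosen for illustration. The 1K limit and the 2/3 alert ratio are the numbers discussed above.

```python
class AckHoleTracker:
    """Toy cursor model: a mark-delete position plus out-of-order acks.

    Hypothetical illustration only; names and structure are assumptions,
    not the broker's actual implementation.
    """

    def __init__(self, max_persisted_holes=1000, alert_fraction=2 / 3):
        # Highest message id such that it and everything below it is acked.
        self.mark_delete = -1
        # Ids acked individually, all strictly above mark_delete.
        self.acked_ahead = set()
        self.max_persisted_holes = max_persisted_holes
        self.alert_threshold = int(max_persisted_holes * alert_fraction)

    def ack(self, msg_id):
        if msg_id <= self.mark_delete:
            return  # already covered by the mark-delete position
        self.acked_ahead.add(msg_id)
        # Advance mark-delete over any contiguous acked prefix.
        while self.mark_delete + 1 in self.acked_ahead:
            self.mark_delete += 1
            self.acked_ahead.remove(self.mark_delete)

    def hole_count(self):
        # Each maximal run of acked ids above mark_delete is preceded by
        # an unacked gap, so counting run starts counts the holes.
        return sum(1 for i in self.acked_ahead if i - 1 not in self.acked_ahead)

    def should_alert(self):
        # Option 3: report once holes reach 2/3 of the persisted limit.
        return self.hole_count() >= self.alert_threshold

    def unpersisted_holes(self):
        # Option 1: holes beyond the limit live in memory only and can
        # lead to duplicates after a broker restart.
        return max(0, self.hole_count() - self.max_persisted_holes)


if __name__ == "__main__":
    tracker = AckHoleTracker(max_persisted_holes=1000)
    # The consumer acks every other message out of 2K consumed messages.
    for msg_id in range(1, 2000, 2):
        tracker.ack(msg_id)
    print("holes:", tracker.hole_count())
    print("alert:", tracker.should_alert())
    print("unpersisted:", tracker.unpersisted_holes())
```

In this model the alternating-ack pattern reaches the 1K limit after 2K consumed messages, as in the example quoted above, and the alert fires well before the point where additional holes would stop being persisted. Delivery is never blocked, which matches the "1 & 3" preference.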