codelipenghui commented on code in PR #21118: URL: https://github.com/apache/pulsar/pull/21118#discussion_r1315274862
##########
pip/pip-299.md:
##########
@@ -0,0 +1,79 @@
+# Background knowledge
+
+The cursor metadata contains the following three kinds of data:
+- Subscription properties (usually small).
+- The last sequence ID of each producer. It only exists for the cursor `pulsar.dedup`. If a topic has a very large number of producers, this part of the data will be large. See [PIP-6:-Guaranteed-Message-Deduplication](https://github.com/apache/pulsar/wiki/PIP-6:-Guaranteed-Message-Deduplication) for more details.
+- Individual Deleted Messages (including the acknowledgment of batched messages). This part of the data occupies most of the cursor metadata's space and is the focus of this proposal.
+
+Unlike Kafka, Pulsar supports [individual acknowledgment](https://pulsar.apache.org/docs/2.11.x/concepts-messaging/#acknowledgment) (for example, ack `{pos-1, pos-3, pos-5}`). So instead of a single pointer (acknowledged on the left, unacknowledged on the right), Pulsar needs to persist the acknowledgment state of each message. We call these records `Individual Deleted Messages`.
+
+The current persistence mechanism of the cursor metadata (including `Individual Deleted Messages`) works like this:
+1. Write the cursor metadata (including `Individual Deleted Messages`) to BK in one Entry; by default, the maximum size of the Entry is 5MB.
+2. If the BK write fails, write the cursor metadata (optionally including `Individual Deleted Messages`) to the Metadata Store (such as ZK); it is recommended to keep the data of a Metadata Store node under 10MB. Since frequently writing large chunks of data to the Metadata Store makes it unstable, this is only a backstop measure.
+
+Is 5MB enough? `Individual Deleted Messages` consists of Position_Rang entries (each Position_Rang occupies 32 bytes; the implementation will not be explained in this proposal). This means that the Broker can persist `5MB / 32 bytes` Position_Rang entries for each Subscription, and there is an additional compression mechanism at work, so it is sufficient for almost all scenarios except the following four:
+- Client Miss Acknowledges: Clients receive many messages and ack only some of them; the rest remain unacknowledged due to errors or other reasons. Over time, more and more records accumulate.
+- Delayed Messages: Long-delayed and short-delayed messages are mixed; only the short-delayed messages are consumed, while the long-delayed messages have not yet been delivered. Over time, more and more records accumulate.
+- Large Number of Consumers: If the number of consumers is large and each has some discrete ack records, they add up to a large number.
+- Large Number of Producers: If the number of producers is large, there may be a large amount of Last Sequence ID data to persist. This scenario only exists on the `pulsar.dedup` cursor.
+
+# Motivation
+
+Since persisting `Individual Deleted Messages` frequently would amplify the volume of BK writes and increase ack-response latency, the Broker does not persist it immediately when it receives a consumer's acknowledgment; instead, it persists it periodically.
+
+The cursor metadata is recommended to be less than 5MB. If a subscription's `Individual Deleted Messages` data is too large to persist, then the longer the program runs, the more non-persisted data accumulates. Eventually, there will be an unacceptable amount of repeated message consumption when the Broker restarts.
+
+# Goal
+
+## In Scope
+
+To avoid repeated consumption caused by the cursor metadata being too large to persist.
+
+## Out of Scope
+
+This proposal does not address the following scenario: if there are so many producers that the metadata of the cursor `pulsar.dedup` cannot be persisted, the task `Take Deduplication Snapshot` will be in vain.
+
+# High-Level Design
+
+Cache the size of the cursor metadata in memory when persisting data to BK. We call the cached value `persistedCursorMetadataSizeInBytes`.
+
+Provide a new config named `maxUnPersistAckRecordsBytesPerSubscription`; stop delivering messages to clients if the size of the cursor metadata reaches the limit.
+
+Note:
+- Since we will not update `persistedCursorMetadataSizeInBytes` on every acknowledgment, `persistedCursorMetadataSizeInBytes` is not a real-time value.
+- Delayed messages will also not be redelivered after the limit is reached.


Review Comment:
Instead of introducing a new size-based control, we could add just one configuration that stops message delivery when the subscription reaches the persistence threshold of the acknowledgment state. For example:

```
dispatcherPauseOnAckPersistentStateEnabled=false
```

Today we have `managedLedgerMaxUnackedRangesToPersist`, but it is hard for end users to understand. We can try to add `managedLedgerMaxAckStateInBytesToPersist`. The newly added `dispatcherPauseOnAckPersistentStateEnabled=false` can work with both `managedLedgerMaxUnackedRangesToPersist` and `managedLedgerMaxAckStateInBytesToPersist`.

The main reason I raise this approach is that putting the settings together like the following is a little confusing:

```
managedLedgerMaxUnackedRangesToPersist
managedLedgerMaxAckStateInBytesToPersist
maxUnPersistAckRecordsBytesPerSubscription
```

But this one looks better:

```
dispatcherPauseOnAckPersistentStateEnabled
managedLedgerMaxUnackedRangesToPersist
managedLedgerMaxAckStateInBytesToPersist
```
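To make the suggested combination concrete, here is a minimal Python sketch of how a single pause flag might work alongside the two persistence limits. It is not Pulsar code. The class `AckStateLimits`, the functions `max_ranges_in_entry` and `should_pause_dispatch`, the default values, and the "pause when either limit is reached" rule are all assumptions made for illustration. Only the three setting names, the 5MB entry size, and the 32 bytes per range come from the discussion above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AckStateLimits:
    """Hypothetical holder for the three settings named in the review comment.

    The default values are illustrative assumptions, not confirmed broker defaults.
    """
    dispatcher_pause_on_ack_persistent_state_enabled: bool = False
    managed_ledger_max_unacked_ranges_to_persist: int = 10_000
    managed_ledger_max_ack_state_in_bytes_to_persist: int = 5 * 1024 * 1024


def max_ranges_in_entry(entry_bytes: int, range_bytes: int = 32) -> int:
    """Upper bound on uncompressed ranges that fit in one entry (the 5MB / 32 bytes estimate)."""
    return entry_bytes // range_bytes


def should_pause_dispatch(limits: AckStateLimits,
                          unacked_ranges: int,
                          ack_state_bytes: int) -> bool:
    """Return True if delivery should pause under the assumed rule.

    Assumed rule: the flag must be enabled, and the ack state must have reached
    at least one of the two persistence limits.
    """
    if not limits.dispatcher_pause_on_ack_persistent_state_enabled:
        return False
    return (
        unacked_ranges >= limits.managed_ledger_max_unacked_ranges_to_persist
        or ack_state_bytes >= limits.managed_ledger_max_ack_state_in_bytes_to_persist
    )
```

With the flag left at `false`, `should_pause_dispatch` never pauses, so existing behavior would be unchanged. Enabling it makes the two existing-style limits act as the pause thresholds, which is the grouping the comment argues is easier for end users to follow.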
