Hi Heesung, Thank you for opening this discussion.
IMO backward-compatibility should be kept at least across minor releases. Although the performance issue you mentioned (e.g. memory burst, high GC and OOM) looks problematic, backward-compatibility is also important. I think which has higher priority depends on the case. Your current change seems to remove the index-based aggregation completely. However I think we should keep room for choice. In order to allow users (here Pulsar server-side admins in particular) to choose the performance or backward-compatibility, how about introducing a "force" setting, e.g. "forceAggregatePublisherStatsByProducerName"? Those who place more importance on the performance than backward-compatibility can set this flag to true. Others, those who want to keep backward-compatibility, set this flag to false. By the way, I'm not sure, is the producer name generation logic already implemented in C++, Go and other clients? If not so, first we should implement it before switching the producer-name-based aggregation. Best Regards, Nozomi 2022年11月17日(木) 8:51 Heesung Sohn <heesung.s...@streamnative.io.invalid>: > Hi, > > To add more about the backward incompatibility issue > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>, > > Before fix: > % ./bin/pulsar-admin topics partitioned-stats > persistent://public/default/pt > ... > "publishers" : [ { > "msgRateIn" : 0.0, > "msgThroughputIn" : 0.0, > "averageMsgSize" : 0.0, > "chunkedMessageRate" : 0.0, > "producerId" : 0, > "supportsPartialProducer" : false > } ], > > After fix: > % ./bin/pulsar-admin topics partitioned-stats > persistent://public/default/pt > ... > "publishers" : [ { > "msgRateIn" : 0.0, > "msgThroughputIn" : 0.0, > "averageMsgSize" : 0.0, > "chunkedMessageRate" : 0.0, > "producerId" : 0, > "supportsPartialProducer" : true, > "producerName" : "standalone-0-1" > }, { > "msgRateIn" : 0.0, > "msgThroughputIn" : 0.0, > "averageMsgSize" : 0.0, > "chunkedMessageRate" : 0.0, > "producerId" : 0, > "supportsPartialProducer" : true, > "producerName" : "standalone-0-0" > } ], > ... > > > The broker side's producer name generation has been there since this PR > <https://github.com/apache/pulsar/pull/1178>(1.22-incubating). > ProducerName was automatically generated in this format > {clusterName}-{brokerInstanceId}-{producerNameGenerationCounter} until > 2.10. > So, a producer to a partitioned topic(2) results in the two producer names, > like the following. > standalone-0-0 > standalone-0-1 > > And since 2.10, by default, partitioned producers have the same producer > name between partitions by this PR > <https://github.com/apache/pulsar/pull/10279>(generated on the client > side). > standalone-0-0 > > Hence, the impacted versions(backward incompatibility issue > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>) by > the proposed fix(Option 1 below) are < 2.10. > > When we aggregate stats between partitions, the default( > aggregatePublisherStatsByProducerName=false) aggregates the producer stats > by the index of the producer stat list from each partition. So, when lucky, > it could output a single producer stat. However, this method can be buggy, > as each partition could return a different size and index of the > producer stat list. > > > To fix the original issue(described here > <https://github.com/apache/pulsar/pull/18254>), I think we have the > following options. > > Option 1(proposed): Deprecate aggregatePublisherStatsByProducerName (the > current PR here <https://github.com/apache/pulsar/pull/18254>) in the next > release and live with behavior change where we get `topics > partitioned-stats` per-producer-and-partition from old clients(Ver. <2.10), > instead of stats per-producer. > > Option 2: Defer the Option 1 fix and push it to the next major > version(3.0.0), as this is a breaking change. > > Option 3: Keep aggregatePublisherStatsByProducerName config but change the > default to aggregatePublisherStatsByProducerName=true. > > Option 4: As a long-term fix, create separate Admin-APIs for publisher and > subscription stats and drop their stats from `topics partitioned-stats` as > it is expensive to aggregate them on the fly. (for thousands of publishers > and subscriptions). Push this change to the next major version. > > Or other suggestions? > > Regards, > Heesung > > > > > > > On Mon, Nov 14, 2022 at 7:56 PM Heesung Sohn <heesung.s...@streamnative.io > > > wrote: > > > Dear Pulsar Community, > > > > We recently found a bug in `pulsar-admin topics partitioned-stats api` > that > > could incur a memory burst, high GC time, or OOM. > > > > For this issue, I proposed a fix > > <https://github.com/apache/pulsar/pull/18254> by deprecating the > aggregatePublisherStatsByProducerName > > config and always aggregating the publishers' stats by publisherName, > > instead of the list index(aggregatePublisherStatsByProducerName=false, > > default). > > > > > > - The index-based aggregation is inherently wrong in a highly > > concurrent producer environment(where the order and size of the > publisher > > stat list are not guaranteed to be the same). The publisher stats > need to > > be aggregated by a unique key, preferably the producer > > name(aggregatePublisherStatsByProducerName=true). > > > > > > However, this fix will break some of the old client's compatibility since > > the way Pulsar generates the producer name has changed over time, as > > described here > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1311168182>. > > > > As I replied here > > <https://github.com/apache/pulsar/pull/18254#issuecomment-1312084363>, > although > > it is not desirable, I think we could be lenient on this change in the > stat > > API response(assuming thispublishers'stat struct is used for human admins > > only for ad-hoc checks). > > > > Are we OK with this non-backward-compatible fix for some of the old > > clients? Or, do you have any other suggestions? > > > > One idea for a long-term fix could be: > > When there are thousands of producers(consumers) for a > > (partitioned-)topic, it is expensive to aggregate each > > publisher(subscriptions)'s stats on-the-fly across the brokers. > Alternatively, > > for the next major version, I think we could further define > > producers(subscriptions)' API like the below and drop the publishers and > > subscriptions structs from topics (partitioned-)stats returns. > > > > pulsar-admin publishers list my-topic --last-pagination-key xyz > > pulsar-admin publishers stats my-producer > > > > # similarly for subscriptions > > > > Regards, > > Heesung > > >