2020-02-18 10:25:29 UTC - Ravi Shah: I am getting "Error: Failed to create producer: ProducerBlockedQuotaExceededError" while doing a load test of 1M messages. Any idea? @Sijie Guo ----
2020-02-18 13:47:48 UTC - Ravi Shah: Backlog quota exceeded. Cannot create producer [standalone-33-6] ----
2020-02-18 16:45:25 UTC - Sijie Guo: Which version are you using? The error indicates that you produced more messages than the backlog quota allows ----
2020-02-18 16:59:37 UTC - Atif: @Atif has joined the channel ----
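For context, the quota behind this error is configured per namespace. Below is a minimal sketch of inspecting and raising it with the Pulsar Java admin client; the admin URL, the `public/default` namespace, and the 10 GB limit are placeholder values, and the exact `BacklogQuota` constructor differs between Pulsar versions, so treat it as an illustration rather than the specific fix for this load test.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BacklogQuota;
import org.apache.pulsar.common.policies.data.BacklogQuota.RetentionPolicy;

public class RaiseBacklogQuota {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL -- adjust to the cluster under test.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {

            String namespace = "public/default"; // placeholder namespace

            // Show the current quota: a size limit in bytes plus the policy that
            // decides what happens once a subscription's backlog exceeds it.
            System.out.println(admin.namespaces().getBacklogQuotaMap(namespace));

            // Raise the limit (here to a hypothetical 10 GB). Alternatively, switch
            // the policy to consumer_backlog_eviction so the broker drops the oldest
            // backlog instead of refusing new producers.
            admin.namespaces().setBacklogQuota(namespace,
                    new BacklogQuota(10L * 1024 * 1024 * 1024, RetentionPolicy.producer_request_hold));
        }
    }
}
```

For a sustained load test it is usually better to let consumers (or eviction) keep the backlog in check than to keep raising the limit.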
2020-02-19 03:39:06 UTC - Addison Higham: okay, so I think I found an issue we are seeing with proxies/brokers on k8s, where we end up with the proxy being out of sync and trying to send traffic to a broker that doesn't exist anymore. It goes like this:
1. We have some minor interruption with ZK
2. This issue causes one of the brokers to restart and get rescheduled to a new IP
3. Since the broker has a new IP, the proxy needs to refresh its cache of available brokers (via the watch in the `ZookeeperCacheLoader`)
4. Because ZK is still somewhat catching back up, the call to update the broker info here <https://github.com/apache/pulsar/blob/master/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/util/ZookeeperCacheLoader.java#L82> fails
5. That exception is caught and logged; however, if no other broker changes happen, the cache is never updated
6. The proxy keeps trying to send requests to the broker at its old IP ----
2020-02-19 03:39:48 UTC - Devin G. Bost: That sounds very similar to an issue we've seen with bare-metal Docker. ----
2020-02-19 03:40:12 UTC - Addison Higham: The patch here <https://github.com/apache/pulsar/pull/6347> may help, but I think there is still a fundamental problem with the `ZookeeperCacheLoader` (and at least how the proxy is using it) in that it never actually reads from the ZooKeeper cache ----
2020-02-19 03:41:21 UTC - Devin G. Bost: Did you run into the issue under heavy load? Or was it completely random? ----
2020-02-19 03:43:20 UTC - Addison Higham: There is a lot of code in place for caching ZK data; however, the `ZookeeperCacheLoader` doesn't use any of it and relies purely on watches to keep its state up to date. It seems like `getAvailableBrokers()` should instead just call the cache directly and get both the list of children and the load report, OR the `ZookeeperCacheLoader` should have another mechanism (instead of watches) to periodically refresh its cache of the brokers ----
2020-02-19 03:44:48 UTC - Addison Higham: @Devin G. Bost in this instance, not particularly ----
2020-02-19 03:45:16 UTC - Addison Higham: but we have seen some issues in the past where load has caused brokers to fall over and then we get out of sync, though likely also because of timeouts trying to reach ZK ----
2020-02-19 03:46:30 UTC - Devin G. Bost: That sounds familiar. ----
2020-02-19 03:46:46 UTC - Devin G. Bost: Once ZK is behind, it seems like everything falls apart. ----
2020-02-19 03:48:38 UTC - Devin G. Bost: It's pretty typical for distributed systems to depend heavily upon watches to get updates from ZK, but I wonder if the bigger issue is that it's easy to put heavy load on ZK, and once ZK gets behind, the cluster stops behaving properly. How much actual load would that caching remove from ZK? ----
2020-02-19 03:48:43 UTC - Devin G. Bost: If it's a lot, then it might be worth it. ----
2020-02-19 03:49:18 UTC - Devin G. Bost: Otherwise, I'm concerned that it might just increase the complexity, since then there must be a careful mechanism to invalidate the caches, or the same type of problem will remain. ----
2020-02-19 03:53:32 UTC - Addison Higham: I mean, it is just a cache invalidation bug. A watch triggers a cache invalidation that can fail, but there is no alternative mechanism to retry or ensure the cache is eventually correct ----
2020-02-19 03:53:47 UTC - Devin G. Bost: Oh, gotcha. ----
2020-02-19 03:54:04 UTC - Devin G. Bost: Yes, I agree that we've seen issues where it seems to get out of sync. ----
2020-02-19 03:54:36 UTC - Devin G. Bost: In some cases, we've deleted major sections of data from ZK to try to force an update, but it didn't always work. We only sort of knew what we were doing when we tried that… ----
2020-02-19 03:56:31 UTC - Addison Higham: just for the proxy? AFAICT, this seems to only be an issue in the proxy. There may be analogous bugs elsewhere; it seems like the broker lookup has been copy-pasted in a place or two, but with somewhat different details ----
2020-02-19 03:56:54 UTC - Devin G. Bost: I'm not sure about the proxy. ----
2020-02-19 03:57:20 UTC - Devin G. Bost: I actually don't think we were running a proxy when we had that happen… ----
2020-02-19 03:57:36 UTC - Devin G. Bost: So, perhaps we're talking about two different things. ----
2020-02-19 03:58:26 UTC - Devin G. Bost: I wonder if we could have a background process that would just periodically verify that the metadata in the caches all matches what's in ZK. ----
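A background reconciliation loop like the one described above is straightforward to sketch. The following is a minimal illustration, not Pulsar's actual `ZookeeperCacheLoader`: it uses the plain ZooKeeper client, a hypothetical `RefreshingBrokerCache` class, an assumed `/loadbalance/brokers` path, and an arbitrary refresh interval. The point is only that a scheduled re-read makes the cache converge even when a single watch-triggered reload fails (for example, while ZK is recovering).

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch of a broker-list cache that does not rely on watches alone:
 * a scheduled task re-reads the broker znodes, so a failed
 * watch-triggered refresh is eventually corrected.
 */
public class RefreshingBrokerCache implements AutoCloseable {

    // Assumed path; Pulsar brokers register themselves under a znode like this.
    private static final String BROKERS_PATH = "/loadbalance/brokers";

    private final ZooKeeper zk;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile List<String> availableBrokers = Collections.emptyList();

    public RefreshingBrokerCache(ZooKeeper zk, long refreshIntervalSeconds) {
        this.zk = zk;
        // Populate once up front; watch-driven updates stay in place...
        refresh();
        // ...but a periodic refresh guarantees the cache converges even if
        // a single watch-triggered reload fails.
        scheduler.scheduleAtFixedRate(this::refresh, refreshIntervalSeconds,
                refreshIntervalSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            // Re-register the watch on every read and replace the cached list.
            List<String> children =
                    zk.getChildren(BROKERS_PATH, (WatchedEvent event) -> refresh());
            availableBrokers = Collections.unmodifiableList(children);
        } catch (KeeperException | InterruptedException e) {
            // Unlike a watch-only loader, a failure here is not fatal:
            // the scheduled task retries on the next tick.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public List<String> getAvailableBrokers() {
        return availableBrokers;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The trade-off is a little extra read load on ZK per refresh interval, in exchange for an upper bound on how long the proxy can keep routing to a broker that no longer exists.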
2020-02-19 04:00:05 UTC - Addison Higham: I think I saw a bug report you filed; do you have it handy? I'll take a look now that I'm on this thread ----
2020-02-19 04:01:55 UTC - Devin G. Bost: There are a couple. I'll grab them for you. ----
2020-02-19 04:02:32 UTC - Devin G. Bost: I'm not sure how closely related they are, though, since when we've had the issues, a lot of things all blew up at once like a chain reaction. It seemed like there may have been multiple issues involved. ----
2020-02-19 04:05:35 UTC - Addison Higham: one other datapoint: the load managers that also control getting the list of active brokers do just pull from the ZooKeeperCache object, so this likely isn't as big of a deal ----
2020-02-19 04:05:55 UTC - Devin G. Bost: <https://github.com/apache/pulsar/issues/5311> (old) <https://github.com/apache/pulsar/issues/6054> <https://github.com/apache/pulsar/issues/6251> (broad) ----
2020-02-19 04:06:57 UTC - Devin G. Bost: I should mention, however, that the first two issues in that list were primarily focused on a topic-freezing issue that @Penghui Li may have fixed recently. We've seen other strange issues that I haven't created issues for, since we weren't exactly sure how to reproduce them. ----
2020-02-19 04:08:17 UTC - Devin G. Bost: During one of the recent prod issues, I remember noticing that there were missing ZK entries… I probably should have created a Pulsar issue, but I didn't. I might be able to dig up the information from messages I have with other devs. ----
2020-02-19 04:15:53 UTC - Addison Higham: maybe I just saw a message here ----
2020-02-19 04:16:03 UTC - Devin G. Bost: As I'm looking through my chats, I'm remembering that geo-replication was enabled before the last issue with ZK missing data. ----
2020-02-19 04:17:09 UTC - Devin G. Bost: You might have seen a message in <#C5Z4T36F7|general> at some point in the past. A few months ago, we were dealing with prod issues several times a week, so we were pretty desperate to find a solution. What made the most difference was moving the ZK disks to fast SSD SAN instead of slow 7200 RPM disks. ----
2020-02-19 04:17:20 UTC - Devin G. Bost: After that, we stopped seeing most of the issues. ----
2020-02-19 08:51:56 UTC - Manuel Mueller: @Manuel Mueller has joined the channel ----