2020-02-18 10:25:29 UTC - Ravi Shah: I am getting Error: Failed to create 
producer: ProducerBlockedQuotaExceededError while doing a load test of 1M 
messages. Any idea? @Sijie Guo
----
2020-02-18 13:47:48 UTC - Ravi Shah: Backlog quota exceeded. Cannot create 
producer [standalone-33-6]
----
2020-02-18 16:45:25 UTC - Sijie Guo: Which version are you using? The error 
indicates that you produced more messages than the backlog quota allows.
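If the backlog is expected (for example nothing is consuming it during the load 
test), you can either let consumers drain it or raise the quota on the 
namespace. Roughly like this with the Java admin client (the URL, namespace, 
limit, and policy below are just placeholder choices, not your settings):
```
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BacklogQuota;

public class RaiseBacklogQuota {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL and namespace; adjust for your cluster
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // 10 GiB backlog limit; the policy controls what happens once the
            // quota is exceeded (producer_request_hold, producer_exception,
            // or consumer_backlog_eviction)
            admin.namespaces().setBacklogQuota("public/default",
                    new BacklogQuota(10L * 1024 * 1024 * 1024,
                            BacklogQuota.RetentionPolicy.producer_exception));
        }
    }
}
```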
----
2020-02-18 16:59:37 UTC - Atif: @Atif has joined the channel
----
2020-02-19 03:39:06 UTC - Addison Higham: okay, so I think I found an issue we 
are seeing with proxies/brokers on k8s, where we end up with the proxy being 
out of sync and trying to send traffic to a broker that doesn't exist anymore. 
It goes like this:
1. We have some minor interruption with ZK
2. This issue causes one of the brokers to restart and get rescheduled to a new 
IP
3. Since the broker has a new IP, the proxy needs to refresh its cache of 
available brokers (via the watch in the `ZookeeperCacheLoader`)
4. Because ZK is still somewhat catching back up, the call to update the broker 
info here 
<https://github.com/apache/pulsar/blob/master/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/util/ZookeeperCacheLoader.java#L82>
 fails
5. That exception is caught and logged; however, if no other broker changes 
happen, the cache is never updated
6. The proxy keeps trying to send requests to the broker at its old IP
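To make this concrete, here is a stripped-down sketch of the pattern in steps 
3-6 (not the actual Pulsar code; the class and names are made up for 
illustration):
```
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Stripped-down illustration of the failure mode; not the real ZookeeperCacheLoader.
class WatchOnlyBrokerCache {
    private final Supplier<List<String>> zkLookup;   // stand-in for the ZK read
    private volatile List<String> availableBrokers = Collections.emptyList();

    WatchOnlyBrokerCache(Supplier<List<String>> zkLookup) {
        this.zkLookup = zkLookup;
    }

    // Called only when a ZK watch fires. If the read fails here (e.g. ZK is
    // still catching up), the exception is logged and swallowed; unless some
    // other broker change fires the watch again, the stale list lives forever.
    void onBrokerChangeWatch() {
        try {
            availableBrokers = zkLookup.get();
        } catch (Exception e) {
            System.err.println("Error updating broker list: " + e);
        }
    }

    List<String> getAvailableBrokers() {
        return availableBrokers;   // may still point at brokers that are gone
    }
}
```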
----
2020-02-19 03:39:48 UTC - Devin G. Bost: That sounds very similar to an issue 
we’ve seen with bare-metal Docker.
----
2020-02-19 03:40:12 UTC - Addison Higham: The patch here 
<https://github.com/apache/pulsar/pull/6347> may help, but I think there is 
still a fundamental problem with the `ZookeeperCacheLoader` (or at least with 
how the proxy is using it), in that it never actually reads from the ZooKeeper 
cache
----
2020-02-19 03:41:21 UTC - Devin G. Bost: Did you run into the issue when under 
heavy load? Or, was it completely random
----
2020-02-19 03:41:22 UTC - Devin G. Bost: ?
----
2020-02-19 03:43:20 UTC - Addison Higham: There is a lot of code in place that 
can cache ZK data; however, the `ZookeeperCacheLoader` doesn't use any of it 
and relies purely on watches to keep its state up to date. It seems like 
`getAvailableBrokers()` should instead just call the cache directly and get 
both the list of children and the loadReport, OR the `ZookeeperCacheLoader` 
should have another mechanism (instead of watches) to periodically refresh its 
cache of the brokers
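For the second option, a rough sketch of what a periodic refresh might look 
like (illustrative only, not real Pulsar code; all the names are made up). A 
refresh every few minutes adds only one ZK read per interval, so the extra 
load should be negligible:
```
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Keep the watch for fast updates, but also re-read on a timer so a single
// failed watch callback can't leave the broker list stale forever.
class PeriodicallyRefreshedBrokerCache {
    private final Supplier<List<String>> zkLookup;   // stand-in for reading ZK
    private volatile List<String> availableBrokers = Collections.emptyList();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicallyRefreshedBrokerCache(Supplier<List<String>> zkLookup, long refreshSeconds) {
        this.zkLookup = zkLookup;
        scheduler.scheduleWithFixedDelay(this::refresh, refreshSeconds,
                refreshSeconds, TimeUnit.SECONDS);
    }

    // Also called from the ZK watch; a failure here is tolerable because the
    // scheduled refresh will simply try again shortly.
    void refresh() {
        try {
            availableBrokers = zkLookup.get();
        } catch (Exception e) {
            System.err.println("Broker list refresh failed, will retry: " + e);
        }
    }

    List<String> getAvailableBrokers() {
        return availableBrokers;
    }
}
```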
----
2020-02-19 03:44:48 UTC - Addison Higham: @Devin G. Bost in this instance, not 
particularly
----
2020-02-19 03:45:16 UTC - Addison Higham: but we have seen some issues in the 
past where load has caused brokers to fall over and then we get out of sync, 
but likely also because of timeouts trying to get to ZK
----
2020-02-19 03:46:30 UTC - Devin G. Bost: That sounds familiar.
----
2020-02-19 03:46:46 UTC - Devin G. Bost: Once ZK is behind, it seems like 
everything falls apart.
----
2020-02-19 03:48:38 UTC - Devin G. Bost: It’s pretty typical for distributed 
systems to heavily depend upon the watches to get updates from ZK, but I wonder 
if the bigger issue is that it’s easy to put heavy load on ZK, and once ZK gets 
behind, the cluster stops behaving properly.
How much actual load would that caching remove from ZK?
----
2020-02-19 03:48:43 UTC - Devin G. Bost: If it’s a lot, then it might be worth 
it.
----
2020-02-19 03:49:18 UTC - Devin G. Bost: Otherwise, I’m concerned that it might 
just increase the complexity since then there must be a careful mechanism to 
invalidate the caches, or the same type of problem will remain.
----
2020-02-19 03:53:32 UTC - Addison Higham: I mean, it is just a cache 
invalidation bug. A watch triggers a cache invalidation that can fail, but 
there is no alternative mechanism or retry to ensure the cache eventually 
becomes correct
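Something along these lines would cover it (sketch only; `refreshFromZk()` is 
a placeholder, not a real Pulsar method):
```
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// When a watch-triggered refresh fails, schedule another attempt instead of
// waiting for the next broker change to fire the watch again.
class RetryingInvalidation {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void onWatchTriggered() {
        try {
            refreshFromZk();
        } catch (Exception e) {
            // Don't just log and hope: retry in a few seconds so the cache is
            // eventually correct even if no further watch ever fires.
            scheduler.schedule(this::onWatchTriggered, 5, TimeUnit.SECONDS);
        }
    }

    private void refreshFromZk() throws Exception {
        // placeholder: re-read the broker list / load reports from ZooKeeper
    }
}
```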
----
2020-02-19 03:53:47 UTC - Devin G. Bost: Oh, gotcha.
----
2020-02-19 03:54:04 UTC - Devin G. Bost: Yes, I agree that we’ve seen issues 
where it seems to get out of sync.
----
2020-02-19 03:54:36 UTC - Devin G. Bost: In some cases, we’ve deleted major 
sections of data from ZK to try to force an update, but it didn’t always work. 
We only sort of knew what we were doing when we tried that…
----
2020-02-19 03:56:31 UTC - Addison Higham: Just for the proxy? AFAICT, this 
seems to only be an issue in the proxy. There may be analogous bugs elsewhere; 
it seems like the broker lookup has been copy-pasted in a place or two, but 
with somewhat different details
----
2020-02-19 03:56:54 UTC - Devin G. Bost: I’m not sure about the proxy.
----
2020-02-19 03:57:20 UTC - Devin G. Bost: I actually don’t think we were running 
a proxy when we had that happen…
----
2020-02-19 03:57:36 UTC - Devin G. Bost: So, perhaps we’re talking about two 
different things.
----
2020-02-19 03:58:26 UTC - Devin G. Bost: I wonder if we could have a background 
process that would just periodically verify that metadata in the caches all 
matches what’s in ZK.
----
2020-02-19 04:00:05 UTC - Addison Higham: I think I saw a bug report you filed, 
do you have it handy? I'll take a look now that I am on this thread
----
2020-02-19 04:01:55 UTC - Devin G. Bost: There are a couple. I’ll grab them for 
you.
----
2020-02-19 04:02:32 UTC - Devin G. Bost: I’m not sure how closely related they 
are though since when we’ve had the issues, we had a lot of things all blow up 
at once like a chain reaction. It seemed like there may have been multiple 
issues involved.
----
2020-02-19 04:05:35 UTC - Addison Higham: One other data point: the load 
managers that also control getting the list of active brokers do just pull 
from the zookeeperCache object, so this seems likely to not be as big of a 
deal there
----
2020-02-19 04:05:55 UTC - Devin G. Bost: 
<https://github.com/apache/pulsar/issues/5311> (old)
<https://github.com/apache/pulsar/issues/6054>
<https://github.com/apache/pulsar/issues/6251> (broad)
----
2020-02-19 04:06:57 UTC - Devin G. Bost: I should mention, however, that the 
first two issues in that list were primarily focused on a topic-freezing issue 
that @Penghui Li may have fixed recently.
We’ve seen other strange issues that I haven’t created issues for since we 
weren’t exactly sure how to reproduce them.
----
2020-02-19 04:08:17 UTC - Devin G. Bost: During one of the recent prod issues, 
I remember noticing that there were missing ZK entries… I probably should have 
created a Pulsar issue, but I didn’t.
I might be able to dig up the information from messages I have with other devs.
----
2020-02-19 04:15:53 UTC - Addison Higham: Maybe I just saw a message here
----
2020-02-19 04:16:03 UTC - Devin G. Bost: As I’m looking through my chats, I’m 
remembering that geo-replication was enabled before the last issue with ZK 
missing data.
----
2020-02-19 04:17:09 UTC - Devin G. Bost: You might have seen a message in 
<#C5Z4T36F7|general> at some point in the past. A few months ago, we were 
dealing with prod issues several times a week, so we were pretty desperate to 
find a solution. What made the most difference was optimizing ZK disks to be on 
fast SSD SAN instead of slow 7200 RPM disks.
----
2020-02-19 04:17:20 UTC - Devin G. Bost: After that, we stopped seeing most of 
the issues.
----
2020-02-19 08:51:56 UTC - Manuel Mueller: @Manuel Mueller has joined the channel
----
