2020-02-18 10:25:29 UTC - Ravi Shah: I am getting "Error: Failed to create producer: ProducerBlockedQuotaExceededError" while doing a load test of 1M messages. Any idea? @Sijie Guo ----
2020-02-18 13:47:48 UTC - Ravi Shah: Backlog quota exceeded. Cannot create producer [standalone-33-6] ----
2020-02-18 16:45:25 UTC - Sijie Guo: Which version are you using? The error indicates that you produced more messages than the backlog quota allows ----
2020-02-18 16:59:37 UTC - Atif: @Atif has joined the channel ----
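For context, the quota behind this error is configured per namespace. Below is a minimal sketch of inspecting and raising it with the Pulsar Java admin client; the admin URL, the `public/default` namespace, and the 10 GB limit are placeholder values, and the exact `BacklogQuota` constructor differs between Pulsar versions, so treat it as an illustration rather than the specific fix for this load test.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.BacklogQuota;
import org.apache.pulsar.common.policies.data.BacklogQuota.RetentionPolicy;

public class RaiseBacklogQuota {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL -- adjust to the cluster under test.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {

            String namespace = "public/default"; // placeholder namespace

            // Show the current quota: a size limit in bytes plus the policy that
            // decides what happens once a subscription's backlog exceeds it.
            System.out.println(admin.namespaces().getBacklogQuotaMap(namespace));

            // Raise the limit (here to a hypothetical 10 GB). Alternatively, switch
            // the policy to consumer_backlog_eviction so the broker drops the oldest
            // backlog instead of refusing new producers.
            admin.namespaces().setBacklogQuota(namespace,
                    new BacklogQuota(10L * 1024 * 1024 * 1024, RetentionPolicy.producer_request_hold));
        }
    }
}
```

For a sustained load test it is usually better to let consumers (or eviction) keep the backlog in check than to keep raising the limit.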
2020-02-19 03:39:06 UTC - Addison Higham: okay, so I think I found an issue we are seeing with proxies/brokers on k8s, where we end up with the proxy being out of sync and trying to send traffic to a broker that doesn't exist anymore. It goes like this:
1. We have some minor interruption with ZK
2. This issue causes one of the brokers to restart and get rescheduled to a new IP
3. Since the broker has a new IP, the proxy needs to refresh its cache of available brokers (via the watch in the `ZookeeperCacheLoader`)
4. Because ZK is still somewhat catching back up, the call to update the broker info here <https://github.com/apache/pulsar/blob/master/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/util/ZookeeperCacheLoader.java#L82> fails
5. That exception is caught and logged; however, if no other broker changes happen, the cache is never updated
6. The proxy keeps trying to send requests to the broker at its old IP ----
2020-02-19 03:39:48 UTC - Devin G. Bost: That sounds very similar to an issue we've seen with bare-metal Docker. ----
2020-02-19 03:40:12 UTC - Addison Higham: The patch here <https://github.com/apache/pulsar/pull/6347> may help, but I think there is still a fundamental problem with the `ZookeeperCacheLoader` (and at least how the proxy is using it) in that it never actually reads from the ZooKeeper cache ----
2020-02-19 03:41:21 UTC - Devin G. Bost: Did you run into the issue under heavy load? Or was it completely random? ----
2020-02-19 03:43:20 UTC - Addison Higham: There is a lot of code in place for caching ZK data; however, the `ZookeeperCacheLoader` doesn't use any of it and relies purely on watches to keep its state up to date. It seems like `getAvailableBrokers()` should instead just call the cache directly and get both the list of children and the load report, OR the `ZookeeperCacheLoader` should have another mechanism (instead of watches) to periodically refresh its cache of the brokers ----
2020-02-19 03:44:48 UTC - Addison Higham: @Devin G. Bost in this instance, not particularly ----
2020-02-19 03:45:16 UTC - Addison Higham: but we have seen some issues in the past where load has caused brokers to fall over and then we get out of sync, though likely also because of timeouts trying to reach ZK ----
2020-02-19 03:46:30 UTC - Devin G. Bost: That sounds familiar. ----
2020-02-19 03:46:46 UTC - Devin G. Bost: Once ZK is behind, it seems like everything falls apart. ----
2020-02-19 03:48:38 UTC - Devin G. Bost: It's pretty typical for distributed systems to depend heavily upon watches to get updates from ZK, but I wonder if the bigger issue is that it's easy to put heavy load on ZK, and once ZK gets behind, the cluster stops behaving properly. How much actual load would that caching remove from ZK? ----
2020-02-19 03:48:43 UTC - Devin G. Bost: If it's a lot, then it might be worth it. ----
2020-02-19 03:49:18 UTC - Devin G. Bost: Otherwise, I'm concerned that it might just increase the complexity, since then there must be a careful mechanism to invalidate the caches, or the same type of problem will remain. ----
2020-02-19 03:53:32 UTC - Addison Higham: I mean, it is just a cache invalidation bug. A watch triggers a cache invalidation that can fail, but there is no alternative mechanism to retry or ensure the cache is eventually correct ----
2020-02-19 03:53:47 UTC - Devin G. Bost: Oh, gotcha. ----
2020-02-19 03:54:04 UTC - Devin G. Bost: Yes, I agree that we've seen issues where it seems to get out of sync. ----
2020-02-19 03:54:36 UTC - Devin G. Bost: In some cases, we've deleted major sections of data from ZK to try to force an update, but it didn't always work. We only sort of knew what we were doing when we tried that… ----
2020-02-19 03:56:31 UTC - Addison Higham: just for the proxy? AFAICT, this seems to only be an issue in the proxy. There may be analogous bugs elsewhere; it seems like the broker lookup has been copy-pasted in a place or two, but with somewhat different details ----
2020-02-19 03:56:54 UTC - Devin G. Bost: I'm not sure about the proxy. ----
2020-02-19 03:57:20 UTC - Devin G. Bost: I actually don't think we were running a proxy when we had that happen… ----
2020-02-19 03:57:36 UTC - Devin G. Bost: So, perhaps we're talking about two different things. ----
2020-02-19 03:58:26 UTC - Devin G. Bost: I wonder if we could have a background process that would just periodically verify that the metadata in the caches all matches what's in ZK. ----
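A background reconciliation loop like the one described above is straightforward to sketch. The following is a minimal illustration, not Pulsar's actual `ZookeeperCacheLoader`: it uses the plain ZooKeeper client, a hypothetical `RefreshingBrokerCache` class, an assumed `/loadbalance/brokers` path, and an arbitrary refresh interval. The point is only that a scheduled re-read makes the cache converge even when a single watch-triggered reload fails (for example, while ZK is recovering).

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch of a broker-list cache that does not rely on watches alone:
 * a scheduled task re-reads the broker znodes, so a failed
 * watch-triggered refresh is eventually corrected.
 */
public class RefreshingBrokerCache implements AutoCloseable {

    // Assumed path; Pulsar brokers register themselves under a znode like this.
    private static final String BROKERS_PATH = "/loadbalance/brokers";

    private final ZooKeeper zk;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private volatile List<String> availableBrokers = Collections.emptyList();

    public RefreshingBrokerCache(ZooKeeper zk, long refreshIntervalSeconds) {
        this.zk = zk;
        // Populate once up front; watch-driven updates stay in place...
        refresh();
        // ...but a periodic refresh guarantees the cache converges even if
        // a single watch-triggered reload fails.
        scheduler.scheduleAtFixedRate(this::refresh, refreshIntervalSeconds,
                refreshIntervalSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            // Re-register the watch on every read and replace the cached list.
            List<String> children =
                    zk.getChildren(BROKERS_PATH, (WatchedEvent event) -> refresh());
            availableBrokers = Collections.unmodifiableList(children);
        } catch (KeeperException | InterruptedException e) {
            // Unlike a watch-only loader, a failure here is not fatal:
            // the scheduled task retries on the next tick.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public List<String> getAvailableBrokers() {
        return availableBrokers;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The trade-off is a little extra read load on ZK per refresh interval, in exchange for an upper bound on how long the proxy can keep routing to a broker that no longer exists.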
2020-02-19 04:00:05 UTC - Addison Higham: I think I saw a bug report you filed; do you have it handy? I'll take a look now that I'm on this thread ----
2020-02-19 04:01:55 UTC - Devin G. Bost: There are a couple. I'll grab them for you. ----
2020-02-19 04:02:32 UTC - Devin G. Bost: I'm not sure how closely related they are, though, since when we've had the issues, a lot of things all blew up at once like a chain reaction. It seemed like there may have been multiple issues involved. ----
2020-02-19 04:05:35 UTC - Addison Higham: one other datapoint: the load managers that also control getting the list of active brokers do just pull from the ZooKeeperCache object, so this likely isn't as big of a deal ----
2020-02-19 04:05:55 UTC - Devin G. Bost: <https://github.com/apache/pulsar/issues/5311> (old) <https://github.com/apache/pulsar/issues/6054> <https://github.com/apache/pulsar/issues/6251> (broad) ----
2020-02-19 04:06:57 UTC - Devin G. Bost: I should mention, however, that the first two issues in that list were primarily focused on a topic-freezing issue that @Penghui Li may have fixed recently. We've seen other strange issues that I haven't created issues for, since we weren't exactly sure how to reproduce them. ----
2020-02-19 04:08:17 UTC - Devin G. Bost: During one of the recent prod issues, I remember noticing that there were missing ZK entries… I probably should have created a Pulsar issue, but I didn't. I might be able to dig up the information from messages I have with other devs. ----
2020-02-19 04:15:53 UTC - Addison Higham: maybe I just saw a message here ----
2020-02-19 04:16:03 UTC - Devin G. Bost: As I'm looking through my chats, I'm remembering that geo-replication was enabled before the last issue with ZK missing data. ----
2020-02-19 04:17:09 UTC - Devin G. Bost: You might have seen a message in <#C5Z4T36F7|general> at some point in the past. A few months ago, we were dealing with prod issues several times a week, so we were pretty desperate to find a solution. What made the most difference was moving the ZK disks to fast SSD SAN instead of slow 7200 RPM disks. ----
2020-02-19 04:17:20 UTC - Devin G. Bost: After that, we stopped seeing most of the issues. ----
2020-02-19 08:51:56 UTC - Manuel Mueller: @Manuel Mueller has joined the channel ----