fracasula edited a comment on issue #7682:
URL: https://github.com/apache/pulsar/issues/7682#issuecomment-681831805
@codelipenghui @gaoran10 We created a clone of the whole production
environment with the offloaded buckets too and, even with 2.6.1, we're
perfectly able to replicate. In fact the consumers are still getting stuck.
We're able to unlock them only by manually killing the pod on Kubernetes.
I deployed 3 consumers with a key shared subscription listening to our
`SpaceEvents` topic. Once all three are running I reset the consumers offset
for that subscription to by doing `bin/pulsar-admin topics reset-cursor
SpaceEvents -s cloud-pulsar-tester -t 99999w`. After the reset I can see that
the 3 consumers started reading messages. The `msgRateOut` showed in the
`bin/pulsar-admin topics stats SpaceEvents` also shows that all 3 are reading
messages.
At some point, when there are around 181k messages left in the backlog, the
consumers get stuck.
Golang µservices logs report warnings only, coming from the Pulsar library
itself (which uses logrus), they are not coming from our code:
```
{ remote_addr: { ForceQuery: false Host: "pulsar-proxy.pulsar:6650" RawPath:
"" User: null Fragment: "" Opaque: "" RawQuery: "" Path: "" Scheme: "pulsar"
RawFragment: "" } level: "warn" local_addr: { Port: 44814 IP: "10.56.0.18"
Zone: "" } }
{ remote_addr: { RawPath: "" Scheme: "pulsar" Path: "" Opaque: "" RawQuery:
"" Host: "pulsar-proxy.pulsar:6650" RawFragment: "" Fragment: "" User: null
ForceQuery: false } level: "warn" error: "write tcp
10.56.0.18:44814->10.0.1.68:6650: use of closed network connection" local_addr:
{ Port: 44814 Zone: "" IP: "10.56.0.18" } }
{ remote_addr: { RawQuery: "" Opaque: "" ForceQuery: false RawFragment: ""
Scheme: "pulsar" Fragment: "" RawPath: "" Path: "" Host:
"pulsar-proxy.pulsar:6650" User: null } local_addr: { Zone: "" Port: 42612 IP:
"10.56.0.21" } level: "warn" }
{ level: "warn" remote_addr: { User: null Path: "" Scheme: "pulsar"
RawQuery: "" Fragment: "" Opaque: "" Host: "pulsar-proxy.pulsar:6650"
ForceQuery: false RawPath: "" RawFragment: "" } local_addr: { Zone: "" IP:
"10.56.0.21" Port: 42612 } error: "write tcp 10.56.0.21:42612->10.0.1.68:6650:
use of closed network connection" }
{ local_addr: { Zone: "" Port: 51340 IP: "10.56.0.17" } remote_addr: { User:
null ForceQuery: false RawPath: "" RawFragment: "" RawQuery: "" Path: "" Host:
"pulsar-proxy.pulsar:6650" Opaque: "" Fragment: "" Scheme: "pulsar" } level:
"warn" }
```
Pulsar logs report the following (by `grep`ping by `exception`).
### Broker
```
java.lang.NullPointerException: null
org.apache.bookkeeper.mledger.ManagedLedgerException: Unknown exception
08:25:35.597 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers
- [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error
reading entries at 1430:0 : Unknown exception, Read Type Replay - Retrying to
read in 15.0 seconds
java.lang.NullPointerException: null
at
org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776)
~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
java.lang.NullPointerException: null
at
org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
java.lang.NullPointerException: null
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[public/default/persistent/SpaceEvents][cloud-pulsar-tester] read failed from
ledger at position:2193:0 : Unknown exception
08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers
- [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error
reading entries at 2193:0 : Unknown exception, Read Type Normal - Retrying to
read in 15.0 seconds
java.lang.NullPointerException: null
at
org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776)
~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
java.lang.NullPointerException: null
at
org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480)
~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
java.lang.NullPointerException: null
at
org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32)
~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
at
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
[org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
```
### Proxy
```
08:27:14.083 [pulsar-proxy-io-2-3] WARN
io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired,
and it reached at the tail of the pipeline. It usually means the last handler
in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed:
Connection reset by peer
08:27:14.090 [pulsar-proxy-io-2-3] WARN
org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.21:42612] Got
exception NativeIoException : readAddress(..) failed: Connection reset by peer
null
08:27:38.090 [pulsar-proxy-io-2-8] WARN
io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired,
and it reached at the tail of the pipeline. It usually means the last handler
in the pipeline did not handle the exception.
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed:
Connection reset by peer
08:27:38.090 [pulsar-proxy-io-2-8] WARN
org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.17:51340] Got
exception NativeIoException : readAddress(..) failed: Connection reset by peer
null
```
When inspecting the subscription, via the `pulsar-admin` cli tool, we see no
consumers at all despite having 3 services running (and using your Pulsar
Golang library - the native one, no cgo):
`$ bin/pulsar-admin topics stats SpaceEvents | jq
'.subscriptions["cloud-pulsar-tester"]' -c | jq`
```
{
"msgRateOut": 0,
"msgThroughputOut": 0,
"bytesOutCounter": 0,
"msgOutCounter": 0,
"msgRateRedeliver": 0,
"chuckedMessageRate": 0,
"msgBacklog": 181626,
"msgBacklogNoDelayed": 181626,
"blockedSubscriptionOnUnackedMsgs": false,
"msgDelayed": 0,
"unackedMessages": 0,
"type": "Key_Shared",
"msgRateExpired": 0,
"lastExpireTimestamp": 0,
"lastConsumedFlowTimestamp": 1598516857278,
"lastConsumedTimestamp": 0,
"lastAckedTimestamp": 0,
"consumers": [],
"isDurable": true,
"isReplicated": false
}
```
If we inspect the Golang µservices logs with a lower debug level we see logs
coming from the underlying Pulsar Golang library as shown below (please note
that the following logs are regularly printed also after the consumers get
stuck):
```
{"level":"debug","severity":"DEBUG","time":"2020-08-27T09:14:12Z","topic":"SpaceEvents"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]