bschofield edited a comment on issue #697:
URL: https://github.com/apache/pulsar-client-go/issues/697#issuecomment-1005484225
> Hello @bschofield Can you provide more log info for this, thanks.

It is hard for me to know exactly which log lines are relevant, since I have
a large number of pods running, each with many producers, but here is an
example that looks interesting. It may not be the root cause of the issue.
```
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.682869,"caller":"messaging/logging_zap.go:67","msg":"[Detected stale connection to broker]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829515,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6829886,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830094,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830208,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:59382 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830392,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830513,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.6830642,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:59382","error":"write tcp 10.244.1.173:59382->10.244.0.2:6650: use of closed network connection"}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830602,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830988,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in 108.668329ms]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.6830504,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831107,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in 107.590908ms]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.68313,"caller":"messaging/logging_zap.go:62","msg":"[runEventsLoop will reconnect in producer]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.6831586,"caller":"messaging/logging_zap.go:62","msg":"[Reconnecting to broker in 111.142612ms]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.7922926,"caller":"messaging/logging_zap.go:62","msg":"[Connecting to broker]","remote_addr":"pulsar://10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.7961068,"caller":"messaging/logging_zap.go:62","msg":"[TCP connection established]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.79917,"caller":"messaging/logging_zap.go:62","msg":"[Connection is ready]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250"}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.802493,"caller":"messaging/logging_zap.go:82","msg":"Resending [1780] pending batches","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025355,"caller":"messaging/logging_zap.go:82","msg":"Resending [680] pending batches","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.8025396,"caller":"messaging/logging_zap.go:82","msg":"Resending [1866] pending batches","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8}
2021-12-29 16:12:31 {"level":"info","ts":1640794351.8164892,"caller":"messaging/logging_zap.go:62","msg":"[Reconnected producer to broker]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8165426,"caller":"messaging/logging_zap.go:87","msg":"Received ack for [ledgerId:1021695 entryId:47743 739 711] on sequenceId %!v(MISSING) - expected: %!v(MISSING), closing connection","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166032,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-P2","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":1,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166149,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-W0","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":8,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166232,"caller":"messaging/logging_zap.go:67","msg":"[Connection was closed]","topic":"persistent://REDACTED-J1","producer_name":"REDACTED-76b5b9cbf9-6rs76","producerID":6,"cnx":"10.244.1.173:39250 -> 10.244.0.2:6650"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.8166335,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
2021-12-29 16:12:31 {"level":"warn","ts":1640794351.816643,"caller":"messaging/logging_zap.go:67","msg":"[Failed to write on connection]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.1.173:39250","error":"write tcp 10.244.1.173:39250->10.244.0.2:6650: use of closed network connection"}
```
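As an aside, the `%!v(MISSING)` fragments in that ack warning are what Go's
`fmt` package emits when a format string has more verbs than arguments, so the
sequence IDs that line was meant to log were never actually passed in. A
trivial illustration:

```go
package main

import "fmt"

func main() {
	// Two %v verbs but no trailing arguments: fmt substitutes
	// %!v(MISSING) for each verb it cannot satisfy.
	fmt.Printf("on sequenceId %v - expected: %v\n")
	// Prints: on sequenceId %!v(MISSING) - expected: %!v(MISSING)
}
```

So unfortunately the actual vs. expected sequence IDs are missing from the log.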
I also see some other errors, which may be unrelated problems, e.g.:
```
2021-12-29 16:13:02 {"level":"warn","ts":1640794382.4651952,"caller":"messaging/logging_zap.go:87","msg":"Received send error from server: [[PersistenceError org.apache.bookkeeper.mledger.ManagedLedgerException: Bookie operation timeout]] : [%!s(MISSING)]","remote_addr":"pulsar://10.244.0.2:6650","local_addr":"10.244.0.113:54102"}
```
> #689 does not seem to cause the problem described in this issue, because
> the reconnect gets stuck on the producer side

I think there was a little miscommunication here: I was trying to suggest
that you look at #691, not #689.
In PR #691 (https://github.com/apache/pulsar-client-go/pull/691/files), you
changed `pulsar/producer_partition.go` so that `p.reconnectToBroker()` is
called in a separate goroutine.

My (untested) hypothesis is that one of the other `p.*()` calls made from the
original goroutine might now run concurrently with `p.reconnectToBroker()`:
```go
for {
	select {
	case i := <-p.eventsChan:
		switch v := i.(type) {
		case *sendRequest:
			p.internalSend(v)
		case *flushRequest:
			p.internalFlush(v)
		case *closeProducer:
			p.internalClose(v)
			return
		}
	case <-p.batchFlushTicker.C:
		if p.batchBuilder.IsMultiBatches() {
			p.internalFlushCurrentBatches()
		} else {
			p.internalFlushCurrentBatch()
		}
	}
}
```
I was wondering if this could cause the freeze in `p.grabCnx()` that you are
seeing.
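To make the concern concrete, here is a minimal, untested sketch of the
pattern I am worried about. All the names (`producer`, `cnx`, the string
events) are invented for illustration; this is not the actual client code,
just the shape of the race. Running it with `go run -race` flags the
unsynchronized access:

```go
package main

import (
	"fmt"
	"time"
)

type producer struct {
	eventsChan chan string
	cnx        string // shared state: written during reconnect, read on send
}

func (p *producer) reconnectToBroker() {
	time.Sleep(50 * time.Millisecond) // stands in for a slow grabCnx()
	p.cnx = "new-connection"          // write from the reconnect goroutine
}

func (p *producer) internalSend(msg string) {
	fmt.Printf("sending %q on %s\n", msg, p.cnx) // read from the events loop
}

func (p *producer) runEventsLoop() {
	for ev := range p.eventsChan {
		if ev == "reconnect" {
			// As in PR #691: reconnect runs on its own goroutine, so it is
			// no longer serialized with the other handlers below.
			go p.reconnectToBroker()
			continue
		}
		p.internalSend(ev) // can now overlap with reconnectToBroker
	}
}

func main() {
	p := &producer{eventsChan: make(chan string, 2), cnx: "old-connection"}
	p.eventsChan <- "reconnect"
	p.eventsChan <- "hello" // handled while the reconnect goroutine is running
	close(p.eventsChan)
	p.runEventsLoop()
	time.Sleep(100 * time.Millisecond) // let the reconnect goroutine finish
}
```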
Again, I haven't tested this idea, but it seemed worth mentioning since I only
experienced a similar problem once PR #691 was applied. If this turns out to
be something unrelated, apologies!