merlimat opened a new pull request #378:
URL: https://github.com/apache/pulsar-client-go/pull/378
### Motivation
The Go client can deadlock when it hits a write failure and tries to
handle it.
Go mutexes are not re-entrant, and we call connection.Close() while
already holding the connection mutex, so the goroutine blocks on itself.
```
goroutine 1077 [semacquire, 83 minutes]:
sync.runtime_SemacquireMutex(0xc00c31fb04, 0xc110a12000, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00c31fb00)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).Close(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:718 +0x547
github.com/apache/pulsar-client-go/pulsar.(*partitionProducer).ReceivedSendReceipt(0xc0033926e0, 0xc09ba0fe00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/producer_partition.go:475 +0x6f0
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).handleSendReceipt(0xc00c31fb00, 0xc09ba0fe00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:588 +0xee
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).internalReceivedCommand(0xc00c31fb00, 0xc00e40e8c0, 0x0, 0x0)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:507 +0x1ce
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).run(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:368 +0x2db
github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start.func1(0xc00c31fb00)
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:230 +0x71
created by github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start
	/go/pkg/mod/cd.splunkdev.com/streamlio/[email protected]/pulsar/internal/connection.go:226 +0x3f
```
### Modifications
We don't need to hold the connection lock while the producer is processing
the write failure. Releasing the lock earlier fixes the problem.
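
For illustration, here is a minimal sketch of the "release the lock before calling back" pattern. It is not the actual diff in this PR: `conn`, `receiptHandler`, `strictProducer`, and `handleReceipt` are hypothetical, simplified stand-ins loosely modeled on the call chain in the stack trace above (a receipt handler that ends up calling Close on the connection).

```go
package main

import (
	"fmt"
	"sync"
)

// receiptHandler is a hypothetical stand-in for a producer registered on a connection.
type receiptHandler interface {
	onReceipt(c *conn, sequenceID uint64)
}

// conn is a hypothetical, simplified connection; it is not the real client type.
type conn struct {
	mu       sync.Mutex
	closed   bool
	handlers map[uint64]receiptHandler
}

func newConn() *conn {
	return &conn{handlers: make(map[uint64]receiptHandler)}
}

func (c *conn) register(id uint64, h receiptHandler) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.handlers[id] = h
}

// Close acquires the mutex, so it must never be reached from code that
// already holds it: sync.Mutex is not re-entrant.
func (c *conn) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.closed = true
}

func (c *conn) isClosed() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.closed
}

// handleReceipt holds the lock only for the map lookup and releases it
// before invoking the handler, so the handler is free to call Close.
func (c *conn) handleReceipt(producerID, sequenceID uint64) {
	c.mu.Lock()
	h, ok := c.handlers[producerID]
	c.mu.Unlock() // released before the callback runs

	if ok {
		h.onReceipt(c, sequenceID)
	}
}

// strictProducer closes the connection when it sees an unexpected receipt.
type strictProducer struct{ expected uint64 }

func (p *strictProducer) onReceipt(c *conn, sequenceID uint64) {
	if sequenceID != p.expected {
		c.Close() // safe: the caller no longer holds c.mu
	}
}

func main() {
	c := newConn()
	c.register(1, &strictProducer{expected: 7})
	c.handleReceipt(1, 8) // unexpected receipt triggers Close
	fmt.Println("closed:", c.isClosed())
}
```

If `handleReceipt` instead used `defer c.mu.Unlock()` and called the handler while still holding the lock, the nested `Close` would block forever on the same mutex, which is the shape of the deadlock in the trace.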