f-ld commented on issue #5104: Handle already have connected replicate produce with same name
URL: https://github.com/apache/pulsar/issues/5104#issuecomment-542640572

Additional information to understand the above logs and explanations.

IPs per region:
- region1 : 10.10.0.0/16
- region2 : 10.11.0.0/16
- region3 : 10.12.0.0/16
- region4 : 10.13.0.0/16
- region5 : 10.14.0.0/16

We have 5 brokers per region, and 12 partitions for that topic.

Regarding the partitioned-stats from the previous message and the inbound connection from region3 to region1, I can see it on the broker in region1:
```
tcp 0 0 10.10.3.46:6650 10.12.3.213:43300 ESTABLISHED 11/java
```
and on the broker in region3:
```
tcp 0 0 10.12.3.213:43300 10.10.3.46:6650 ESTABLISHED 11/java
```
So that inbound connection is actually real.

For that specific partition, on that specific broker in region3, I have these logs:
```
10:21:58.694 [pulsar-io-22-13] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Creating producer on cnx [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650]
10:21:58.893 [pulsar-io-22-2] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x581a0460, L:/10.12.3.213:43308 - R:10.10.3.46/10.10.3.46:6650] Received error from server: Producer with name 'pulsar.repl.region3' is already connected to topic
10:21:58.893 [pulsar-io-22-2] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-1] [pulsar.repl.region3] Failed to create producer: Producer with name 'pulsar.repl.region3' is already connected to topic
```
And in region1 I have these logs:
```
10:21:58.785 [pulsar-io-22-14] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308][persistent://tenant/namespace/topic-partition-1] Creating producer. producerId=22291
10:21:58.785 [ForkJoinPool.commonPool-worker-4] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308]-22291 persistent://tenant/namespace/topic-partition-1 configured with schema false
10:21:58.785 [ForkJoinPool.commonPool-worker-4] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.12.3.213:43308] Failed to add producer to topic persistent://tenant/namespace/topic-partition-1: Producer with name 'pulsar.repl.region3' is already connected to topic
```
_(I picked these logs as an example because they involve the same brokers as the existing replication connection reported by partitioned-stats. Similar logs are available for all 12 partitions on all brokers of both regions.)_

Checking partitioned stats on region3 for that partition:
```
  "replication" : {
    "region1" : {
      "msgRateIn" : 0.0,
      "msgThroughputIn" : 0.0,
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateExpired" : 4.553733596940979E-5,
      "replicationBacklog" : 0,
      "connected" : false,
      "replicationDelayInSeconds" : 0,
      "inboundConnection" : "/10.10.0.23:57932",
      "inboundConnectedSince" : "2019-10-15T08:48:49.659Z"
    },
    "region2" : {
      "msgRateIn" : 0.0,
      "msgThroughputIn" : 0.0,
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateExpired" : 0.0,
      "replicationBacklog" : 0,
      "connected" : true,
      "replicationDelayInSeconds" : 0,
      "inboundConnection" : "/10.11.0.182:45338",
      "inboundConnectedSince" : "2019-10-15T08:28:14.815Z",
      "outboundConnection" : "[id: 0xba518cf3, L:/10.12.1.15:37322 - R:10.11.1.120/10.11.1.120:6650]",
      "outboundConnectedSince" : "2019-10-15T09:19:45.524Z"
    },
    "region4" : {
      "msgRateIn" : 0.0,
      "msgThroughputIn" : 0.0,
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateExpired" : 0.0,
      "replicationBacklog" : 0,
      "connected" : true,
      "replicationDelayInSeconds" : 0,
      "inboundConnection" : "/10.13.3.30:33686",
      "inboundConnectedSince" : "2019-10-15T08:56:40.66Z",
      "outboundConnection" : "[id: 0x0e8f64e2, L:/10.12.1.15:41864 - R:10.13.3.30/10.13.3.30:6650]",
      "outboundConnectedSince" : "2019-10-15T09:19:45.355Z"
    },
    "region5" : {
      "msgRateIn" : 0.0,
      "msgThroughputIn" : 0.0,
      "msgRateOut" : 0.0,
      "msgThroughputOut" : 0.0,
      "msgRateExpired" : 0.0,
      "replicationBacklog" : 0,
      "connected" : true,
      "replicationDelayInSeconds" : 0,
      "inboundConnection" : "/10.14.1.64:56096",
      "inboundConnectedSince" : "2019-10-15T08:18:27.448Z",
      "outboundConnection" : "[id: 0x885ed221, L:/10.12.1.15:53210 - R:10.14.0.208/10.14.0.208:6650]",
      "outboundConnectedSince" : "2019-10-15T09:19:45.754Z"
    },
  },
  "deduplicationStatus" : "Disabled"
}
```
There is indeed no outbound connection from region3 to region1.

So it looks as if that broker in region3 has lost track of the existing connection to the broker in region1 (indeed, it does not appear in the partitioned stats of region3): it tries to open the connection again but fails because the brokers in region1 still have it.

Unfortunately, I do not have historical logs to check whether at some point the broker in region3 tried to drop the connection to region1 but failed (keeping the TCP connection but not the information about that outbound connection).
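To repeat the check above across all 12 partitions without reading each stats dump by hand, here is a minimal Python sketch. It assumes the `replication` section of the stats has already been loaded into a dict (the fragment pasted above is not valid JSON on its own, so the sample is written out as a literal). The function names `region_of`, `peer_ip` and `peers_missing_outbound` are made up for illustration and are not part of Pulsar; only the field names `inboundConnection` and `outboundConnection` and the region CIDRs come from the data above.

```python
import ipaddress
from typing import Optional

# Region CIDRs as listed at the top of this comment.
REGION_NETWORKS = {
    "region1": ipaddress.ip_network("10.10.0.0/16"),
    "region2": ipaddress.ip_network("10.11.0.0/16"),
    "region3": ipaddress.ip_network("10.12.0.0/16"),
    "region4": ipaddress.ip_network("10.13.0.0/16"),
    "region5": ipaddress.ip_network("10.14.0.0/16"),
}


def region_of(ip: str) -> Optional[str]:
    """Map a broker IP to its region name, or None if it matches no CIDR."""
    addr = ipaddress.ip_address(ip)
    for name, network in REGION_NETWORKS.items():
        if addr in network:
            return name
    return None


def peer_ip(inbound_connection: str) -> str:
    """Extract the IP from an inboundConnection value such as '/10.10.0.23:57932'."""
    return inbound_connection.lstrip("/").rsplit(":", 1)[0]


def peers_missing_outbound(replication: dict) -> list:
    """Return remote clusters that have an inbound connection but no outbound one.

    This is the pattern seen for region1 in the region3 stats above.
    """
    suspects = []
    for cluster, stats in replication.items():
        if "inboundConnection" in stats and "outboundConnection" not in stats:
            suspects.append(cluster)
    return sorted(suspects)


# Trimmed copy of the region3 stats above (only the fields the check needs).
SAMPLE_REPLICATION = {
    "region1": {
        "connected": False,
        "inboundConnection": "/10.10.0.23:57932",
    },
    "region2": {
        "connected": True,
        "inboundConnection": "/10.11.0.182:45338",
        "outboundConnection": "[id: 0xba518cf3, L:/10.12.1.15:37322 - R:10.11.1.120/10.11.1.120:6650]",
    },
}
```

Calling `peers_missing_outbound(SAMPLE_REPLICATION)` flags `region1`, and `region_of(peer_ip("/10.10.0.23:57932"))` confirms that the inbound peer really belongs to region1. This only detects the symptom in the stats; it says nothing about why the broker lost track of the outbound connection.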
