f-ld commented on issue #5104: Handle already have connected replicate produce with same name
URL: https://github.com/apache/pulsar/issues/5104#issuecomment-542626701

I am facing the same issue.

I have 5 datacenters, each with its own Pulsar cluster. That makes 5 Pulsar clusters (named 'region1' ... 'region5'), with geo-replication enabled for some namespaces.

- In 3 datacenters (region2, region4, region5), everything looks OK: the other replication producers are connected just fine.
- But in 2 datacenters (region1, region3), I have logs like the following (these come from region3):

```
09:16:10.222 [pulsar-io-22-9] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418][persistent://tenant/namespace/topic-partition-10] Creating producer. producerId=11331
09:16:10.223 [ForkJoinPool.commonPool-worker-7] INFO org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418]-11331 persistent://tenant/namespace/topic-partition-10 configured with schema false
09:16:10.224 [ForkJoinPool.commonPool-worker-7] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418] Failed to add producer to topic persistent://tenant/namespace/topic-partition-10: Producer with name 'pulsar.repl.region2' is already connected to topic
```

and in the broker on the other side, in region2:

```
09:16:10.164 [pulsar-io-22-4] INFO org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Creating producer on cnx [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650]
09:16:10.285 [pulsar-io-22-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650] Received error from server: Producer with name 'pulsar.repl.region2' is already connected to topic
09:16:10.285 [pulsar-io-22-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Failed to create producer: Producer with name 'pulsar.repl.region2' is already connected to topic
```

Right now, replication is failing for that namespace in the following directions:

- region3 -> region1
- region1 -> region3
- region2 -> region3
- region5 -> region1
- region5 -> region3

All other combinations are OK.

Here is what I can see with `pulsar-admin topics partitioned-stats --per-partition tenant/namespace/topic`, for example for one partition in region1:

```
{
  ...
  "partitions": {
    "persistent://tenant/namespace/topic-partition-1" : {
      "replication" : {
        "region2" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.11.1.120:42292",
          "inboundConnectedSince" : "2019-10-15T08:56:40.286Z",
          "outboundConnection" : "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.798Z"
        },
        "region3" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.01666736530872926,
          "replicationBacklog" : 0,
          "connected" : false,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.12.3.213:43300",
          "inboundConnectedSince" : "2019-10-15T08:47:29.728Z"
        },
        "region4" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.13.3.30:45422",
          "inboundConnectedSince" : "2019-10-15T08:56:41.039Z",
          "outboundConnection" : "[id: 0x27db1c8d, L:/10.10.3.46:56448 - R:10.13.3.30/10.13.3.30:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:04.282Z"
        },
        "region5" : {
          "msgRateIn" : 0.0,
          "msgThroughputIn" : 0.0,
          "msgRateOut" : 0.0,
          "msgThroughputOut" : 0.0,
          "msgRateExpired" : 0.0,
          "replicationBacklog" : 0,
          "connected" : true,
          "replicationDelayInSeconds" : 0,
          "inboundConnection" : "/10.14.1.64:36956",
          "inboundConnectedSince" : "2019-10-15T08:17:24.712Z",
          "outboundConnection" : "[id: 0x674e53e5, L:/10.10.3.46:54772 - R:10.14.0.208/10.14.0.208:6650]",
          "outboundConnectedSince" : "2019-10-15T09:17:03.626Z"
        }
      },
      "deduplicationStatus" : "Disabled"
    }
  }
}
```

As we can see, there is no outbound connection for region3, which is why there is no `region1 -> region3` replication. That connection is being attempted but fails with the logs above.

There is also still an inbound connection registered for every region, which is why region3 and region5 cannot reopen theirs (region3 and region5 have no outbound connections for that topic).

Still digging, but I think there is a bug around here.
