f-ld commented on issue #5104: Handle already have connected replicate produce 
with same name
URL: https://github.com/apache/pulsar/issues/5104#issuecomment-542626701
 
 
   I am facing the same issue.
   
   I have 5 datacenters, each with its own Pulsar cluster (named 'region1' ... 
'region5'), with geo-replication enabled for some namespaces.
   - In 3 datacenters (region2, region4, region5), everything looks OK: the 
replication producers from the other clusters are connected just fine.
   - But in 2 datacenters (region1, region3), I see logs like the following 
(these come from region3):
   ```
   09:16:10.222 [pulsar-io-22-9] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418][persistent://tenant/namespace/topic-partition-10] Creating producer. producerId=11331
   09:16:10.223 [ForkJoinPool.commonPool-worker-7] INFO  org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418]-11331 persistent://tenant/namespace/topic-partition-10 configured with schema false
   09:16:10.224 [ForkJoinPool.commonPool-worker-7] ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.11.0.7:51418] Failed to add producer to topic persistent://tenant/namespace/topic-partition-10: Producer with name 'pulsar.repl.region2' is already connected to topic
   ```
   and in the broker on the other side, in region2:
   ```
   09:16:10.164 [pulsar-io-22-4] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Creating producer on cnx [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650]
   09:16:10.285 [pulsar-io-22-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x7cd1d935, L:/10.11.0.7:51418 - R:10.12.0.172/10.12.0.172:6650] Received error from server: Producer with name 'pulsar.repl.region2' is already connected to topic
   09:16:10.285 [pulsar-io-22-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://tenant/namespace/topic-partition-10] [pulsar.repl.region2] Failed to create producer: Producer with name 'pulsar.repl.region2' is already connected to topic
   ```
   
   Right now, replication for that namespace is failing in these directions:
   - region3 -> region1
   - region1 -> region3
   - region2 -> region3
   - region5 -> region1
   - region5 -> region3
   All other combinations are OK.
   
   Here is what `pulsar-admin topics partitioned-stats --per-partition 
tenant/namespace/topic` shows, for example for one partition in region1:
   ```
   {
      ...
     "partitions": {
       "persistent://tenant/namespace/topic-partition-1" : {
         "replication" : {
           "region2" : {
             "msgRateIn" : 0.0,
             "msgThroughputIn" : 0.0,
             "msgRateOut" : 0.0,
             "msgThroughputOut" : 0.0,
             "msgRateExpired" : 0.0,
             "replicationBacklog" : 0,
             "connected" : true,
             "replicationDelayInSeconds" : 0,
             "inboundConnection" : "/10.11.1.120:42292",
             "inboundConnectedSince" : "2019-10-15T08:56:40.286Z",
             "outboundConnection" : "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]",
             "outboundConnectedSince" : "2019-10-15T09:17:03.798Z"
           },
           "region3" : {
             "msgRateIn" : 0.0,
             "msgThroughputIn" : 0.0,
             "msgRateOut" : 0.0,
             "msgThroughputOut" : 0.0,
             "msgRateExpired" : 0.01666736530872926,
             "replicationBacklog" : 0,
             "connected" : false,
             "replicationDelayInSeconds" : 0,
             "inboundConnection" : "/10.12.3.213:43300",
             "inboundConnectedSince" : "2019-10-15T08:47:29.728Z"
           },
           "region4" : {
             "msgRateIn" : 0.0,
             "msgThroughputIn" : 0.0,
             "msgRateOut" : 0.0,
             "msgThroughputOut" : 0.0,
             "msgRateExpired" : 0.0,
             "replicationBacklog" : 0,
             "connected" : true,
             "replicationDelayInSeconds" : 0,
             "inboundConnection" : "/10.13.3.30:45422",
             "inboundConnectedSince" : "2019-10-15T08:56:41.039Z",
             "outboundConnection" : "[id: 0x27db1c8d, L:/10.10.3.46:56448 - R:10.13.3.30/10.13.3.30:6650]",
             "outboundConnectedSince" : "2019-10-15T09:17:04.282Z"
           },
           "region5" : {
             "msgRateIn" : 0.0,
             "msgThroughputIn" : 0.0,
             "msgRateOut" : 0.0,
             "msgThroughputOut" : 0.0,
             "msgRateExpired" : 0.0,
             "replicationBacklog" : 0,
             "connected" : true,
             "replicationDelayInSeconds" : 0,
             "inboundConnection" : "/10.14.1.64:36956",
             "inboundConnectedSince" : "2019-10-15T08:17:24.712Z",
             "outboundConnection" : "[id: 0x674e53e5, L:/10.10.3.46:54772 - R:10.14.0.208/10.14.0.208:6650]",
             "outboundConnectedSince" : "2019-10-15T09:17:03.626Z"
           }
         },
         "deduplicationStatus" : "Disabled"
       }
     }
   }
   ```
   As we can see, region1 has no outbound connection to region3, which is why 
there is no replication `region1 -> region3`. That connection is being 
attempted, but it fails with the logs above. The stats also still list an 
inbound connection for every region, which is why region3 and region5 cannot 
reopen theirs: the broker treats their replication producers as already 
connected, even though region3 and region5 have no outbound connection for 
that topic.
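
To run the same check across every partition instead of reading the JSON by eye, something like the following sketch could help. It assumes the stats output has been loaded as valid JSON, and it relies only on field names visible in the output above (`partitions`, `replication`, `connected`, `outboundConnection`). The function name `peers_without_outbound` and the trimmed `SAMPLE` are mine, for illustration only.

```python
import json


def peers_without_outbound(stats):
    """Map each partition to the remote clusters it has no working outbound replicator for."""
    broken = {}
    for partition, pstats in stats.get("partitions", {}).items():
        missing = [
            cluster
            for cluster, repl in pstats.get("replication", {}).items()
            # Flag a peer if it is reported as not connected
            # or if the outbound connection field is absent.
            if not repl.get("connected", False) or "outboundConnection" not in repl
        ]
        if missing:
            broken[partition] = sorted(missing)
    return broken


# Trimmed copy of the region1 stats above: only the fields the check reads.
SAMPLE = json.loads("""
{
  "partitions": {
    "persistent://tenant/namespace/topic-partition-1": {
      "replication": {
        "region2": {
          "connected": true,
          "inboundConnection": "/10.11.1.120:42292",
          "outboundConnection": "[id: 0x680b115b, L:/10.10.3.46:39946 - R:10.11.1.120/10.11.1.120:6650]"
        },
        "region3": {
          "connected": false,
          "inboundConnection": "/10.12.3.213:43300"
        }
      }
    }
  }
}
""")

if __name__ == "__main__":
    for partition, clusters in peers_without_outbound(SAMPLE).items():
        print(partition, "->", ", ".join(clusters))
```

On the trimmed sample this flags only region3 for `topic-partition-1`, which matches what the full output shows for region1. To use it on real data, replace `SAMPLE` with the parsed output of the `partitioned-stats` command run on each cluster.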
   
   Still digging, but I think there is a bug around here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
