sergey-safarov commented on issue #4790:
URL: https://github.com/apache/couchdb/issues/4790#issuecomment-2380707745

   We have caught the same issue on v3.3.3.
   Also, on one CouchDB node I can see "Node not responding":
   ```
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.29864.2> -------- ** Node 'couc...@db0b.wv.example.com' not responding **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.29864.2> -------- ** Node 'couc...@db0b.wv.example.com' not responding **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
   Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.20680.5805> -------- 1 conflicted shard in cluster
   Sep 28 01:47:08 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.4340.5788> -------- 1 conflicted shard in cluster
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- fabric_worker_timeout get_all_security,'couc...@db1a.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- fabric_worker_timeout get_all_security,'couc...@db0b.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11649.5811> -------- fabric_worker_timeout update_docs,'couc...@db0b.wv.example.com',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11649.5811> -------- fabric_worker_timeout update_docs,'couc...@db1a.wv.example.com',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.30041.5810> -------- fabric_worker_timeout get_all_security,'couc...@db1a.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.30041.5810> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.7850.5798> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.7850.5798> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.15964.5814> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.15964.5814> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.32403.5807> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.32403.5807> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.10430.5762> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.10430.5762> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11019.5802> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11019.5802> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.28862.5796> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.28862.5796> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.14078.5795> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
   ```
   
   But during troubleshooting, while the issue was present, I sent a `/_membership` request with curl to each CouchDB node in the cluster. Every node returned the same result: all three nodes online in the cluster.
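
For anyone who wants to repeat this check, here is a minimal Python sketch. It assumes the `/_membership` response carries `all_nodes` (nodes the queried node currently sees) and `cluster_nodes` (nodes it is configured to expect). The helper names and the example URL are made up for illustration, and authentication is left out.

```python
import json
import urllib.request


def missing_nodes(membership):
    """Return configured nodes that the queried node does not currently see."""
    seen = set(membership.get("all_nodes", []))
    expected = set(membership.get("cluster_nodes", []))
    return sorted(expected - seen)


def fetch_membership(base_url, timeout=5):
    """GET <base_url>/_membership and decode the JSON body (no auth handling)."""
    with urllib.request.urlopen(f"{base_url}/_membership", timeout=timeout) as resp:
        return json.load(resp)


def check_cluster(base_urls, fetch=fetch_membership):
    """Map each node URL to the list of nodes that node reports as missing."""
    return {url: missing_nodes(fetch(url)) for url in base_urls}


# Example call (placeholder URL, port 5984 assumed):
# check_cluster(["http://db1a.wv.example.com:5984"])
```

An empty list for every node corresponds to the "three nodes online" result described above. Passing `fetch` as a parameter makes it easy to try the comparison on saved responses without touching the cluster.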
   
   On the other two nodes in the cluster, I can see error messages like 
"fabric_worker_timeout open_doc" and no messages like "Node not responding".
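
To compare nodes more easily, the timeouts can be counted per operation and target node. This is only a rough sketch: the regular expression is based on the log lines quoted above, and the function name is my own.

```python
import re
from collections import Counter

# Matches e.g.: fabric_worker_timeout open_doc,'node@host',<<"shards/...">>
# \s+ also tolerates a line break between the two words in wrapped excerpts.
TIMEOUT_RE = re.compile(r"fabric_worker_timeout\s+(\w+),'([^']+)'")


def summarize_timeouts(log_text):
    """Count fabric_worker_timeout entries per (operation, node) pair."""
    return Counter(TIMEOUT_RE.findall(log_text))
```

Calling `summarize_timeouts(open("couchdb.log").read()).most_common()` (the file name is a placeholder) lists the most frequent operation and node pairs first.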
   
   Also, on the two nodes CPU load increased to 100%.
   **db0a**
   
![image](https://github.com/user-attachments/assets/4c285c4e-75ee-45c4-8571-fe6756dd621e)
   **db0b**
   
![image](https://github.com/user-attachments/assets/8d7ea6e1-0219-4b66-96bb-487640c5ba9c)
   **db1a**
   
![image](https://github.com/user-attachments/assets/7643ac29-2fe9-4e32-bb06-dcbf7cd6641f)
   
   I am sure network connectivity is present between the CouchDB nodes. Also, the `/_membership` response showed all nodes online on every CouchDB instance.
   Even so, we will apply the recommended values and provide feedback if the issue reproduces.
   ```
   [cluster]
   reconnect_interval_sec = 37
   [fabric]
   request_timeout = 60000
   ```
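
As an alternative to editing the ini file, these values could be sent through the node configuration HTTP endpoint. The sketch below only builds the requests. It assumes `PUT /_node/_local/_config/{section}/{key}` with a JSON string body; the function name and base URL are illustrative, and admin credentials are not handled.

```python
import json
import urllib.request

# The two values from the snippet above; config values are sent as strings.
SETTINGS = {
    ("cluster", "reconnect_interval_sec"): "37",
    ("fabric", "request_timeout"): "60000",
}


def build_config_requests(base_url, settings=SETTINGS):
    """Build one PUT request per setting without sending anything."""
    requests = []
    for (section, key), value in settings.items():
        req = urllib.request.Request(
            f"{base_url}/_node/_local/_config/{section}/{key}",
            data=json.dumps(value).encode("utf-8"),  # body is a JSON string
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        requests.append(req)
    return requests
```

Each request would then be sent with `urllib.request.urlopen(req)` against every node, since this endpoint configures one node at a time.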
   

