sergey-safarov commented on issue #4790: URL: https://github.com/apache/couchdb/issues/4790#issuecomment-2380707745
We have caught the same issue on v3.3.3.

Also, on one CouchDB node I can see "Node not responding":

```
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.29864.2> -------- ** Node 'couc...@db0b.wv.example.com' not responding **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.29864.2> -------- ** Node 'couc...@db0b.wv.example.com' not responding **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: ** Removing (timedout) connection **
Sep 28 01:47:07 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.20680.5805> -------- 1 conflicted shard in cluster
Sep 28 01:47:08 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.4340.5788> -------- 1 conflicted shard in cluster
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- fabric_worker_timeout get_all_security,'couc...@db1a.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- fabric_worker_timeout get_all_security,'couc...@db0b.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.19100.5806> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11649.5811> -------- fabric_worker_timeout update_docs,'couc...@db0b.wv.example.com',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11649.5811> -------- fabric_worker_timeout update_docs,'couc...@db1a.wv.example.com',<<"shards/40000000-5fffffff/_global_changes.1660293400">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.30041.5810> -------- fabric_worker_timeout get_all_security,'couc...@db1a.wv.example.com',<<"shards/c0000000-dfffffff/account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409.1725148810">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.30041.5810> -------- Error checking security objects for account/9f/f9/d07f1389d0360b4cff90710cc1fe-202409 :: {error,timeout}
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.7850.5798> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.7850.5798> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/80000000-9fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.15964.5814> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.15964.5814> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/a0000000-bfffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.32403.5807> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.32403.5807> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.10430.5762> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.10430.5762> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11019.5802> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.11019.5802> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.28862.5796> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.28862.5796> -------- fabric_worker_timeout open_doc,'couc...@db0b.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
Sep 28 01:48:29 db1a.wv.example.com couchdb[3128]: couc...@db1a.wv.example.com <0.14078.5795> -------- fabric_worker_timeout open_doc,'couc...@db1a.wv.example.com',<<"shards/20000000-3fffffff/account/de/99/3bb20e6609f27b81bac3123eb833-202409.1725148811">>
```

But during troubleshooting, while the issue was present, I sent a `/_membership` curl request and the response showed all three nodes online in the cluster. I sent the request to each CouchDB node in the cluster, and every node returned the same result: three nodes online in the cluster.

On the other two nodes in the cluster, I can see error messages like "fabric_worker_timeout open_doc" and no messages like "Node not responding". Also, CPU load on the two nodes increased to 100%.

[Per-node images for **db0a**, **db0b**, and **db1a** were attached here.]

I am sure network connectivity is present between the CouchDB nodes. The `/_membership` response also showed all nodes online on all CouchDB instances.

But anyway, we will apply the recommended values and provide feedback if the issue is reproduced.

```
[cluster]
reconnect_interval_sec = 37

[fabric]
request_timeout = 60000
```
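
The `/_membership` check described above can be repeated against every node with a small script. This is only a minimal sketch, not how the check was actually run here: the helper names (`fetch_membership`, `missing_nodes`, `check_cluster`) are hypothetical, authentication is left out, and it assumes the response carries the documented `all_nodes` and `cluster_nodes` lists, treating any cluster member absent from `all_nodes` as not currently seen by the node that answered.

```python
import json
from urllib.request import urlopen


def fetch_membership(base_url, timeout=10):
    """GET <base_url>/_membership and return the decoded JSON body.

    Authentication is omitted; add credentials as your setup requires.
    """
    with urlopen(f"{base_url}/_membership", timeout=timeout) as resp:
        return json.load(resp)


def missing_nodes(membership):
    """Return cluster members that are absent from this node's all_nodes list."""
    seen = set(membership.get("all_nodes", []))
    return sorted(n for n in membership.get("cluster_nodes", []) if n not in seen)


def check_cluster(base_urls, fetch=fetch_membership):
    """Ask every node for its own view of the cluster.

    Returns {base_url: [missing node names]} or {base_url: "unreachable: ..."}.
    `fetch` is injectable so the logic can be exercised without a live cluster.
    """
    report = {}
    for url in base_urls:
        try:
            report[url] = missing_nodes(fetch(url))
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            report[url] = f"unreachable: {exc}"
    return report
```

Calling `check_cluster([...])` with the base URL of each node (placeholders, for example `http://<node-host>:5984`) returns one entry per node. An empty list for every node corresponds to the "all nodes online" result reported above; running it in a loop while the issue is present would show whether any node's view changes over time.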