nickva commented on PR #5193:
URL: https://github.com/apache/couchdb/pull/5193#issuecomment-2304903172

   It is a bit cute to use replicate-to-self as a fallback. I was mainly 
relying on the existing "replicate to self" case being a no-op in 
   
https://github.com/apache/couchdb/blob/637fb79f56af6371f4522eb34467601be726c94c/src/mem3/src/mem3_sync.erl#L79-L82
 With `nonode` we would have had to add two special cases, one for self and 
one for `nonode`, in various places.
   
   I built a small test module to play with the previous logic, wondering 
the same thing: why didn't we see this before?
   
   ```erlang
   -module(nn).
   
   -export([
       find_next_node/3
   ]).
   
   find_next_node(Self, LiveNodes, Mem3Nodes) ->
       AllNodes0 = lists:sort(Mem3Nodes),
       AllNodes1 = [X || X <- AllNodes0, lists:member(X, LiveNodes)],
       AllNodes = AllNodes1 ++ [hd(AllNodes1)],
       [_Self, Next | _] = lists:dropwhile(fun(N) -> N =/= Self end, AllNodes),
       Next.
   ```
   
   ```erlang
   > c(nn).
   
   > nn:find_next_node(n, [a,n], [a]).
   ** exception error: no match of right hand side value []
        in function  nn:find_next_node/3 (nn.erl, line 11)
   ```
   
   So one case where we'd trigger this is if the node we're on removes itself 
from the nodes list. The user then reported the logs filling up and the 
machine being "frozen". It must have happened like this: the initial sync 
started while the node was still in the `mem3:nodes()` list, then the node 
was removed, and `initial_sync` crashed with the error below.
   
   ```
   [error] 2024-08-21T19:02:06.680089Z [email protected] emulator -------- 
Error in process <0.30202.42> on node '[email protected]' with exit value:
   
{{badmatch,[]},[{mem3_sync,find_next_node,0,[{file,"src/mem3_sync.erl"},{line,309}]},{mem3_sync,sync_nodes_and_dbs,0,[{file,"src/mem3_sync.erl"},{line,265}]},{mem3_sync,initial_sync,1,[{file,"src/mem3_sync.erl"},{line,272}]}]}
   ```
   On a crash we end up restarting it 
https://github.com/apache/couchdb/blob/d38f14f7d777b7cda79b9862ee304150ad3418ea/src/mem3/src/mem3_sync_nodes.erl#L75-L78
 and from then on it will just keep crashing. So I opted to fold both the self 
case and "odd" cases like this one into the already existing no-op case, so 
that a crash cycle for this reason can't happen.
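   
   To make the idea concrete, here is a rough sketch of how the test module 
above could fold those cases into "return self". This is not the actual patch: 
the module name `nn_safe` and the exact clause layout are only for 
illustration, and it assumes the caller already treats a result equal to 
`Self` as a no-op.
   
   ```erlang
   -module(nn_safe).
   
   -export([
       find_next_node/3
   ]).
   
   %% Hypothetical variant of the nn test module, not the code from the PR.
   %% Returns Self whenever there is no well-defined successor, so the
   %% caller can treat the result as the "replicate to self" no-op.
   find_next_node(Self, LiveNodes, Mem3Nodes) ->
       Sorted = lists:sort(Mem3Nodes),
       Live = [N || N <- Sorted, lists:member(N, LiveNodes)],
       case lists:dropwhile(fun(N) -> N =/= Self end, Live) of
           %% Self has a successor in the sorted ring.
           [Self, Next | _] -> Next;
           %% Self is the last (or only) node: wrap around to the first.
           [Self] -> hd(Live);
           %% Self is not in the list, or the list is empty: use Self.
           [] -> Self
       end.
   ```
   
   With this version `nn_safe:find_next_node(n, [a,n], [a])` returns `n` 
instead of raising a badmatch, while the regular ring behaviour is unchanged, 
e.g. `nn_safe:find_next_node(a, [a,b,c], [a,b,c])` still returns `b`.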
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
