nickva commented on PR #5193: URL: https://github.com/apache/couchdb/pull/5193#issuecomment-2304903172
It is a bit cute to use a replicate to self as a fallback. I was mainly relying on the existing "replicate to self" being a no-op in https://github.com/apache/couchdb/blob/637fb79f56af6371f4522eb34467601be726c94c/src/mem3/src/mem3_sync.erl#L79-L82. With `nonode` we would have had to add both special cases, one for self and one for `nonode`, in various places.

I built a small test module to play with the previous logic, wondering the same thing: why didn't we see this before?

```erlang
-module(nn).

-export([
    find_next_node/3
]).

find_next_node(Self, LiveNodes, Mem3Nodes) ->
    AllNodes0 = lists:sort(Mem3Nodes),
    AllNodes1 = [X || X <- AllNodes0, lists:member(X, LiveNodes)],
    AllNodes = AllNodes1 ++ [hd(AllNodes1)],
    [_Self, Next | _] = lists:dropwhile(fun(N) -> N =/= Self end, AllNodes),
    Next.
```

```erlang
> c(nn).
> nn:find_next_node(n, [a,n], [a]).
** exception error: no match of right hand side value []
     in function  nn:find_next_node/3 (nn.erl, line 11)
```

So one case where we'd trigger this is if the node we're on removes itself from the nodes list. The user then reported the logs being filled and the machine being "frozen". The sequence must have been: the initial sync started while the node was still in the `mem3:nodes()` list, then the node was removed, and `initial_sync` crashed with this error:

```
[error] 2024-08-21T19:02:06.680089Z [email protected] emulator -------- Error in process <0.30202.42> on node '[email protected]' with exit value: {{badmatch,[]},[{mem3_sync,find_next_node,0,[{file,"src/mem3_sync.erl"},{line,309}]},{mem3_sync,sync_nodes_and_dbs,0,[{file,"src/mem3_sync.erl"},{line,265}]},{mem3_sync,initial_sync,1,[{file,"src/mem3_sync.erl"},{line,272}]}]}
```

On a crash we end up restarting it (https://github.com/apache/couchdb/blob/d38f14f7d777b7cda79b9862ee304150ad3418ea/src/mem3/src/mem3_sync_nodes.erl#L75-L78), and from then on it will just keep crashing.
So I opted to fold both the self and the "odd" cases like that into the already existing "no-op" case so the crash cycle due to this reason doesn't happen.
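
As a rough illustration of that idea, here is a guarded variant of the small `nn` test module. This is only a sketch and not the code in the PR: the module name `nn_safe` is hypothetical, and the fallback simply returns `Self`, on the assumption that the caller already treats "replicate to self" as a no-op.

```erlang
-module(nn_safe).

-export([
    find_next_node/3
]).

%% Hypothetical guarded variant of nn:find_next_node/3 (illustration only).
%% Any "odd" case falls back to Self instead of raising a badmatch.
find_next_node(Self, LiveNodes, Mem3Nodes) ->
    AllNodes0 = lists:sort(Mem3Nodes),
    AllNodes1 = [X || X <- AllNodes0, lists:member(X, LiveNodes)],
    case lists:dropwhile(fun(N) -> N =/= Self end, AllNodes1) of
        [Self, Next | _] ->
            %% Normal case: the node right after Self in sorted order.
            Next;
        [Self] ->
            %% Self is last: wrap around to the first node (possibly Self).
            hd(AllNodes1);
        [] ->
            %% Self is not among the live mem3 nodes (or the list is empty):
            %% return Self so the caller ends up in the no-op case.
            Self
    end.
```

With this variant, `nn_safe:find_next_node(n, [a,n], [a])` returns `n` rather than raising the `badmatch` shown earlier, while the ordinary cases keep the same ring order as the test module.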
