Ádám Bakai has posted comments on this change. ( http://gerrit.cloudera.org:8080/21965 )
Change subject: KUDU-3605 [TestSecurity] Start Tservers during test
......................................................................

Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21965/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21965/2//COMMIT_MSG@10
PS2, Line 10: Running master
: servers without tserver can lead to undefined behavior.
> Why?  And what exactly was going on if not starting tserver(s)?
I get this error randomly (in about 1% of the dist-test runs; I couldn't reproduce it in my dev environment):

"E20240808 13:14:38.568001 1477 master.cc:392] Unable to init master catalog manager: Not found: Unable to initialize catalog manager: Failed to initialize sys tables async: Unable to load consensus metadata for tablet 00000000000000000000000000000000: Unable to load consensus metadata for tablet 00000000000000000000000000000000: /tmp/dist-test-task6KabZO/test-tmp/mini-kudu-cluster6929255666629015678/master-1/wal/consensus-meta/00000000000000000000000000000000: No such file or directory (error 2)"

I found that even though the file was written, it sometimes wasn't found during the next master start. I tried to investigate the issue further, but I couldn't find the exact reason for this behaviour. My ideas about the root cause:

1. The WritePBToPath writing mechanism (write, then rename) may not save the file if the whole process is killed.
2. It doesn't have anything to do with Kudu; it is some strange behaviour on Google Cloud machines.

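To make idea 1 concrete, here is a minimal sketch of what a write-then-rename update generally needs on POSIX for the file to be reliably present afterwards. This is a generic illustration, not Kudu's WritePBToPath: `DurableAtomicWrite`, `WriteAll`, `DirName` and the `.tmp` suffix are hypothetical names chosen for the example, and whether the Kudu code path in question performs each of these steps has not been verified here.

```cpp
// Generic POSIX sketch of a durable write-then-rename.
// NOT Kudu's WritePBToPath; all names below are hypothetical.
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Writes all of `data` to `fd`, retrying on short writes and EINTR.
static bool WriteAll(int fd, const std::string& data) {
  size_t off = 0;
  while (off < data.size()) {
    ssize_t n = write(fd, data.data() + off, data.size() - off);
    if (n < 0) {
      if (errno == EINTR) continue;
      return false;
    }
    off += static_cast<size_t>(n);
  }
  return true;
}

// Returns the directory part of `path` ("." if it has no slash).
static std::string DirName(const std::string& path) {
  size_t slash = path.find_last_of('/');
  if (slash == std::string::npos) return ".";
  if (slash == 0) return "/";
  return path.substr(0, slash);
}

// Replaces `path` with `data` via a temp file. Returns true on success.
bool DurableAtomicWrite(const std::string& path, const std::string& data) {
  assert(!path.empty());
  const std::string tmp = path + ".tmp";
  int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;

  // Step 1: write the temp file and fsync it so its contents reach disk.
  bool ok = WriteAll(fd, data) && fsync(fd) == 0;
  ok = (close(fd) == 0) && ok;

  // Step 2: rename over the target; readers see the old or the new file.
  if (!ok || rename(tmp.c_str(), path.c_str()) != 0) {
    unlink(tmp.c_str());
    return false;
  }

  // Step 3: fsync the parent directory so the new directory entry
  // created by the rename is itself durable.
  int dfd = open(DirName(path).c_str(), O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return false;
  ok = fsync(dfd) == 0;
  close(dfd);
  return ok;
}
```

If the writer is killed before step 2 completes, only the `.tmp` file exists and the target path is missing, which would look like the "No such file or directory" error above. That is one possible reading of idea 1, not a confirmed diagnosis.
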
My idea of why running tservers solves (or works around) the issue: ExternalCluster::Start contains this part:

    for (int i = 1; i <= opts_.num_tablet_servers; i++) {
      RETURN_NOT_OK_PREPEND(AddTabletServer(),
                            Substitute("failed to start tablet server $0", i));
    }
    RETURN_NOT_OK(WaitForTabletServerCount(
        opts_.num_tablet_servers,
        MonoDelta::FromSeconds(kTabletServerRegistrationTimeoutSeconds)));

This call waits until all the tablet servers have been found by the cluster, which presumably gives enough time for the consensus metadata files to be stored properly.

I agree that it may be a good idea to check whether running tservers helps other flaky test cases. The root cause hasn't been fully investigated yet, but I don't think it is very beneficial to dig deeper when there is a workaround that solves the problem. So yes: if a scenario is flaky and runs only masters, it's a good idea to try it with tservers; otherwise not.


-- 
To view, visit http://gerrit.cloudera.org:8080/21965
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie9389110f2bcaaec1743fff8f6e859295b30735c
Gerrit-Change-Number: 21965
Gerrit-PatchSet: 2
Gerrit-Owner: Ádám Bakai <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Ádám Bakai <[email protected]>
Gerrit-Comment-Date: Thu, 14 Nov 2024 16:28:55 +0000
Gerrit-HasComments: Yes
