Ádám Bakai has posted comments on this change. ( http://gerrit.cloudera.org:8080/21965 )

Change subject: KUDU-3605 [TestSecurity] Start Tservers during test
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/21965/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21965/2//COMMIT_MSG@10
PS2, Line 10: Running master
            : servers without tserver can lead to undefined behavior.
> Why?  And what exactly was going on if not starting tserver(s)?
I see this error intermittently (in about 1% of the dist test runs; I couldn't 
reproduce it in my dev environment): "E20240808 13:14:38.568001  1477 master.cc:392] 
Unable to init master catalog manager: Not found: Unable to initialize catalog 
manager: Failed to initialize sys tables async: Unable to load consensus 
metadata for tablet 00000000000000000000000000000000: Unable to load consensus 
metadata for tablet 00000000000000000000000000000000: 
/tmp/dist-test-task6KabZO/test-tmp/mini-kudu-cluster6929255666629015678/master-1/wal/consensus-meta/00000000000000000000000000000000:
 No such file or directory (error 2)"
I found that even though the file was written, sometimes it wasn't found during 
the next master start.

I tried to investigate the issue further, but I couldn't find the exact cause 
of this behaviour.

My ideas about the root cause:
1. The write-then-rename mechanism in WritePBToPath may not persist the file if 
the whole process is killed.
2. It has nothing to do with Kudu; it is some strange behaviour of the Google 
Cloud machines.
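
To make idea 1 concrete, here is a minimal POSIX sketch of a write-then-rename 
sequence. It is not Kudu's WritePBToPath implementation; the function name 
WriteFileDurably and the ".tmp" suffix are hypothetical. It only illustrates 
that the final name is created by the rename step, so a process killed before 
that step leaves no file under the final name, and that surviving a machine 
crash additionally needs fsync on both the file and its parent directory.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: writes `data` to `dir/name` via a temporary file.
// Returns 0 on success or an errno value on failure.
int WriteFileDurably(const std::string& dir, const std::string& name,
                     const std::string& data) {
  assert(!dir.empty() && !name.empty());
  const std::string final_path = dir + "/" + name;
  const std::string tmp_path = final_path + ".tmp";

  int fd = open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return errno;

  size_t written = 0;
  while (written < data.size()) {
    ssize_t n = write(fd, data.data() + written, data.size() - written);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry the write
      int err = errno;
      close(fd);
      unlink(tmp_path.c_str());
      return err;
    }
    written += static_cast<size_t>(n);
  }

  // Flush the contents before the rename, so the final name never refers
  // to a partially written file after a crash.
  if (fsync(fd) != 0) {
    int err = errno;
    close(fd);
    unlink(tmp_path.c_str());
    return err;
  }
  if (close(fd) != 0) {
    int err = errno;
    unlink(tmp_path.c_str());
    return err;
  }

  // Until this call succeeds, nothing exists under the final name.
  if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
    int err = errno;
    unlink(tmp_path.c_str());
    return err;
  }

  // Flush the parent directory so the rename itself is persisted.
  int dir_fd = open(dir.c_str(), O_RDONLY);
  if (dir_fd < 0) return errno;
  int err = (fsync(dir_fd) != 0) ? errno : 0;
  close(dir_fd);
  return err;
}
```

For example, WriteFileDurably("/tmp", "cmeta-sketch", "payload") would leave 
/tmp/cmeta-sketch in place and no /tmp/cmeta-sketch.tmp behind. Whether any of 
these steps is actually the one failing in the dist test runs is not something 
I have verified.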

My idea of why running tservers solves (or works around) the issue:
In ExternalMiniCluster::Start, there is this part:

  for (int i = 1; i <= opts_.num_tablet_servers; i++) {
    RETURN_NOT_OK_PREPEND(AddTabletServer(),
                          Substitute("failed to start tablet server $0", i));
  }
  RETURN_NOT_OK(WaitForTabletServerCount(
                  opts_.num_tablet_servers,
                  MonoDelta::FromSeconds(kTabletServerRegistrationTimeoutSeconds)));



This call waits until all the tablet servers have been found by the cluster, 
which presumably provides enough time for the consensus meta files to be 
properly stored.
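
As a rough illustration of why such a wait adds delay to cluster startup, here 
is a generic poll-until-count-or-timeout loop. It is only a sketch and not 
Kudu's WaitForTabletServerCount; the name WaitForCount, the callback-based 
interface and the poll interval are assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `get_count` until it reports at least
// `expected` items or until `timeout` elapses. Returns true on success.
bool WaitForCount(const std::function<int()>& get_count, int expected,
                  std::chrono::milliseconds timeout,
                  std::chrono::milliseconds poll_interval =
                      std::chrono::milliseconds(1)) {
  assert(expected >= 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (get_count() >= expected) return true;
    if (std::chrono::steady_clock::now() >= deadline) return false;
    // Startup is blocked here; anything happening in the background
    // (such as file writes in other processes) gets time to finish.
    std::this_thread::sleep_for(poll_interval);
  }
}
```

A caller would pass a callback that reports how many servers have registered 
so far. The point is only that startup does not return until the count is 
reached, so it finishes later than it would in a masters-only cluster.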

I agree that it may be a good idea to check whether running tservers helps 
flaky test cases. I haven't investigated this fully, but I don't think digging 
deeper is very beneficial when there is a workaround that solves the problem. 
So yes: if a scenario is flaky and runs only masters, it's a good idea to try 
it with tservers; otherwise not.



--
To view, visit http://gerrit.cloudera.org:8080/21965
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie9389110f2bcaaec1743fff8f6e859295b30735c
Gerrit-Change-Number: 21965
Gerrit-PatchSet: 2
Gerrit-Owner: Ádám Bakai <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Ádám Bakai <[email protected]>
Gerrit-Comment-Date: Thu, 14 Nov 2024 16:28:55 +0000
Gerrit-HasComments: Yes
