[ 
https://issues.apache.org/jira/browse/CASSANDRA-15614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053734#comment-17053734
 ] 

Ekaterina Dimitrova commented on CASSANDRA-15614:
-------------------------------------------------

During the work on this ticket I observed we have also 
[test_bootstrap_binary_disabled|https://builds.apache.org/view/A-D/view/Cassandra%20trunk/job/Cassandra-trunk-dtest/lastBuild/testReport/bootstrap_test/TestBootstrap/test_bootstrap_binary_disabled]
 covering the same test and also being flakey.

I think we can remove this one as being redundant and just mark that the other 
one tests ALSO the resumable bootstrap. 

Unfortunately, the problem is still the same but at least we don't have it 
double damage. The test does not look deterministic to me. I don't see a 100% 
way to ensure the bootstrap won't complete at some point earlier. 

Now it completed successfully 100 times (log attached) but I think in very rare 
cases it might fail again.

As pointed in Jenkins - the test was 96% stable, now I guess it would be like 
99%. 

[~jrwest] being the Shepard for testing, any advice? Can we leave this one with 
a comment on the flakiness? 

Or maybe I can add failure message, in case we see the stream failure happening 
when the bootstrap was already over? It will provide the reason and an advice 
for rerun. So people know not to investigate more? Please advise.

[Pull request|https://github.com/ekaterinadimitrova2/cassandra-dtest/pull/3]

[Branch for the 
patch|https://github.com/ekaterinadimitrova2/cassandra-dtest/tree/CASSANDRA-15614]

 

> Flakey test - TestBootstrap.test_resumable_bootstrap 
> -----------------------------------------------------
>
>                 Key: CASSANDRA-15614
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15614
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest
>            Reporter: Ekaterina Dimitrova
>            Assignee: Ekaterina Dimitrova
>            Priority: Normal
>             Fix For: 4.0-alpha
>
>         Attachments: test_resumable_bootstrap-CASSANDRA-15614.txt
>
>
> Fails 2 out of 10 times on Mac/Java 8
> {noformat}
> ==================================================================================================================
>  FAILURES 
> ==================================================================================================================
> *___________________________________________________________________________________________________
>  TestBootstrap.test_resumable_bootstrap 
> ___________________________________________________________________________________________________*
>  
> self = <bootstrap_test.TestBootstrap object at 0x10fd488d0>
>  
>     *@since('2.2')*
>     *def test_resumable_bootstrap(self):*
>         *"""*
>             *Test resuming bootstrap after data streaming failure*
>             *"""*
>         *cluster = self.cluster*
>         *cluster.populate(2)*
>     **    
>         *node1 = cluster.nodes['node1']*
>         *# set up byteman*
>         *node1.byteman_port = '8100'*
>         *node1.import_config_files()*
>     **    
>         *cluster.start(wait_other_notice=True)*
>         *# kill stream to node3 in the middle of streaming to let it fail*
>         *if cluster.version() < '4.0':*
>             *node1.byteman_submit([self.byteman_submit_path_pre_4_0])*
>         *else:*
>             *node1.byteman_submit([self.byteman_submit_path_4_0])*
>         *node1.stress(['write', 'n=1K', 'no-warmup', 'cl=TWO', '-schema', 
> 'replication(factor=2)', '-rate', 'threads=50'])*
>         *cluster.flush()*
>     **    
>         *# start bootstrapping node3 and wait for streaming*
>         *node3 = new_node(cluster)*
>         *node3.start(wait_other_notice=False)*
>     **    
>         *# let streaming fail as we expect*
> *>       node3.watch_log_for('Some data streaming failed')*
>  
> *bootstrap_test.py*:365: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ 
>  
> self = <ccmlib.node.Node object at 0x10fdf1810>, exprs = 'Some data streaming 
> failed', from_mark = None, timeout = 600, process = None, verbose = False, 
> filename = 'system.log'
>  
>     *def watch_log_for(self, exprs, from_mark=None, timeout=600, 
> process=None, verbose=False, filename='system.log'):*
>         *"""*
>             *Watch the log until one or more (regular) expression are found.*
>             *This methods when all the expressions have been found or the 
> method*
>             *timeouts (a TimeoutError is then raised). On successful 
> completion,*
>             *a list of pair (line matched, match object) is returned.*
>             *"""*
>         *start = time.time()*
>         *tofind = [exprs] if isinstance(exprs, string_types) else exprs*
>         *tofind = [re.compile(e) for e in tofind]*
>         *matchings = []*
>         *reads = ""*
>         *if len(tofind) == 0:*
>             *return None*
>     **    
>         *log_file = os.path.join(self.get_path(), 'logs', filename)*
>         *output_read = False*
>         *while not os.path.exists(log_file):*
>             *time.sleep(.5)*
>             *if start + timeout < time.time():*
>                 *raise TimeoutError(time.strftime("%d %b %Y %H:%M:%S", 
> time.gmtime()) + " [" + self.name + "] Timed out waiting for {} to be 
> created.".format(log_file))*
>             *if process and not output_read:*
>                 *process.poll()*
>                 *if process.returncode is not None:*
>                     *self.print_process_output(self.name, process, verbose)*
>                     *output_read = True*
>                     *if process.returncode != 0:*
>                         *raise RuntimeError()  # Shouldn't reuse RuntimeError 
> but I'm lazy*
>     **    
>         *with open(log_file) as f:*
>             *if from_mark:*
>                 *f.seek(from_mark)*
>     **    
>             *while True:*
>                 *# First, if we have a process to check, then check it.*
>                 *# Skip on Windows - stdout/stderr is cassandra.bat*
>                 *if not common.is_win() and not output_read:*
>                     *if process:*
>                         *process.poll()*
>                         *if process.returncode is not None:*
>                             *self.print_process_output(self.name, process, 
> verbose)*
>                             *output_read = True*
>                             *if process.returncode != 0:*
>                                 *raise RuntimeError()  # Shouldn't reuse 
> RuntimeError but I'm lazy*
>     **    
>                 *line = f.readline()*
>                 *if line:*
>                     *reads = reads + line*
>                     *for e in tofind:*
>                         *m = e.search(line)*
>                         *if m:*
>                             *matchings.append((line, m))*
>                             *tofind.remove(e)*
>                             *if len(tofind) == 0:*
>                                 *return matchings[0] if isinstance(exprs, 
> string_types) else matchings*
>                 *else:*
>                     *# yep, it's ugly*
>                     *time.sleep(1)*
>                     *if start + timeout < time.time():*
> *>                       raise TimeoutError(time.strftime("%d %b %Y 
> %H:%M:%S", time.gmtime()) + " [" + self.name + "] Missing: " + str([e.pattern 
> for e in tofind]) + ":\n" + reads[:50] + ".....\nSee {} for 
> remainder".format(filename))*
> *E                       ccmlib.node.TimeoutError: 02 Mar 2020 17:16:34 
> [node3] Missing: ['Some data streaming failed']:*
> *E                       INFO  [main] 2020-03-02 12:06:34,963 
> YamlConfigura.....*
> *E                       See system.log for remainder*
>  
> *../../dtest-new2/src/ccm/ccmlib/node.py*:536: TimeoutError
> -----------------------------------------------------------------------------------------------------------
>  Captured stdout setup 
> ------------------------------------------------------------------------------------------------------------
> 11:55:37,343 ccm DEBUG Log-watching thread starting.
> -------------------------------------------------------------------------------------------------------------
>  Captured log setup 
> -------------------------------------------------------------------------------------------------------------
> 11:55:37,239 conftest INFO Starting execution of test_resumable_bootstrap at 
> 2020-03-02 11:55:37.239467
> 11:55:37,240 dtest_setup INFO cluster ccm directory: 
> /var/folders/ql/nvcz74bd67d3vhw7227mpjm40000gp/T/dtest-fzxwwt9c
> ------------------------------------------------------------------------------------------------------------
>  Captured stdout call 
> ------------------------------------------------------------------------------------------------------------
> install rule inject stream failure
>  
> ----------------------------------------------------------------------------------------------------------
>  Captured stdout teardown 
> ----------------------------------------------------------------------------------------------------------
> 12:06:04,870 ccm DEBUG Log-watching thread exiting.
> -----------------------------------------------------------------------------------------------------------
>  Captured stdout setup 
> ------------------------------------------------------------------------------------------------------------
> 12:06:06,20 ccm DEBUG Log-watching thread starting.
> -------------------------------------------------------------------------------------------------------------
>  Captured log setup 
> -------------------------------------------------------------------------------------------------------------
> 12:06:05,887 conftest INFO Starting execution of test_resumable_bootstrap at 
> 2020-03-02 12:06:05.887592
> 12:06:05,888 dtest_setup INFO cluster ccm directory: 
> /var/folders/ql/nvcz74bd67d3vhw7227mpjm40000gp/T/dtest-yntcltkx
> ------------------------------------------------------------------------------------------------------------
>  Captured stdout call 
> ------------------------------------------------------------------------------------------------------------
> install rule inject stream failure
>  
> ----------------------------------------------------------------------------------------------------------
>  Captured stdout teardown 
> ----------------------------------------------------------------------------------------------------------
> 12:06:04,870 ccm DEBUG Log-watching thread exiting.
> ----------------------------------------------------------------------------------------------------------
>  Captured stdout teardown 
> ----------------------------------------------------------------------------------------------------------
> 12:06:04,870 ccm DEBUG Log-watching thread exiting.
> ----------------------------------------------------------------------------------------------------------
>  Captured stdout teardown 
> ----------------------------------------------------------------------------------------------------------
> 12:16:34,818 ccm DEBUG Log-watching thread exiting.
> ===Flaky Test Report===
>  
> test_resumable_bootstrap failed (1 runs remaining out of 2).
> <class 'ccmlib.node.TimeoutError'>
> 02 Mar 2020 17:06:04 [node3] Missing: ['Some data streaming failed']:
> INFO  [main] 2020-03-02 11:56:05,081 YamlConfigura.....
> See system.log for remainder
> [<TracebackEntry 
> /Users/ekaterina.dimitri/IdeaProjects/cassandra-dtest-d/bootstrap_test.py:365>,
>  <TracebackEntry 
> /Users/ekaterina.dimitri/dtest-new2/src/ccm/ccmlib/node.py:536>]
> test_resumable_bootstrap failed; it passed 0 out of the required 1 times.
> <class 'ccmlib.node.TimeoutError'>
> 02 Mar 2020 17:16:34 [node3] Missing: ['Some data streaming failed']:
> INFO  [main] 2020-03-02 12:06:34,963 YamlConfigura.....
> See system.log for remainder
> [<TracebackEntry 
> /Users/ekaterina.dimitri/IdeaProjects/cassandra-dtest-d/bootstrap_test.py:365>,
>  <TracebackEntry 
> /Users/ekaterina.dimitri/dtest-new2/src/ccm/ccmlib/node.py:536>]
>  
> ===End Flaky Test Report===
> *========================================================================================================
>  1 failed in 1258.32 seconds 
> =========================================================================================================*
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to