[jira] [Commented] (CASSANDRA-11608) dtest failure in replace_address_test.TestReplaceAddress.replace_first_boot_test

2016-04-20 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250594#comment-15250594
 ] 

Philip Thompson commented on CASSANDRA-11608:
---------------------------------------------

I see now the bad start is on line 216, not 182, which is why my fix had no effect.

> dtest failure in 
> replace_address_test.TestReplaceAddress.replace_first_boot_test
> ---------------------------------------------------------------
>
> Key: CASSANDRA-11608
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11608
> Project: Cassandra
>  Issue Type: Test
>Reporter: Jim Witschey
>Assignee: Philip Thompson
>  Labels: dtest
>
> This looks like a timeout kind of flap. It's flapped once. Example failure:
> http://cassci.datastax.com/job/cassandra-2.2_offheap_dtest/344/testReport/replace_address_test/TestReplaceAddress/replace_first_boot_test
> Failed on CassCI build cassandra-2.2_offheap_dtest #344 - 2.2.6-tentative
> {code}
> Error Message
> 15 Apr 2016 16:23:41 [node3] Missing: ['127.0.0.4.* now UP']:
> INFO  [main] 2016-04-15 16:21:32,345 Config.java:4.
> See system.log for remainder
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /mnt/tmp/dtest-4i5qkE
> dtest: DEBUG: Custom init_config not found. Setting defaults.
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
> 'memtable_allocation_type': 'offheap_objects',
> 'num_tokens': '32',
> 'phi_convict_threshold': 5,
> 'start_rpc': 'true'}
> dtest: DEBUG: Starting cluster with 3 nodes.
> dtest: DEBUG: 32
> dtest: DEBUG: Inserting Data...
> dtest: DEBUG: Stopping node 3.
> dtest: DEBUG: Testing node stoppage (query should fail).
> dtest: DEBUG: Retrying read after timeout. Attempt #0
> dtest: DEBUG: Retrying read after timeout. Attempt #1
> dtest: DEBUG: Retrying request after UE. Attempt #2
> dtest: DEBUG: Retrying request after UE. Attempt #3
> dtest: DEBUG: Retrying request after UE. Attempt #4
> dtest: DEBUG: Starting node 4 to replace node 3
> dtest: DEBUG: Verifying querying works again.
> dtest: DEBUG: Verifying tokens migrated sucessfully
> dtest: DEBUG: ('WARN  [main] 2016-04-15 16:21:21,068 TokenMetadata.java:196 - 
> Token -3855903180169109916 changing ownership from /127.0.0.3 to 
> /127.0.0.4\n', <_sre.SRE_Match object at 0x7fd21c0e2370>)
> dtest: DEBUG: Try to restart node 3 (should fail)
> dtest: DEBUG: [('WARN  [GossipStage:1] 2016-04-15 16:21:22,942 
> StorageService.java:1962 - Host ID collision for 
> 75916cc0-86ec-4136-b336-862a49953616 between /127.0.0.3 and /127.0.0.4; 
> /127.0.0.4 is the new owner\n', <_sre.SRE_Match object at 0x7fd1f83555e0>)]
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
>   File "/home/automaton/cassandra-dtest/replace_address_test.py", line 212, 
> in replace_first_boot_test
> node4.start(wait_for_binary_proto=True)
>   File "/home/automaton/ccm/ccmlib/node.py", line 610, in start
> node.watch_log_for_alive(self, from_mark=mark)
>   File "/home/automaton/ccm/ccmlib/node.py", line 457, in watch_log_for_alive
> self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, 
> filename=filename)
>   File "/home/automaton/ccm/ccmlib/node.py", line 425, in watch_log_for
> raise TimeoutError(time.strftime("%d %b %Y %H:%M:%S", time.gmtime()) + " 
> [" + self.name + "] Missing: " + str([e.pattern for e in tofind]) + ":\n" + 
> reads[:50] + ".\nSee {} for remainder".format(filename))
> "15 Apr 2016 16:23:41 [node3] Missing: ['127.0.0.4.* now UP']:\nINFO  [main] 
> 2016-04-15 16:21:32,345 Config.java:4.\nSee system.log for 
> remainder\n >> begin captured logging << 
> \ndtest: DEBUG: cluster ccm directory: 
> /mnt/tmp/dtest-4i5qkE\ndtest: DEBUG: Custom init_config not found. Setting 
> defaults.\ndtest: DEBUG: Done setting configuration options:\n{   
> 'initial_token': None,\n'memtable_allocation_type': 'offheap_objects',\n  
>   'num_tokens': '32',\n'phi_convict_threshold': 5,\n'start_rpc': 
> 'true'}\ndtest: DEBUG: Starting cluster with 3 nodes.\ndtest: DEBUG: 
> 32\ndtest: DEBUG: Inserting Data...\ndtest: DEBUG: Stopping node 3.\ndtest: 
> DEBUG: Testing node stoppage (query should fail).\ndtest: DEBUG: Retrying 
> read after timeout. Attempt #0\ndtest: DEBUG: Retrying read after timeout. 
> Attempt #1\ndtest: DEBUG: Retrying request after UE. Attempt #2\ndtest: 
> DEBUG: Retrying request after UE. Attempt #3\ndtest: DEBUG: Retrying request 
> after UE. Attempt #4\ndtest: DEBUG: Starting node 4 to replace node 3\ndtest: 
> DEBUG: Verifying querying works again.\ndtest: DEBUG: Verifying tokens 
> migrated sucessfully\ndtest: DEBUG: ('WARN  [main] 

[jira] [Commented] (CASSANDRA-11608) dtest failure in replace_address_test.TestReplaceAddress.replace_first_boot_test

2016-04-20 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250495#comment-15250495
 ] 

Philip Thompson commented on CASSANDRA-11608:
---------------------------------------------

Looking at the logs, it seems node4 was added before node3 was fully recognized 
as down. I tried setting {{wait_other_notice=True}} on the {{node3.stop()}} 
call, but that didn't fix the issue. I haven't checked the logs from those 
runs yet.
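The race above can be sketched with the two log patterns from this failure. The mock below is illustrative only, not the real ccm API: it mirrors what {{watch_log_for}} in the stacktrace does when it raises {{TimeoutError}} with the list of missing patterns, and shows why waiting on the DOWN line before starting node4 is the intended ordering.

```python
import re

# Illustrative mock (not the real ccm implementation) of the log-watch
# ordering: stop(wait_other_notice=True) is meant to wait until the peers
# log node3 as DOWN before the test starts node4; node4.start() then waits
# for the replacement to be logged as UP.
DOWN_PATTERN = r"127\.0\.0\.3.* now DOWN"
UP_PATTERN = r"127\.0\.0\.4.* now UP"

def missing_patterns(log_text, patterns):
    """Return the patterns not yet present in log_text; ccm's
    watch_log_for raises TimeoutError carrying this list on expiry."""
    return [p for p in patterns if not re.search(p, log_text)]

peer_log = "INFO  Gossiper.java - InetAddress /127.0.0.3 is now DOWN\n"

# node3's DOWN state has propagated, so a wait on DOWN_PATTERN succeeds...
assert missing_patterns(peer_log, [DOWN_PATTERN]) == []
# ...but node4 never shows up as UP, which is exactly the timeout reported:
assert missing_patterns(peer_log, [UP_PATTERN]) == [UP_PATTERN]
```

If the race is real, the DOWN check passing before node4 starts is what {{wait_other_notice}} is supposed to guarantee; the failure suggests that guarantee was not in effect (or not sufficient) on this run.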
