[
https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633414#comment-13633414
]
Arya Goudarzi edited comment on CASSANDRA-5432 at 4/16/13 10:38 PM:
--------------------------------------------------------------------
I narrowed this down to the non-SSL storage port: it must be open on the
public IPs. Here are the steps to reproduce:
This is a working configuration:
Cassandra 1.1.10 Cluster with 12 nodes in us-east-1 and 12 nodes in us-west-2
Using Ec2MultiRegionSnitch, SSL enabled for DC_ONLY, and
NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
C* instances belong to a security group called 'cluster1'.
Security group 'cluster1' in each region is configured as follows:
Allow TCP:
7199 from cluster1 (JMX)
1024 - 65535 from cluster1 (JMX Random Ports)
7100 from cluster1 (Configured Normal Storage)
7103 from cluster1 (Configured SSL Storage)
9160 from cluster1 (Configured Thrift RPC Port)
9160 from <client_group>
For each node's public IP we also have this rule to enable cross-region
communication:
7103 from public_ip
The above is a functioning and happy setup. You run repair, and it finishes
successfully.
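For reference, here is a rough boto3 sketch of the working rule set above. It is only an illustration: the group name matches this setup, but the region and the public IPs are placeholders, and the 9160-from-client-group rule is left out.
{code:python}
# Sketch only: recreate the 'cluster1' intra-group rules plus the per-node
# cross-region rule described above. Region and public IPs are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Intra-group rules: allow these TCP ports from the 'cluster1' group itself.
INTRA_GROUP_PORTS = [
    (7199, 7199),   # JMX
    (1024, 65535),  # JMX random ports
    (7100, 7100),   # configured normal (non-SSL) storage port
    (7103, 7103),   # configured SSL storage port
    (9160, 9160),   # Thrift RPC
]
for from_port, to_port in INTRA_GROUP_PORTS:
    ec2.authorize_security_group_ingress(
        GroupName='cluster1',
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': from_port,
            'ToPort': to_port,
            'UserIdGroupPairs': [{'GroupName': 'cluster1'}],
        }],
    )

# Cross-region rule: SSL storage port (7103) from each node's public IP.
for public_ip in ['203.0.113.10', '203.0.113.11']:  # placeholder IPs
    ec2.authorize_security_group_ingress(
        GroupName='cluster1',
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': 7103,
            'ToPort': 7103,
            'IpRanges': [{'CidrIp': public_ip + '/32'}],
        }],
    )
{code}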
Broken Setup:
Upgrade to 1.2.4 without changing any of the above security group settings.
Run repair. The repair never receives the Merkle tree from the node itself and
hangs. See the description: that test was done with a single region and a
strategy of us-east-1:3, but the other settings were exactly the same.
Now, for each public_ip, add a rule like this to the 'cluster1' security
group:
Allow TCP: 7100 from public_ip
Run repair. Things will magically work now.
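To be explicit, the workaround amounts to the following (again just a sketch with placeholder IPs; only the 7100-from-public-IP rule is new compared to the working 1.1.10 setup):
{code:python}
# Sketch only: the extra rule that unblocks repair on 1.2.4, i.e. opening the
# non-SSL storage port (7100) from every node's public IP. IPs are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

node_public_ips = ['203.0.113.10', '203.0.113.11']  # one entry per node
for public_ip in node_public_ips:
    ec2.authorize_security_group_ingress(
        GroupName='cluster1',
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': 7100,
            'ToPort': 7100,
            'IpRanges': [{'CidrIp': public_ip + '/32'}],
        }],
    )
{code}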
If nothing in terms of ports and networking has changed in 1.2, why is the
above happening? I can reproduce it consistently.
This also affects gossip. If you don't have the JMX ports open on the public
IPs, gossip will not see any node except itself after a snap restart of all
nodes at once.
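If it helps anyone trying to reproduce this, a plain socket probe is a quick way to check which storage ports are actually reachable on the public IPs (sketch only, placeholder IPs):
{code:python}
# Sketch only: probe the storage ports on each node's public IP to see which
# ones the security groups actually let through. IPs are placeholders.
import socket

NODE_PUBLIC_IPS = ['203.0.113.10', '203.0.113.11']
PORTS = {7100: 'storage (non-SSL)', 7103: 'storage (SSL)'}

for ip in NODE_PUBLIC_IPS:
    for port, label in PORTS.items():
        try:
            with socket.create_connection((ip, port), timeout=3):
                print('%s:%d (%s) reachable' % (ip, port, label))
        except OSError as exc:
            print('%s:%d (%s) NOT reachable: %s' % (ip, port, label, exc))
{code}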
was (Author: arya):
I narrowed this down to the JMX port range: these ports must be open on the
public IPs. Here are the steps to reproduce:
This is a working configuration:
Cassandra 1.1.10 Cluster with 12 nodes in us-east-1 and 12 nodes in us-west-2
Using Ec2MultiRegionSnitch, SSL enabled for DC_ONLY, and
NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
C* instances belong to a security group called 'cluster1'.
Security group 'cluster1' in each region is configured as follows:
Allow TCP:
7199 from cluster1 (JMX)
1024 - 65535 from cluster1 (JMX Random Ports)
7100 from cluster1 (Configured Normal Storage)
7103 from cluster1 (Configured SSL Storage)
9160 from cluster1 (Configured Thrift RPC Port)
9160 from <client_group>
For each node's public IP we also have this rule to enable cross-region
communication:
7103 from public_ip
The above is a functioning and happy setup. You run repair, and it finishes
successfully.
Broken Setup:
Upgrade to 1.2.4 without changing any of the above security group settings.
Run repair. The repair never receives the Merkle tree from the node itself and
hangs. See the description: that test was done with a single region and a
strategy of us-east-1:3, but the other settings were exactly the same.
Now, for each public_ip, add a rule like this to the 'cluster1' security
group:
Allow TCP: 1024 - 65535 from public_ip
Run repair. Things will magically work now.
If nothing in terms of ports and networking has changed in 1.2, why is the
above happening? I can reproduce it consistently.
This also affects gossip. If you don't have the JMX ports open on the public
IPs, gossip will not see any node except itself after a snap restart of all
nodes at once.
> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
> Key: CASSANDRA-5432
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.2.4
> Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
> Reporter: Arya Goudarzi
> Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I
> believe there is a bigger problem going on with networking, as described in
> that comment.
> Since I have upgraded our sandbox cluster, I am unable to run repair on any
> node and I am reaching our gc_grace seconds this weekend. Please help. So
> far, I have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. Didn't matter. All got stuck the same
> way.
> The repair command just gets stuck and the machine is idling. Only the
> following logs are printed for repair job:
> INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379)
> Starting repair command #4, repairing 1 ranges for keyspace
> cardspring_production
> INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java
> (line 652) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will
> sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range
> (1808575600,42535295865117307932921825930779602032] for
> keyspace_production.[comma separated list of CFs]
> INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java
> (line 858) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle
> trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56,
> /X.X.X.190])
> INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle
> tree for ColumnFamilyName from /X.X.X.43
> INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle
> tree for ColumnFamilyName from /X.X.X.56
> Please advise.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira