[ 
https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633414#comment-13633414
 ] 

Arya Goudarzi edited comment on CASSANDRA-5432 at 4/16/13 10:22 PM:
--------------------------------------------------------------------

I narrowed this down to the JMX port range; those ports must be open on the
public IPs. Here are the steps to reproduce:

This is a working configuration:
Cassandra 1.1.10 Cluster with 12 nodes in us-east-1 and 12 nodes in us-west-2
Using Ec2MultiRegionSnitch and SSL enabled for DC_ONLY and 
NetworkTopologyStrategy with strategy_options: us-east-1:3;us-west-2:3;
C* instances have a security group called 'cluster1'.
Security group 'cluster1' in each region is configured as follows.
Allow TCP:
7199 from cluster1 (JMX)
1024 - 65535 from cluster1 (JMX Random Ports)
7100 from cluster1 (Configured Normal Storage)
7103 from cluster1 (Configured SSL Storage)
9160 from cluster1 (Configured Thrift RPC Port)
9160 from <client_group>
For each node's public IP we also have this rule set to enable cross-region
communication:
7103 from public_ip

The above is a functioning and happy setup. You run repair, and it finishes 
successfully.
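For reference, the working rule set above can be sketched with the AWS CLI
(a sketch only; we actually applied these via the console, and <client_sg> and
<node_public_ip> are placeholders for our real values):

```shell
# Intra-cluster rules, sourced from the 'cluster1' group itself:
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 7199       --source-group cluster1    # JMX
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 1024-65535 --source-group cluster1    # JMX random ports
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 7100       --source-group cluster1    # configured normal storage
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 7103       --source-group cluster1    # configured SSL storage
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 9160       --source-group cluster1    # Thrift RPC
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 9160       --source-group <client_sg> # clients
# Repeated once per node, so cross-region traffic arriving on the public IP is allowed:
aws ec2 authorize-security-group-ingress --group-name cluster1 --protocol tcp --port 7103 --cidr <node_public_ip>/32
```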

Broken Setup:

Upgrade to 1.2.4 without changing any of the above security group settings:

Run repair. The repair never receives the Merkle tree for the node itself, and
so it hangs. See the description. The test in the description was done with one
region, with a strategy of us-east-1:3, but the other settings were exactly the
same.

Now, for each public_ip, add a rule like the following to the 'cluster1'
security group:

Allow TCP: 1024 - 65535 from public_ip

Run repair. Things will magically work now. 
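In AWS CLI terms, the workaround amounts to the following (again a sketch;
<node_public_ip> is a placeholder, and the rule is repeated once per node):

```shell
# Workaround: open the full 1024-65535 range to each node's public IP.
aws ec2 authorize-security-group-ingress --group-name cluster1 \
    --protocol tcp --port 1024-65535 --cidr <node_public_ip>/32
```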

If nothing has changed in 1.2 in terms of ports and networking, why is the
above happening? I can consistently reproduce it.

This also affects gossip. If you don't have the JMX ports open on the public
IPs, then after restarting all nodes at once, gossip will not see any node
except itself.


                
> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
>                 Key: CASSANDRA-5432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.4
>         Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
>            Reporter: Arya Goudarzi
>            Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I 
> believe there is a bigger problem going on with networking as described on 
> that comment. 
> Since I have upgraded our sandbox cluster, I am unable to run repair on any 
> node and I am reaching our gc_grace seconds this weekend. Please help. So 
> far, I have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. Didn't matter. All got stuck the same 
> way.
> The repair command just gets stuck and the machine is idling. Only the 
> following logs are printed for repair job:
>  INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) 
> Starting repair command #4, repairing 1 ranges for keyspace 
> cardspring_production
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java 
> (line 652) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will 
> sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range 
> (1808575600,42535295865117307932921825930779602032] for 
> keyspace_production.[comma separated list of CFs]
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java 
> (line 858) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle 
> trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, 
> /X.X.X.190])
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.43
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java 
> (line 214) [repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle 
> tree for ColumnFamilyName from /X.X.X.56
> Please advise. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
