Hi Lianjie,

We’ve fixed the socket file issue in 
https://github.com/Metaswitch/crest/issues/192, and it’s in the release that 
went out today (Yoshi’s Island).

For your install question:
We have some (not all) of the old releases available at 
http://repo.cw-ngv.com/archive/repo<Release Number> - for example, the current 
release 'Yoshi's Island' is at http://repo.cw-ngv.com/archive/repo67 (note 
that this number is always one greater than the tagged number of the release). 
Please note though that we may not always make the old releases available.
If you are building your own code, we tag each release we cut (e.g. 
https://github.com/Metaswitch/sprout/releases), so you can check out the code 
that was part of the release you're aiming for.

For your error code question:

Code 503:
A subscriber will receive this if Sprout or Bono is overloaded, or if requests 
to Sprout time out. Internally, Sprout will receive this error code from 
Homestead if Homestead is overloaded, and Sprout will also use this as the 
return code if its request to Homestead times out. Sprout will convert this 
error code to a 504, though, when it responds to the subscriber.

Code 403:
A subscriber can receive this for various reasons. These include:

- Homestead will return a 403 when the HSS rejects a REGISTER because the 
subscriber is roaming, and roaming is not allowed for that subscriber.

- Sprout will convert an error code from Homestead to a 403 when Homestead 
returns particular error codes (such as 404) on a REGISTER.

- Sprout will reject a REGISTER with incorrect authentication credentials 
with a 403.

- Sprout/Bono will reject untrusted requests (other than REGISTERs) with a 
403.
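As a rough illustration of the two conversions described above (Homestead's 
503 becoming a 504 to the subscriber, and particular codes such as 404 on a 
REGISTER becoming a 403), here is a hypothetical Python sketch. The real 
logic lives in Sprout's C++ code; the function and parameter names here are 
ours, not from the Clearwater codebase.

```python
# Hypothetical sketch of the status-code mappings described above.
# The actual behaviour is implemented in Sprout's C++ code; this
# function and its names are illustrative only.

def subscriber_response_code(homestead_status, is_register):
    """Map a Homestead response code to the code the subscriber sees."""
    if homestead_status == 503:
        # Homestead is overloaded (or the request timed out):
        # Sprout responds to the subscriber with a 504.
        return 504
    if is_register and homestead_status == 404:
        # Particular Homestead error codes on a REGISTER become a 403.
        return 403
    # Otherwise the code is passed through unchanged in this sketch.
    return homestead_status
```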

For your Homestead/Homestead-prov question:

I think there's some confusion between Homestead and Homestead-prov in your 
questions. Both of these processes run on the Homestead node, and they both 
talk to the same underlying Cassandra database, but they have different 
functions and they don't directly talk to each other.

Homestead-prov is our provisioning service. It's used when a subscriber is 
created (either through Ellis or directly through the Homestead-prov API - see 
https://github.com/Metaswitch/crest/blob/dev/docs/homestead_api.md), and it 
adds the subscriber to the Cassandra database. Homestead-prov is therefore 
only used if there's no external HSS. Homestead-prov is written in Python, and 
it lives in the crest repository (https://github.com/Metaswitch/crest).

Homestead is our HSS cache. It's the data-caching part of the S-CSCF (where 
Sprout provides the associated SIP routing function). It's used to get/store 
subscriber data from the Cassandra database when it receives requests from 
Sprout (SARs, MARs, UARs, LIRs) or the HSS (PPRs, RTRs). Homestead is written 
in C++, and it lives in the homestead repository 
(https://github.com/Metaswitch/homestead).

For your question about the token bucket changes, you’ve made the changes to 
the Homestead-prov code, but the logs you’ve posted are from the Homestead (not 
Homestead-prov) process. To change the Homestead load monitor options you will 
need to rebuild the Homestead code. The load monitor is set up at 
https://github.com/Metaswitch/homestead/blob/dev/src/main.cpp#L526, and we’ve 
got build instructions for Homestead at 
https://github.com/Metaswitch/homestead/blob/dev/docs/development.md. We’ve got 
an open issue to make these options configurable as well 
(https://github.com/Metaswitch/cpp-common/issues/199).
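For background, the load monitor's token bucket behaves roughly as in this 
Python sketch. This is an illustration only: the real implementation is the 
C++ LoadMonitor linked above, which additionally adjusts the fill rate based 
on measured latency against the target latency, and the class and parameter 
names here are ours.

```python
import time

class TokenBucket:
    """Minimal token-bucket sketch: a request is admitted only when a
    free token is available; tokens refill at `rate` per second, up to
    `max_size` (analogous to the LoadMonitor's bucket size and token
    fill rate shown in the Homestead logs)."""

    def __init__(self, rate, max_size):
        self.rate = rate            # tokens added per second
        self.max_size = max_size    # bucket capacity
        self.tokens = max_size      # start with a full bucket
        self.last_refill = time.time()

    def _refill(self):
        # Add tokens in proportion to the time since the last refill.
        now = time.time()
        self.tokens = min(self.max_size,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def admit(self):
        """Return True and consume a token if one is available."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```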

Hope this helps,

Ellie

From: Lianjie Cao [mailto:[email protected]]
Sent: 16 February 2015 19:42
To: Eleanor Merry
Cc: [email protected]; Sharma, Puneet
Subject: Re: [Clearwater] Problems with Sprout clustering and Homestead failure

Hi Ellie,

Thanks for the explanation! We are able to start Homer and Homestead-prov by 
manually stopping and starting them and removing the socket file. Hope this 
can be fixed in the next release.
And the Cassandra problem was due to the Homestead-prov database crashing. It 
started to work after I rebuilt it.

Since we are trying to do some quantitative parallel stress tests on our 
deployment, it would be great if we could stick to one specific release for 
consistency.

Q1: Is there a way to install a specific version of Clearwater?

For testing purposes, we usually initiate a large number of requests to 
Clearwater, so the Homestead node may receive a lot of queries in a short time.
In addition to the correct responses (200/401), we also receive unexpected 
responses such as 503 and 403.

Q2: Under what circumstances does Clearwater send 503/403 responses instead of 
the correct ones?

I understand Homestead-prov uses a token-based mechanism to control the number 
of requests sent to Homestead. A request can only be processed when free 
tokens are available, and the number of tokens increases at a given rate based 
on the time since the last increment.
I tried to change "loadmonitor = LoadMonitor(0.1, 20, 10, 10)" to "loadmonitor 
= LoadMonitor(0.1, 2000, 1000, 1000)" in 
/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/api/base.py
 and restarted the node, but the change didn't take effect.
The token rate still started from 10 as shown in the log:

16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:93: Constructing LoadMonitor
16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:94:    Target latency 
(usecs)   : 100000
16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:95:    Max bucket size      
    : 20
16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:96:    Initial token fill 
rate/s: 10.000000
16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:97:    Min token fill 
rate/s    : 10.000000

Q3. How should I change the initial token rate? And can you explain in more 
detail how Homestead and Homestead-prov work?

Thanks,
Lianjie

On Thu, Feb 12, 2015 at 3:10 PM, Eleanor Merry 
<[email protected]> wrote:
Hi Lianjie,

In the Ultima release, we added support for separated management and signalling 
networks (see 
https://github.com/Metaswitch/clearwater-docs/wiki/Multiple-Network-Support for 
more details), and as part of that we moved nginx to listen on port 8889/7888, 
and homer and homestead-prov to listen on socket files.

However, there was an issue with this. If Homer/Homestead-prov doesn't shut 
down cleanly, the socket file doesn't get deleted, meaning that the service 
can't be started again (see https://github.com/Metaswitch/crest/issues/192 for 
more details). We've got a fix for this that will go into the next release; in 
the meantime, can you delete the /tmp/.homestead-prov-sock-0 and 
/tmp/.homer-sock-0 files? This will allow Homer/Homestead-prov to start again.

I've not seen those Cassandra errors before. Are you able to use the Cassandra 
tools to fix up any corruption? I wouldn't have thought that recreating the 
subscribers would fix these errors, though.

Ellie



From: Lianjie Cao [mailto:[email protected]]
Sent: 10 February 2015 20:52
To: Eleanor Merry
Cc: [email protected]; Sharma, Puneet

Subject: Re: [Clearwater] Problems with Sprout clustering and Homestead failure

Hi Ellie,

Thanks a lot for pointing out the relation to Cassandra!
I changed the logging level of Homestead and Homestead-prov to 5, cleared out 
all the previous logs and restarted everything.
Here are the errors reported in /var/log/cassandra/system.log:

 INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,559 SSTableReader.java (line 
232) Opening 
/var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ic-16 
(317 bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,573 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more

INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,666 SSTableReader.java (line 232) 
Opening 
/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ic-31
 (6226 bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,670 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more

 INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,791 SSTableReader.java (line 
232) Opening 
/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ic-31 (3305 
bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,794 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more

 INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,885 SSTableReader.java (line 
232) Opening /var/lib/cassandra/data/system/local/system-local-ic-2 (120 bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,887 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more

 INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,889 SSTableReader.java (line 
232) Opening /var/lib/cassandra/data/system/local/system-local-ic-1 (357 bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,891 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more

 INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,894 SSTableReader.java (line 
232) Opening /var/lib/cassandra/data/system/local/system-local-ic-3 (109 bytes)
ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,896 CassandraDaemon.java (line 
191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.EOFException
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
        at 
org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
        at 
org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
        at 
org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
        at 
org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
        at java.io.DataInputStream.readUTF(DataInputStream.java:589)
        at java.io.DataInputStream.readUTF(DataInputStream.java:564)
        at 
org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
        ... 12 more




Cassandra reported errors when opening those SSTable files:

/var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ic-16
/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ic-31
/var/lib/cassandra/data/system/schema_columns/system-schema_columns-ic-31
/var/lib/cassandra/data/system/local/system-local-ic-2
/var/lib/cassandra/data/system/local/system-local-ic-1
/var/lib/cassandra/data/system/local/system-local-ic-3

Do you know what those files are?
I cross-checked with the Homestead log file, and it seems like during 
Cassandra initialization, Homestead reports "connect() failed: Connection 
refused".
After that, it reports "Cache caught unknown exception!"

10-02-2015 17:24:00.879 UTC Error cassandra_store.cpp:207: Cache caught 
TTransportException: connect() failed: Connection refused
10-02-2015 17:24:00.879 UTC Error main.cpp:550: Failed to initialize cache - rc 
3

10-02-2015 17:24:02.411 UTC Debug zmq_lvc.cpp:144: Enabled XPUB_VERBOSE mode
10-02-2015 17:24:02.411 UTC Error cassandra_store.cpp:207: Cache caught 
TTransportException: connect() failed: Connection refused
10-02-2015 17:24:02.413 UTC Error main.cpp:550: Failed to initialize cache - rc 
3

10-02-2015 17:24:16.154 UTC Error cassandra_store.cpp:217: Cache caught unknown 
exception!
10-02-2015 17:24:16.154 UTC Error main.cpp:550: Failed to initialize cache - rc 
5

10-02-2015 17:24:56.569 UTC Error cassandra_store.cpp:217: Cache caught unknown 
exception!
10-02-2015 17:24:56.572 UTC Debug statistic.cpp:93: Initializing 
inproc://H_hss_latency_us statistic reporter
10-02-2015 17:24:56.572 UTC Debug statistic.cpp:93: Initializing 
inproc://H_latency_us statistic reporter
10-02-2015 17:24:56.572 UTC Error main.cpp:550: Failed to initialize cache - rc 
5

Another problem is with Homer and Homestead-prov. Both logs show an "Address 
already in use" error.
After checking the port usage, I found that port 7888 on the Homer node and 
port 8889 on the Homestead node are both used by nginx, but they are supposed 
to be assigned to Homer and Homestead-prov.
Do you know how to fix this?


I am planning to rebuild the Homestead node and reinsert the numbers using the 
bulk-provisioning method. Do you think that would help?
Actually, we used to have a working deployment (Sprint Pacman) using the same 
configuration. Is there a way to install previous versions?

Full logs are attached.

Thanks,
Lianjie

On Mon, Feb 9, 2015 at 3:09 PM, Eleanor Merry 
<[email protected]> wrote:
Hi Lianjie,

I’m glad to hear that Sprout and Chronos are now working!

For the Cassandra issue, looking in the logs there are a number of cases of 
CorruptSSTableExceptions. I've not seen this before, but I believe you can use 
nodetool scrub or sstablescrub to fix up any corruption.

Also, how are you stopping Homer, Homestead-prov and Homestead? When you stop 
the service, you should stop both the service and its associated poll_* script 
(e.g. “sudo monit stop poll_homestead”), and you shouldn’t restart the service 
using “sudo service <service> restart”, as this can cause issues where two 
versions of the service start up.

Ellie


From: Lianjie Cao [mailto:[email protected]]
Sent: 06 February 2015 20:32
To: Eleanor Merry
Cc: [email protected]
Subject: Re: [Clearwater] Problems with Sprout clustering and Homestead failure

Hi Ellie,

Thanks a lot for the response!
I modified Sprout and Chronos configurations. They are working correctly now!

I checked Cassandra on Homestead node. The log does show a few errors during 
initialization. But it started successfully finally. The Cassandra, Homestead 
and Homestead-prov logs are attached.

Actually, I did run into the same problem before. But after rebooting Homestead 
node a few times, it works fine. So, I didn't dig into it.
Is it possible that the problem is due to some starting conflicts among 
Cassandra, Homestead and Homestead-prov?

Thanks,
Lianjie

On Wed, Feb 4, 2015 at 3:25 PM, Eleanor Merry 
<[email protected]> wrote:
Hi Lianjie,

Your configuration files aren't quite right.

The cluster_settings file should have the form servers=<address>,<address> - 
so in your case it would be 
"servers=192.168.1.21:11211,192.168.1.22:11211". This file should be identical 
on each Sprout node (so the Sprouts must be in the same order on each node).

The chronos.conf file should have one localhost entry, which is set to the IP 
address of the local node, and multiple node entries, which are set to the IP 
addresses of each node in the cluster. In your case, this would be (on sprout 
1):

[cluster]
localhost = 192.168.1.21
node = 192.168.1.21
node = 192.168.1.22

The order of the nodes must be the same on each node - so the file on sprout 2 
should be:

[cluster]
localhost = 192.168.1.22
node = 192.168.1.21
node = 192.168.1.22

Can you make these changes to the config files, and then reload Sprout and 
Chronos (sudo service <service> reload)?

In the logs below, Homestead has stopped because it couldn't contact cassandra:

04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught 
TTransportException: connect() failed: Connection refused
04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache - rc 
3
04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache

Can you check whether Cassandra is running reliably on the Homestead node? Does 
/var/monit/monit.log show that monit is restarting it, and are there any logs 
in /var/log/cassandra?

Ellie

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Lianjie Cao
Sent: 04 February 2015 19:37
To: [email protected]
Subject: [Clearwater] Problems with Sprout clustering and Homestead failure

Hi,

We recently built a Clearwater deployment with one Bono node, two Sprout nodes, 
one Homestead node, one Homer node and one Ralf node. However, we ran into 
some problems related to Homestead start failure and Sprout clustering.

*Sprout clustering:*
The manual installation instructions show that, for the latest version, Sprout 
clustering is done by Chronos. To add or remove a Sprout node, 
/etc/chronos/chronos.conf needs to be modified correspondingly.
However, we found that when we don't have a chronos.conf file, the two Sprout 
nodes seem to work fine after adding the IPs of the two Sprout nodes to 
/etc/clearwater/cluster_settings.

[sprout]cw@sprout-2:~$ cat /etc/clearwater/cluster_settings
servers=192.168.1.21:11211
servers=192.168.1.22:11211

But if we do add /etc/chronos/chronos.conf with the information of the two 
Sprout nodes as below, Chronos failed and no new log files were found under 
/var/log/chronos.

[sprout]cw@sprout-1:/var/log/chronos$ cat /etc/chronos/chronos.conf
[http]
bind-address = 0.0.0.0
bind-port = 7253

[logging]
folder = /var/log/chronos
level = 5

[cluster]
localhost = 192.168.1.21
node = localhost

sprout-2 = 192.168.1.22
node = sprout-2

[alarms]
enabled = true


[sprout]cw@sprout-1:~$ sudo monit status
The Monit daemon 5.8.1 uptime: 0m

Program 'poll_sprout'
  status                            Status ok
  monitoring status                 Monitored
  last started                      Wed, 04 Feb 2015 11:20:36
  last exit value                   0
  data collected                    Wed, 04 Feb 2015 11:20:36

Process 'sprout'
  status                            Running
  monitoring status                 Monitored
  pid                               1157
  parent pid                        1
  uid                               999
  effective uid                     999
  gid                               999
  uptime                            1m
  children                          0
  memory kilobytes                  42412
  memory kilobytes total            42412
  memory percent                    1.0%
  memory percent total              1.0%
  cpu percent                       0.4%
  cpu percent total                 0.4%
  data collected                    Wed, 04 Feb 2015 11:20:36

Program 'poll_memcached'
  status                            Status ok
  monitoring status                 Monitored
  last started                      Wed, 04 Feb 2015 11:20:36
  last exit value                   0
  data collected                    Wed, 04 Feb 2015 11:20:36

Process 'memcached'
  status                            Running
  monitoring status                 Monitored
  pid                               1092
  parent pid                        1
  uid                               108
  effective uid                     108
  gid                               114
  uptime                            1m
  children                          0
  memory kilobytes                  1180
  memory kilobytes total            1180
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 04 Feb 2015 11:20:36

Process 'clearwater_diags_monitor'
  status                            Running
  monitoring status                 Monitored
  pid                               1072
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            1m
  children                          1
  memory kilobytes                  1796
  memory kilobytes total            2172
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 04 Feb 2015 11:20:36

Process 'chronos'
  status                            Execution failed
  monitoring status                 Monitored
  data collected                    Wed, 04 Feb 2015 11:20:26

System 'sprout-1'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.20] [0.09] [0.04]
  cpu                               6.8%us 1.1%sy 0.0%wa
  memory usage                      116944 kB [2.8%]
  swap usage                        0 kB [0.0%]
  data collected                    Wed, 04 Feb 2015 11:20:26


Is this because we are not using Chronos in the right way, or are there other 
settings we need to change?
*Homestead Failure:*

When we use SIPp to perform user registration tests, we receive a "403 
Forbidden" response and we observe errors on both Sprout nodes.

[sprout]cw@sprout-1:~$ cat /var/log/sprout/sprout_current.txt
04-02-2015 18:54:50.884 UTC Warning acr.cpp:627: Failed to send Ralf ACR 
message (0x7fce241cd780), rc = 400
04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:573:
http://hs.hp-clearwater.com:8888/impi/6500000008%40hp-clearwater.com/av?impu=sip%3A6500000008%40hp-clearwater.com
failed at server 192.168.1.31 : Timeout was reached (28) : fatal
04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:688: cURL failure with 
cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
04-02-2015 18:54:51.083 UTC Error hssconnection.cpp:145: Failed to get 
Authentication Vector for 
[email protected]
04-02-2015 18:54:51.086 UTC Error httpconnection.cpp:688: cURL failure with 
cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
04-02-2015 18:54:51.086 UTC Warning acr.cpp:627: Failed to send Ralf ACR 
message (0x14322c0), rc = 400
04-02-2015 18:54:51.282 UTC Error httpconnection.cpp:573:
http://hs.hp-clearwater.com:8888/impi/6500000009%40hp-clearwater.com/av?impu=sip%3A6500000009%40hp-clearwater.com
failed at server 192.168.1.31 : Timeout was reached (28) : fatal
04-02-2015 18:54:51.283 UTC Error httpconnection.cpp:688: cURL failure with 
cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
04-02-2015 18:54:51.283 UTC Error hssconnection.cpp:145: Failed to get 
Authentication Vector for 
[email protected]
04-02-2015 18:54:51.286 UTC Error httpconnection.cpp:688: cURL failure with 
cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
04-02-2015 18:54:51.286 UTC Warning acr.cpp:627: Failed to send Ralf ACR 
message (0x7fce1c1fdef0), rc = 400 ....


It seems like Homestead is unreachable.
Then on Homestead node, if we check status using monit:

[homestead]cw@homestead-1:~$ sudo monit status
The Monit daemon 5.8.1 uptime: 15m

Process 'nginx'
  status                            Running
  monitoring status                 Monitored
  pid                               1044
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            15m
  children                          4
  memory kilobytes                  1240
  memory kilobytes total            8448
  memory percent                    0.0%
  memory percent total              0.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  port response time                0.000s to 127.0.0.1:80/ping [HTTP via TCP]
  data collected                    Wed, 04 Feb 2015 10:58:02

Program 'poll_homestead'
  status                            Status failed
  monitoring status                 Monitored
  last started                      Wed, 04 Feb 2015 10:58:02
  last exit value                   1
  data collected                    Wed, 04 Feb 2015 10:58:02

Process 'homestead'
  status                            Does not exist
  monitoring status                 Monitored
  data collected                    Wed, 04 Feb 2015 10:58:02

Program 'poll_homestead-prov'
  status                            Status ok
  monitoring status                 Monitored
  last started                      Wed, 04 Feb 2015 10:58:02
  last exit value                   0
  data collected                    Wed, 04 Feb 2015 10:58:02

Process 'homestead-prov'
  status                            Execution failed
  monitoring status                 Monitored
  data collected                    Wed, 04 Feb 2015 10:58:32

Process 'clearwater_diags_monitor'
  status                            Running
  monitoring status                 Monitored
  pid                               1027
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            16m
  children                          1
  memory kilobytes                  1664
  memory kilobytes total            2040
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Wed, 04 Feb 2015 10:58:32

Program 'poll_cassandra_ring'
  status                            Status ok
  monitoring status                 Monitored
  last started                      Wed, 04 Feb 2015 10:58:32
  last exit value                   0
  data collected                    Wed, 04 Feb 2015 10:58:32

Process 'cassandra'
  status                            Running
  monitoring status                 Monitored
  pid                               1280
  parent pid                        1277
  uid                               106
  effective uid                     106
  gid                               113
  uptime                            16m
  children                          0
  memory kilobytes                  1388648
  memory kilobytes total            1388648
  memory percent                    34.3%
  memory percent total              34.3%
  cpu percent                       0.4%
  cpu percent total                 0.4%
  data collected                    Wed, 04 Feb 2015 10:58:32

System 'homestead-1'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.04] [0.05]
  cpu                               3.0%us 0.8%sy 0.0%wa
  memory usage                      1505324 kB [37.1%]
  swap usage                        0 kB [0.0%]
  data collected                    Wed, 04 Feb 2015 10:58:32


And the log files show:

[homestead]cw@homestead-1:~$ cat /var/log/homestead-prov/homestead-prov-err.log
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py", line 156, in <module>
    standalone()
  File "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py", line 119, in standalone
    reactor.listenUNIX(unix_sock_name, application)
  File "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py", line 413, in listenUNIX
    p.startListening()
  File "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/unix.py", line 293, in startListening
    raise CannotListenError, (None, self.port, le)
twisted.internet.error.CannotListenError: Couldn't listen on any:/tmp/.homestead-prov-sock-0: [Errno 98] Address already in use.
......

[homestead]cw@homestead-1:~$ cat /var/log/homestead-prov/homestead-prov-0.log
2015-02-04 18:42:23,476 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
2015-02-04 18:42:24,087 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
2015-02-04 18:42:35,826 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0
2015-02-04 18:43:16,205 UTC INFO main:118 Going to listen for HTTP on UNIX socket /tmp/.homestead-prov-sock-0 ......
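The repeated start attempts above all fail on the same stale UNIX socket: if a previous homestead-prov exited without removing /tmp/.homestead-prov-sock-0, the file stays on disk and every later bind() raises EADDRINUSE (Errno 98) even though nothing is listening. A minimal Python sketch of the failure mode and the unlink-before-bind workaround (the paths and helper here are illustrative, not Clearwater's actual startup code):

```python
import errno
import os
import socket
import tempfile

def listen_unix(path):
    # Remove a stale socket file left behind by a crashed process before
    # binding; without this, bind() raises EADDRINUSE even when no
    # process is actually listening on the socket.
    try:
        os.unlink(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(1)
    return sock

path = os.path.join(tempfile.mkdtemp(), "demo.sock")

# First listener comes up, then "crashes": close() does NOT unlink the file.
first = listen_unix(path)
first.close()

# A naive re-bind hits the stale file, like the Twisted traceback above.
raw = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
got = None
try:
    raw.bind(path)
except OSError as e:
    got = e.errno
finally:
    raw.close()
assert got == errno.EADDRINUSE

# The unlink-first version recovers cleanly from the stale file.
second = listen_unix(path)
second.close()
os.unlink(path)
```

As a manual workaround under the same assumption, deleting the stale /tmp/.homestead-prov-sock-0 file before restarting homestead-prov should let it bind again.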

[homestead]cw@homestead-1:~$ cat /var/log/homestead/homestead_current.txt
04-02-2015 18:42:19.586 UTC Status main.cpp:468: Log level set to 2
04-02-2015 18:42:19.602 UTC Status main.cpp:489: Access logging enabled to /var/log/homestead
04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:93: Constructing LoadMonitor
04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:94:    Target latency (usecs)   : 100000
04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:95:    Max bucket size          : 20
04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:96:    Initial token fill rate/s: 10.000000
04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:97:    Min token fill rate/s    : 10.000000
04-02-2015 18:42:19.614 UTC Status dnscachedresolver.cpp:90: Creating Cached Resolver using server 127.0.0.1
04-02-2015 18:42:19.614 UTC Status httpresolver.cpp:50: Created HTTP resolver
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:145: Configuring store
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:146:   Hostname:  localhost
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:147:   Port:      9160
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:148:   Threads:   10
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:149:   Max Queue: 0
04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:199: Starting store
04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught TTransportException: connect() failed: Connection refused
04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache - rc 3
04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache
04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:226: Waiting for cache to stop ......
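The homestead log shows a different failure: the process started before Cassandra was accepting connections on its Thrift port (9160), got "Connection refused", and gave up. A hedged sketch of a readiness check under that assumption — `port_open` is a hypothetical helper, not part of Clearwater — which could be polled before restarting homestead:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connect to host:port succeeds within timeout.

    A False result for localhost:9160 would reproduce the 'connect()
    failed: Connection refused' that homestead logged: Cassandra was not
    yet accepting Thrift connections when homestead tried to start its
    cache.  (Hypothetical helper for illustration only.)
    """
    try:
        # create_connection performs the full TCP handshake, then we
        # immediately close the probe connection.
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False
```

If this returns False for localhost:9160 while monit shows the cassandra process running, Cassandra is probably still starting up; once it returns True, restarting homestead (or letting monit restart it) should allow the cache to initialise.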

And the port usage is:

[homestead]cw@homestead-1:~$ sudo netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9042          0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      952/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      827/sshd
tcp        0      0 127.0.0.1:7000          0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp        0      0 127.0.0.1:2812          0.0.0.0:*               LISTEN      1036/monit
tcp        0      0 0.0.0.0:37791           0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp        0      0 0.0.0.0:53313           0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN      1280/jsvc.exec
tcp6       0      0 :::53                   :::*                    LISTEN      952/dnsmasq
tcp6       0      0 :::22                   :::*                    LISTEN      827/sshd
tcp6       0      0 :::8889                 :::*                    LISTEN      1044/nginx
tcp6       0      0 :::80                   :::*                    LISTEN      1044/nginx
udp        0      0 0.0.0.0:13344           0.0.0.0:*                           952/dnsmasq
udp        0      0 0.0.0.0:48567           0.0.0.0:*                           952/dnsmasq
udp        0      0 0.0.0.0:53              0.0.0.0:*                           952/dnsmasq
udp        0      0 0.0.0.0:41016           0.0.0.0:*                           952/dnsmasq
udp        0      0 0.0.0.0:68              0.0.0.0:*                           634/dhclient3
udp        0      0 192.168.1.31:123        0.0.0.0:*                           791/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                           791/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                           791/ntpd
udp6       0      0 :::53                   :::*                                952/dnsmasq
udp6       0      0 fe80::f816:3eff:fe7:123 :::*                                791/ntpd
udp6       0      0 ::1:123                 :::*                                791/ntpd
udp6       0      0 :::123                  :::*                                791/ntpd



So, how should we fix the problems with Homestead and Homestead-prov?

Best regards,
Lianjie
_______________________________________________
Clearwater mailing list
[email protected]<mailto:[email protected]>
http://lists.projectclearwater.org/listinfo/clearwater


