Re: [Clearwater] Problems with Sprout clustering and Homestead failure

Lianjie Cao Mon, 16 Feb 2015 11:43:26 -0800

Hi Ellie,



Thanks for the explanation! We are able to start Homer and Homestead-prov
by manually stop, start and removing the socket file. Hope this can be
fixed in the next release.

And the Cassandra problem is due to the Homestead-prov database crash. It
started to work after I rebulit it.



Since we are trying to do some quantitative parallel stress tests on our
deployment, it would be great if we can stick to one specific release for
consistency purpose.



Q1: Is there a way to install a specific version of Clearwater?



For testing purpose, we usually initiates a large number of requests to
Clearwater. So, Homestead node may receive a lot of queries in a short time.

Instead of the correct responses (200/401), we also receive unexpected
responses such as 503 and 403.



Q2: Under what circumstance does Clearwater send 503/403 responses instead
of correct ones?



I understand Homestead-prov uses token-base mechanism to control the number
of requests sent to Homestead. A request can only be processed when free
tokens are available. And also the number of token increases based on a
rate and the time since last increment.

I tried to change "loadmonitor = LoadMonitor(0.1, 20, 10, 10)" to
"loadmonitor = LoadMonitor(0.1, 2000, 1000, 1000)" in
/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/api/base.py
and restarted node. But the change didn't work.

The token rate still started from 10 as shown in the log:



16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:93: Constructing
LoadMonitor

16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:94:    Target latency
(usecs)   : 100000

16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:95:    Max bucket
size          : 20

16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:96:    Initial token
fill rate/s: 10.000000

16-02-2015 01:34:15.835 UTC Status load_monitor.cpp:97:    Min token fill
rate/s    : 10.000000



Q3. How should I change the initial token rate? And can you explain how
Homestead and Homestead-prov work in more details?



Thanks,

Lianjie

On Thu, Feb 12, 2015 at 3:10 PM, Eleanor Merry <[email protected]
> wrote:

>  Hi Lianjie,
>
>
>
> In the Ultima release, we added support for separated management and
> signalling networks (see
> https://github.com/Metaswitch/clearwater-docs/wiki/Multiple-Network-Support
> for more details), and as part of that we moved nginx to listen on port
> 8889/7888, and homer and homestead-prov to listen on socket files.
>
>
>
> However, there was an issue with this. If Homer/Homestead-prov doesn’t
> shut down cleanly, the socket file isn’t get deleted, meaning that the
> service can’t be started again (see
> https://github.com/Metaswitch/crest/issues/192 for more details). We’ve
> got a fix for this that will go into the next release; in the meantime can
> you delete the /tmp/.homestead-prov-sock-0 and /tmp/.homer-sock-0 files.
> This will allow Homer/Homestead-prov to start again.
>
>
>
> I’ve not seen those Cassandra errors before. Are you able to use the
> cassandra tools to fix up any corruption? I wouldn’t have though that
> recreating the subscribers would fix these errors though.
>
>
>
> Ellie
>
>
>
>
>
> *From:* Lianjie Cao [mailto:[email protected]]
> *Sent:* 10 February 2015 20:52
> *To:* Eleanor Merry
> *Cc:* [email protected]; Sharma, Puneet
>
> *Subject:* Re: [Clearwater] Problems with Sprout clustering and Homestead
> failure
>
>
>
> Hi Ellie,
>
>
>
> Thanks a lot for pointing out the relation to Cassandra!
>
> I changed the logging level of Homestead and Homestead-prov to 5 and
> cleared out all the previous logs and restarted everthing.
>
> Here are the errors reported in /var/log/cassandra/system.log:
>
>
>
>   INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,559 SSTableReader.java
> (line 232) Opening
> /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ic-16
> (317 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,573 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
> INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,666 SSTableReader.java (line
> 232) Opening
> /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ic-31
> (6226 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,670 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
>  INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,791 SSTableReader.java
> (line 232) Opening
> /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ic-31
> (3305 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,794 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
>  INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,885 SSTableReader.java
> (line 232) Opening /var/lib/cassandra/data/system/local/system-local-ic-2
> (120 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,887 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
>  INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,889 SSTableReader.java
> (line 232) Opening /var/lib/cassandra/data/system/local/system-local-ic-1
> (357 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,891 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
>  INFO [SSTableBatchOpen:1] 2015-02-10 09:24:06,894 SSTableReader.java
> (line 232) Opening /var/lib/cassandra/data/system/local/system-local-ic-3
> (109 bytes)
>
> ERROR [SSTableBatchOpen:1] 2015-02-10 09:24:06,896 CassandraDaemon.java
> (line 191) Exception in thread Thread[SSTableBatchOpen:1,5,main]
>
> org.apache.cassandra.io.sstable.CorruptSSTableException:
> java.io.EOFException
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:108)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:63)
>
>         at
> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:42)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:418)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:209)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:157)
>
>         at
> org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:273)
>
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>
>         at
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>         at java.lang.Thread.run(Thread.java:701)
>
> Caused by: java.io.EOFException
>
>         at
> java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:589)
>
>         at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>
>         at
> org.apache.cassandra.io.compress.CompressionMetadata.<init>(CompressionMetadata.java:83)
>
>         ... 12 more
>
>
>
>
>
>
>
>
>
> Cassandra reported errors when opening those SSTable files:
>
>
>
>
> /var/lib/cassandra/data/system/schema_keyspaces/system-schema_keyspaces-ic-16
>
>
> /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-ic-31
>
> /var/lib/cassandra/data/system/schema_columns/system-schema_columns-ic-31
>
> /var/lib/cassandra/data/system/local/system-local-ic-2
>
> /var/lib/cassandra/data/system/local/system-local-ic-1
>
> /var/lib/cassandra/data/system/local/system-local-ic-3
>
>
>
> Do you know what are those files?
>
> I crosschecked with the Homestead log file. And it seems like during
> Cassandra initialization, Homestead report "connect() failed: Connection
> refused".
>
> After that, it reports "Cache caught unknown exception!"
>
>
>
>  10-02-2015 17:24:00.879 UTC Error cassandra_store.cpp:207: Cache caught
> TTransportException: connect() failed: Connection refused
>
> 10-02-2015 17:24:00.879 UTC Error main.cpp:550: Failed to initialize cache
> - rc 3
>
>
>
> 10-02-2015 17:24:02.411 UTC Debug zmq_lvc.cpp:144: Enabled XPUB_VERBOSE
> mode
>
> 10-02-2015 17:24:02.411 UTC Error cassandra_store.cpp:207: Cache caught
> TTransportException: connect() failed: Connection refused
>
> 10-02-2015 17:24:02.413 UTC Error main.cpp:550: Failed to initialize cache
> - rc 3
>
>
>
> 10-02-2015 17:24:16.154 UTC Error cassandra_store.cpp:217: Cache caught
> unknown exception!
>
> 10-02-2015 17:24:16.154 UTC Error main.cpp:550: Failed to initialize cache
> - rc 5
>
>
>
> 10-02-2015 17:24:56.569 UTC Error cassandra_store.cpp:217: Cache caught
> unknown exception!
>
> 10-02-2015 17:24:56.572 UTC Debug statistic.cpp:93: Initializing
> inproc://H_hss_latency_us statistic reporter
>
> 10-02-2015 17:24:56.572 UTC Debug statistic.cpp:93: Initializing
> inproc://H_latency_us statistic reporter
>
> 10-02-2015 17:24:56.572 UTC Error main.cpp:550: Failed to initialize cache
> - rc 5
>
>
>
> Another problem is about Homer and Homestead-prov. Both logs show "Address
> already in use." error.
>
> After checking the port usage, I found that Port 7888 on Homer node and
> port 8889 on Homestead are both used by nginx which are supposed to be
> assign to Homer and Homestead-prov.
>
> Do you know how to fix this?
>
>
>
>
>
> I am planning to rebuild the Homestead node and reinsert the numbers using
> Bulk-Provisioning method. Do you think if that would help?
>
> Acutually, we used to have a working deployment (Sprint Pacman) using the
> same configuration. Is there a way to install previous versions?
>
>
>
> Full logs are attached.
>
>
>
> Thanks,
>
> Lianjie
>
>
>
> On Mon, Feb 9, 2015 at 3:09 PM, Eleanor Merry <
> [email protected]> wrote:
>
> Hi Lianjie,
>
>
>
> I’m glad to hear that Sprout and Chronos are now working!
>
>
>
> For the cassandra issue, looking in the logs there’s a number of cases of
> CorruptSSTableExceptions. I’ve not seen this before, but I believe you can
> use nodetool scrub or sstablescrub to fix up any corruption.
>
>
>
> Also, how are you stopping Homer, Homestead-prov and Homestead? When you
> stop the service, you should stop both the service and its associated
> poll_* script (e.g. “sudo monit stop poll_homestead”), and you shouldn’t
> restart the service using “sudo service <service> restart”, as this can
> cause issues where two versions of the service start up.
>
>
>
> Ellie
>
>
>
>
>
> *From:* Lianjie Cao [mailto:[email protected]]
> *Sent:* 06 February 2015 20:32
> *To:* Eleanor Merry
> *Cc:* [email protected]
> *Subject:* Re: [Clearwater] Problems with Sprout clustering and Homestead
> failure
>
>
>
> Hi Ellie,
>
>
>
> Thanks a lot for the response!
>
> I modified Sprout and Chronos configurations. They are working correctly
> now!
>
>
>
> I checked Cassandra on Homestead node. The log does show a few errors
> during initialization. But it started successfully finally. The Cassandra,
> Homestead and Homestead-prov logs are attached.
>
>
>
> Actually, I did run into the same problem before. But after rebooting
> Homestead node a few times, it works fine. So, I didn't dig into it.
>
> Is it possible that the problem is due to some starting conflicts among
> Cassandra, Homestead and Homestead-prov?
>
>
>
> Thanks,
>
> Lianjie
>
>
>
> On Wed, Feb 4, 2015 at 3:25 PM, Eleanor Merry <
> [email protected]> wrote:
>
> Hi Lianjie,
>
> Your configuration files aren't quite right.
>
> The cluster_settings file should have the form servers=<address>,<address>
> - so in your case it would be "servers=192.168.1.21:11211,
> 192.168.1.22:11211". This file should be identical on each Sprout node
> (so the sprouts must be in same order on each node).
>
> The chronos.conf file should have one localhost entry, which is set to the
> IP address of the local node, and multiple node entries, which are set to
> the IP addresses of each node in the cluster. In your case, this would be
> (on sprout 1):
>
> [cluster]
> localhost = 192.168.1.21
> node = 192.168.1.21
> node = 192.168.1.22
>
> The order of the nodes must be the same on each node - so the file on
> sprout 2 should be:
>
> [cluster]
> localhost = 192.168.1.22
> node = 192.168.1.21
> node = 192.168.1.22
>
> Can you make these changes to the config files, and then reload Sprout and
> Chronos (sudo service <service> reload)?
>
> In the logs below, Homestead has stopped because it couldn't contact
> cassandra:
>
> 04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught
> TTransportException: connect() failed: Connection refused
> 04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache
> - rc 3
> 04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache
>
> Can you check whether Cassandra is running reliably on the Homestead node?
> Does /var/monit/monit.log show that monit is restarting it, and are there
> any logs in /var/log/cassandra?
>
> Ellie
>
> -----Original Message-----
> From: [email protected] [mailto:
> [email protected]] On Behalf Of Lianjie Cao
> Sent: 04 February 2015 19:37
> To: [email protected]
> Subject: [Clearwater] Problems with Sprout clustering and Homestead failure
>
> Hi,
>
> We recently built a Clearwater deployment with one Bono node, two Sprout
> nodes, one Homestead node, one Homer node and one Ralf node. Howerver, we
> ran into some problems related to Homestead start failure and Sprout
> clustering.
>
> *Sprout clustering:*
>
> The manual installation instruction shows for the latest version Sprout
> clustering is done by Chronos. To add or remove a Sprout node,
> /etc/chronos/chronos.conf needs to modified correspondingly.
> However, we found that when we don't have chronos.conf file, the two
> Sprout nodes seems working fine by adding IPs of the two Sprout nodes to
> /etc/clearwater/cluster_settings.
>
> [sprout]cw@sprout-2:~$ cat /etc/clearwater/cluster_settings
> servers=192.168.1.21:11211
> servers=192.168.1.22:11211
>
> But, if we do add /etc/chronos/chronos.conf with the information of two
> Sprout nodes as below, Chronos failed and no new log files found under
> /var/log/chronos.
>
> [sprout]cw@sprout-1:/var/log/chronos$ cat /etc/chronos/chronos.conf
> [http] bind-address = 0.0.0.0 bind-port = 7253
>
> [logging]
> folder = /var/log/chronos
> level = 5
>
> [cluster]
> localhost = 192.168.1.21
> node = localhost
>
> sprout-2 = 192.168.1.22
> node = sprout-2
>
> [alarms]
> enabled = true
>
>
> [sprout]cw@sprout-1:~$ sudo monit status The Monit daemon 5.8.1 uptime: 0m
>
> Program 'poll_sprout'
>   status                            Status ok
>   monitoring status                 Monitored
>   last started                      Wed, 04 Feb 2015 11:20:36
>   last exit value                   0
>   data collected                    Wed, 04 Feb 2015 11:20:36
>
> Process 'sprout'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1157
>   parent pid                        1
>   uid                               999
>   effective uid                     999
>   gid                               999
>   uptime                            1m
>   children                          0
>   memory kilobytes                  42412
>   memory kilobytes total            42412
>   memory percent                    1.0%
>   memory percent total              1.0%
>   cpu percent                       0.4%
>   cpu percent total                 0.4%
>   data collected                    Wed, 04 Feb 2015 11:20:36
>
> Program 'poll_memcached'
>   status                            Status ok
>   monitoring status                 Monitored
>   last started                      Wed, 04 Feb 2015 11:20:36
>   last exit value                   0
>   data collected                    Wed, 04 Feb 2015 11:20:36
>
> Process 'memcached'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1092
>   parent pid                        1
>   uid                               108
>   effective uid                     108
>   gid                               114
>   uptime                            1m
>   children                          0
>   memory kilobytes                  1180
>   memory kilobytes total            1180
>   memory percent                    0.0%
>   memory percent total              0.0%
>   cpu percent                       0.0%
>   cpu percent total                 0.0%
>   data collected                    Wed, 04 Feb 2015 11:20:36
>
> Process 'clearwater_diags_monitor'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1072
>   parent pid                        1
>   uid                               0
>   effective uid                     0
>   gid                               0
>   uptime                            1m
>   children                          1
>   memory kilobytes                  1796
>   memory kilobytes total            2172
>   memory percent                    0.0%
>   memory percent total              0.0%
>   cpu percent                       0.0%
>   cpu percent total                 0.0%
>   data collected                    Wed, 04 Feb 2015 11:20:36
>
> Process 'chronos'
>   status                            Execution failed
>   monitoring status                 Monitored
>   data collected                    Wed, 04 Feb 2015 11:20:26
>
> System 'sprout-1'
>   status                            Running
>   monitoring status                 Monitored
>   load average                      [0.20] [0.09] [0.04]
>   cpu                               6.8%us 1.1%sy 0.0%wa
>   memory usage                      116944 kB [2.8%]
>   swap usage                        0 kB [0.0%]
>   data collected                    Wed, 04 Feb 2015 11:20:26
>
>
> Is it because we are not using Chronos in the right way or there are other
> settings we need to do?
>
> *Homestead Failure:*
>
>
> When we use SIPp to perform user registration tests, we receive “403
> Forbidden" response and we observed error on both sprout nodes.
>
> [sprout]cw@sprout-1:~$ cat /var/log/sprout/sprout_current.txt
> 04-02-2015 18:54:50.884 UTC Warning acr.cpp:627: Failed to send Ralf ACR
> message (0x7fce241cd780), rc = 400
> 04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:573:
>
> http://hs.hp-clearwater.com:8888/impi/6500000008%40hp-clearwater.com/av?impu=sip%3A6500000008%40hp-clearwater.com
> failed at server 192.168.1.31 : Timeout was reached (28) : fatal
> 04-02-2015 18:54:51.083 UTC Error httpconnection.cpp:688: cURL failure
> with cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
> 04-02-2015 18:54:51.083 UTC Error hssconnection.cpp:145: Failed to get
> Authentication Vector for [email protected]
> 04-02-2015 18:54:51.086 UTC Error httpconnection.cpp:688: cURL failure
> with cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
> 04-02-2015 18:54:51.086 UTC Warning acr.cpp:627: Failed to send Ralf ACR
> message (0x14322c0), rc = 400
> 04-02-2015 18:54:51.282 UTC Error httpconnection.cpp:573:
>
> http://hs.hp-clearwater.com:8888/impi/6500000009%40hp-clearwater.com/av?impu=sip%3A6500000009%40hp-clearwater.com
> failed at server 192.168.1.31 : Timeout was reached (28) : fatal
> 04-02-2015 18:54:51.283 UTC Error httpconnection.cpp:688: cURL failure
> with cURL error code 28 (see man 3 libcurl-errors) and HTTP error code 500
> 04-02-2015 18:54:51.283 UTC Error hssconnection.cpp:145: Failed to get
> Authentication Vector for [email protected]
> 04-02-2015 18:54:51.286 UTC Error httpconnection.cpp:688: cURL failure
> with cURL error code 0 (see man 3 libcurl-errors) and HTTP error code 400
> 04-02-2015 18:54:51.286 UTC Warning acr.cpp:627: Failed to send Ralf ACR
> message (0x7fce1c1fdef0), rc = 400 ....
>
>
> It seems like Homestead is unreachable.
> Then on Homestead node, if we check status using monit:
>
> [homestead]cw@homestead-1:~$ sudo monit status The Monit daemon 5.8.1
> uptime: 15m
>
> Process 'nginx'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1044
>   parent pid                        1
>   uid                               0
>   effective uid                     0
>   gid                               0
>   uptime                            15m
>   children                          4
>   memory kilobytes                  1240
>   memory kilobytes total            8448
>   memory percent                    0.0%
>   memory percent total              0.2%
>   cpu percent                       0.0%
>   cpu percent total                 0.0%
>   port response time                0.000s to 127.0.0.1:80/ping [HTTP via
> TCP]
>   data collected                    Wed, 04 Feb 2015 10:58:02
>
> Program 'poll_homestead'
>   status                            Status failed
>   monitoring status                 Monitored
>   last started                      Wed, 04 Feb 2015 10:58:02
>   last exit value                   1
>   data collected                    Wed, 04 Feb 2015 10:58:02
>
> Process 'homestead'
>   status                            Does not exist
>   monitoring status                 Monitored
>   data collected                    Wed, 04 Feb 2015 10:58:02
>
> Program 'poll_homestead-prov'
>   status                            Status ok
>   monitoring status                 Monitored
>   last started                      Wed, 04 Feb 2015 10:58:02
>   last exit value                   0
>   data collected                    Wed, 04 Feb 2015 10:58:02
>
> Process 'homestead-prov'
>   status                            Execution failed
>   monitoring status                 Monitored
>   data collected                    Wed, 04 Feb 2015 10:58:32
>
> Process 'clearwater_diags_monitor'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1027
>   parent pid                        1
>   uid                               0
>   effective uid                     0
>   gid                               0
>   uptime                            16m
>   children                          1
>   memory kilobytes                  1664
>   memory kilobytes total            2040
>   memory percent                    0.0%
>   memory percent total              0.0%
>   cpu percent                       0.0%
>   cpu percent total                 0.0%
>   data collected                    Wed, 04 Feb 2015 10:58:32
>
> Program 'poll_cassandra_ring'
>   status                            Status ok
>   monitoring status                 Monitored
>   last started                      Wed, 04 Feb 2015 10:58:32
>   last exit value                   0
>   data collected                    Wed, 04 Feb 2015 10:58:32
>
> Process 'cassandra'
>   status                            Running
>   monitoring status                 Monitored
>   pid                               1280
>   parent pid                        1277
>   uid                               106
>   effective uid                     106
>   gid                               113
>   uptime                            16m
>   children                          0
>   memory kilobytes                  1388648
>   memory kilobytes total            1388648
>   memory percent                    34.3%
>   memory percent total              34.3%
>   cpu percent                       0.4%
>   cpu percent total                 0.4%
>   data collected                    Wed, 04 Feb 2015 10:58:32
>
> System 'homestead-1'
>   status                            Running
>   monitoring status                 Monitored
>   load average                      [0.00] [0.04] [0.05]
>   cpu                               3.0%us 0.8%sy 0.0%wa
>   memory usage                      1505324 kB [37.1%]
>   swap usage                        0 kB [0.0%]
>   data collected                    Wed, 04 Feb 2015 10:58:32
>
>
> And log file shows:
>
> [homestead]cw@homestead-1:~$ cat
> /var/log/homestead-prov/homestead-prov-err.log
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
>     "__main__", fname, loader, pkg_name)
>   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
>     exec code in run_globals
>   File
>
> "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py",
> line 156, in <module>
>     standalone()
>   File
>
> "/usr/share/clearwater/homestead/env/lib/python2.7/site-packages/crest-0.1-py2.7.egg/metaswitch/crest/main.py",
> line 119, in standalone
>     reactor.listenUNIX(unix_sock_name, application)
>   File
>
> "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/posixbase.py",
> line 413, in listenUNIX
>     p.startListening()
>   File
>
> "/usr/share/clearwater/homestead/env/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg/twisted/internet/unix.py",
> line 293, in startListening
>     raise CannotListenError, (None, self.port, le)
> twisted.internet.error.CannotListenError: Couldn't listen on
> any:/tmp/.homestead-prov-sock-0: [Errno 98] Address already in use.
> ......
>
> [homestead]cw@homestead-1:~$ cat
> /var/log/homestead-prov/homestead-prov-0.log
> 2015-02-04 18:42:23,476 UTC INFO main:118 Going to listen for HTTP on UNIX
> socket /tmp/.homestead-prov-sock-0
> 2015-02-04 18:42:24,087 UTC INFO main:118 Going to listen for HTTP on UNIX
> socket /tmp/.homestead-prov-sock-0
> 2015-02-04 18:42:35,826 UTC INFO main:118 Going to listen for HTTP on UNIX
> socket /tmp/.homestead-prov-sock-0
> 2015-02-04 18:43:16,205 UTC INFO main:118 Going to listen for HTTP on UNIX
> socket /tmp/.homestead-prov-sock-0 ......
>
> homestead_20150204T180000Z.txt  homestead_current.txt
> [homestead]cw@homestead-1:~$ cat /var/log/homestead/homestead_current.txt
> 04-02-2015 18:42:19.586 UTC Status main.cpp:468: Log level set to 2
> 04-02-2015 18:42:19.602 UTC Status main.cpp:489: Access logging enabled to
> /var/log/homestead
> 04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:93: Constructing
> LoadMonitor
> 04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:94:    Target latency
> (usecs)   : 100000
> 04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:95:    Max bucket size
>          : 20
> 04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:96:    Initial token
> fill rate/s: 10.000000
> 04-02-2015 18:42:19.614 UTC Status load_monitor.cpp:97:    Min token fill
> rate/s    : 10.000000
> 04-02-2015 18:42:19.614 UTC Status dnscachedresolver.cpp:90: Creating
> Cached Resolver using server 127.0.0.1
> 04-02-2015 18:42:19.614 UTC Status httpresolver.cpp:50: Created HTTP
> resolver
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:145: Configuring
> store
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:146:   Hostname:
>  localhost
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:147:   Port:
>  9160
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:148:   Threads:   10
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:149:   Max Queue: 0
> 04-02-2015 18:42:19.614 UTC Status cassandra_store.cpp:199: Starting store
> 04-02-2015 18:42:19.616 UTC Error cassandra_store.cpp:207: Cache caught
> TTransportException: connect() failed: Connection refused
> 04-02-2015 18:42:19.616 UTC Error main.cpp:550: Failed to initialize cache
> - rc 3
> 04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:185: Stopping cache
> 04-02-2015 18:42:19.616 UTC Status cassandra_store.cpp:226: Waiting for
> cache to stop ......
>
> And the port usage is:
>
> [homestead]cw@homestead-1:~$ sudo netstat -tulpn Active Internet
> connections (only servers)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
>     PID/Program name
> tcp        0      0 127.0.0.1:9042          0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN
>      952/dnsmasq
> tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
>      827/sshd
> tcp        0      0 127.0.0.1:7000          0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp        0      0 127.0.0.1:2812          0.0.0.0:*               LISTEN
>      1036/monit
> tcp        0      0 0.0.0.0:37791           0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp        0      0 0.0.0.0:7199            0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp        0      0 0.0.0.0:53313           0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp        0      0 127.0.0.1:9160          0.0.0.0:*               LISTEN
>      1280/jsvc.exec
> tcp6       0      0 :::53                   :::*                    LISTEN
>      952/dnsmasq
> tcp6       0      0 :::22                   :::*                    LISTEN
>      827/sshd
> tcp6       0      0 :::8889                 :::*                    LISTEN
>      1044/nginx
> tcp6       0      0 :::80                   :::*                    LISTEN
>      1044/nginx
> udp        0      0 0.0.0.0:13344           0.0.0.0:*
>     952/dnsmasq
> udp        0      0 0.0.0.0:48567           0.0.0.0:*
>     952/dnsmasq
> udp        0      0 0.0.0.0:53              0.0.0.0:*
>     952/dnsmasq
> udp        0      0 0.0.0.0:41016           0.0.0.0:*
>     952/dnsmasq
> udp        0      0 0.0.0.0:68              0.0.0.0:*
>     634/dhclient3
> udp        0      0 192.168.1.31:123        0.0.0.0:*
>     791/ntpd
> udp        0      0 127.0.0.1:123           0.0.0.0:*
>     791/ntpd
> udp        0      0 0.0.0.0:123             0.0.0.0:*
>     791/ntpd
> udp6       0      0 :::53                   :::*
>      952/dnsmasq
> udp6       0      0 fe80::f816:3eff:fe7:123 :::*
>      791/ntpd
> udp6       0      0 ::1:123                 :::*
>      791/ntpd
> udp6       0      0 :::123                  :::*
>      791/ntpd
>
>
>
> So, how should we fix the problems with Homestead and Homestead-prov?
>
> Best regards,
> Lianjie
>
> _______________________________________________
> Clearwater mailing list
> [email protected]
> http://lists.projectclearwater.org/listinfo/clearwater
>
>
>
>
>
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/listinfo/clearwater

Re: [Clearwater] Problems with Sprout clustering and Homestead failure

Reply via email to