Hi Matthew,

I've had a look into your issues deploying under Kubernetes, and I think I've 
narrowed down the search a good amount. Details below.
First though, I'll take a run through the smaller questions you've raised below, 
as I think we can knock some of them off fairly quickly.

Other remaining questions:

  *   Do I need dnsmasq? The dns.json file is an empty list of 'hosts'. Can I 
just paste that in?
[AJL] You shouldn't need it in the containers. In our VMs we found that some 
operations were fairly heavy on DNS queries, so we provided dnsmasq as a 
caching layer. The need for this has decreased quite a bit now, and as we move 
forward in the microservice space it makes less sense to have it around.

     *   {  "hostnames": [ ]}
     *   Why won't dnsmasq start? ("dnsmasq: failed to create listening socket 
for 127.0.0.1: Address already in use") Does that even matter?
[AJL] This is likely down to how the container networking works, but it won't 
cause any problems when running under Kubernetes.

     *   During image build there's an error "dnsmasq: setting capabilities 
failed: Operation not permitted". What's the source of that error?
[AJL] This is a benign error in this case, so we haven't dug down into it. It 
shouldn't impact your deployment at all.

  *   There are many other red error messages during build. Does that matter?
[AJL] No, these don't matter. I see a number of error messages during the 
build too; most of them come from dependencies of our packages raising errors 
when installed in this environment.

  *   There are also a bunch of services (mostly clearwater ones) which are 
halted, but are running in the AIO build. Is that an issue?
[AJL] The AIO has a number of different requirements that we don't have in the 
container environment, and vice versa, so the fact that there is a different 
set doesn't mean you have issues there.
As a couple of examples of the differences you pulled out below:

  *   The auto-config packages are used to enable specific setup in different 
environments, e.g. setting up required config files that other deployments do 
not rely on. In the Docker case, we need to read environment variables and pass 
them through to the system configuration files.
  *   clearwater-etcd is our interface into an etcd cluster, and is used for 
sharing common configuration around a cluster. On the AIO it simply increases 
complexity, as we have no need for it there, but in docker/kubernetes we want 
the services to share configuration through an underlying etcd cluster.
I don't believe any of the differences you pulled out below are causing the 
issues you're seeing, but it's a nice analysis :)
The 'missing dns.json config file' error is also currently benign in the 
containers; it's raised because of expectations we have when running in VMs.
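If you did want to silence that error rather than just ignore it, baking an 
empty dns.json into the image should be enough. A minimal sketch (the path is 
taken from the error in your homestead logs; adjust to fit your Dockerfile):

```
# Sketch only: create an empty dns.json so homestead stops complaining.
RUN mkdir -p /etc/clearwater && \
    echo '{ "hostnames": [ ] }' > /etc/clearwater/dns.json
```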


So, on to the main issue:
In getting your deployment up and running, the key issue is definitely going to 
be the 'Failed to initialize the Cassandra store' one. This indicates that the 
Homestead and Homestead-prov services have been unable to reach the Cassandra 
cluster, and without that they won't be able to start up.
As you've already found, getting logs out and doing more detailed debugging in 
containers presents its own set of problems. To temporarily prevent Kubernetes 
from automatically restarting pods, you can (as you may have already) remove 
the liveness checking sections from the generated deployment files. This isn't 
ideal, but it should allow you time to do some more detailed debugging. You may 
also want to reduce the number of replicas of the pods down to one for now, to 
make it easier to follow traffic. It should just be a case of changing the 
.depl files and redeploying.
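For reference, a sketch of what those edits might look like in one of the 
generated deployment files (the exact structure and names in yours may differ; 
treat these as assumptions):

```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: homestead
spec:
  replicas: 1          # dropped to one replica to make traffic easier to follow
  template:
    spec:
      containers:
      - name: homestead
        image: homestead:latest
        # livenessProbe:      <- delete the whole livenessProbe block so
        #   ...                  kubernetes stops restarting the pod mid-debug
```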

The log indicates you're getting error code 3 back from attempting to set up 
the Cassandra store, which is a 'connection error', as opposed to what you 
would get if the DNS name was simply unresolvable (I have tested that on our 
internal setup, and the logs would clearly state it was unresolvable). This 
definitely leads me to suspect network connectivity issues into the Cassandra 
pods. The fact that you can ping between the two, however, suggests that it's 
more complex than simply being unable to send any traffic between them.

To test this, I tweaked the Cassandra configuration (running just a single pod 
for simplicity) so that it was listening on a port other than 9160, which is 
the port that both homestead and homestead-prov attempt to connect to. With 
this setup in place, I saw both of these pods repeatedly failing with the same 
error code as yours.
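For reference, the setting I tweaked to force that failure is the Thrift 
client port in cassandra.yaml, shown here at its default (so it's also worth 
checking it hasn't been overridden in your Cassandra pods):

```
# /etc/cassandra/cassandra.yaml (excerpt)
# Thrift RPC port - the port homestead and homestead-prov connect on.
rpc_port: 9160
```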
Because of this I would suggest the following next steps in tracking down the 
problem:

  *   Double check that your networking configuration allows traffic through to 
the cassandra pods on port 9160. The homestead and homestead prov processes 
will require this.
  *   Connect to the Cassandra pods, and verify that they are indeed listening 
for traffic on port 9160
     *   `netstat -planut` is my default for this, but to each their own
     *   /var/log/cassandra/system.log may have more information if something 
strange is going on here
  *   Manually check the connectivity between the pods. You could try:
     *   Connecting to e.g. a homestead pod, and running `nc <cassandra 
hostname> 9160`. I see this return instantly if the port is not 
open/listened on.
     *   Installing tcpdump on both sides, and seeing whether traffic arrives 
in the cassandra pods at all. This isn't very container-native, but getting a 
dump from both the cassandra and homestead sides, to see at what point the 
flow is failing, would be a decent way to work out where the issue lies.
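If nc isn't available in the pods, the same check can be done with a few lines 
of Python (a sketch; the "cassandra" hostname below is an assumption taken 
from your logs):

```
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, or unresolvable
        return False

if __name__ == "__main__":
    # Hypothetical service name, mirroring the hostname in your homestead logs.
    print(can_connect("cassandra", 9160))
```

This mirrors the `nc` behaviour: a False on port 9160 from the homestead pod, 
while ping works, would point squarely at a TCP-level block.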

Sadly, I don't have access to an AKS setup right now, so I can't investigate 
the ingress rules component at all. I suspect that there's something slightly 
unusual happening there that is leading to traffic being unable to reach port 
9160.
Hopefully this gives you a better idea of the source of the issue. If this 
doesn't show anything up, and you're seeing traffic from homestead arrive in 
the cassandra pods and go back out again, we may have to run some more testing.

Cheers,
Adam


From: Clearwater [mailto:[email protected]] On 
Behalf Of Davis, Matthew
Sent: 16 May 2018 04:35
To: Richard Whitehouse (projectclearwater.org) 
<[email protected]>; 
[email protected]
Subject: Re: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi Richard,

TL;DR: I've made some progress since I sent that email. I've found and fixed 
bugs to do with dnsmasq, but now I'm stuck.
The systemd logs for homestead still say: "Error main.cpp:1056: Failed to 
initialize the Cassandra store with error code 3."
I'm not even sure whether the dnsmasq bugs were related.

First, to answer your questions:

Yes this happens every time I deploy.

My configmap:

```
$ kubectl describe configmap env-vars
Name:         env-vars
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
ZONE:
----
2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
Events:  <none>
```

I'm deploying on Azure's Kubernetes service (AKS). I've enabled HTTP 
Application Routing 
(https://docs.microsoft.com/en-us/azure/aks/http-application-routing), 
which means that Azure manages DNS for me. The only unusual step is that Azure 
requires that I add ingresses to do so. The DNS records point to each service 
(bono, ellis, sprout etc). I can ping "cassandra" from the homestead pod.

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```

Getting the logs is a race against time, since kubernetes keeps terminating the 
pod while I'm in there.
After several attempts, I managed to get the homestead logs (same as the 
homestead-prov logs):

```
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:123: Creating Cached Resolver using servers:
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:133:     10.0.0.10
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Error static_dns_cache.cpp:89: DNS config file /etc/clearwater/dns.json missing
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status main.cpp:653: Using local impu store: astaire.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.129 UTC [7f6379c8e7c0] Status http_connection_pool.cpp:37: Connection pool will use calculated response timeout of 550ms
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:35: Configuring HTTP Connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:36:   Connection created for server sprout.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io:9888
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status main.cpp:1013: No HSS configured - using Homestead-prov
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:266: Configuring store connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:267:   Hostname:  cassandra.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:268:   Port:      9160
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:296: Configuring store worker pool
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:297:   Threads:   10
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:298:   Max Queue: 0
15-05-2018 07:05:40.131 UTC [7f6366ffd700] Status alarm.cpp:244: Reraising all alarms with a known state
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Error main.cpp:1056: Failed to initialize the Cassandra store with error code 3.
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Status main.cpp:1057: Homestead is shutting down
```

The full logs are just this chunk repeating over and over until everything 
shuts down.

I think the crucial error message is "DNS config file /etc/clearwater/dns.json 
missing".

In the all-in-one installation, that file is just:

```
{
  "hostnames": [
  ]
}
```

Is it ok to manually insert that using the Dockerfile? Or should there be other 
entries in there for a non-AIO installation?

dns.json is created by the dnsmasq service.

During the container build, there is a red error message saying "dnsmasq: 
setting capabilities failed: Operation not permitted"
(There are a lot of red error messages during build. Is that an issue?)

I noticed that dnsmasq was not running in homestead docker, yet it is running 
in my working all-in-one installation. (There are also a bunch of other 
services which are halted. Is that an issue?)

I tried starting dnsmasq manually, but it wouldn't start. I looked online and 
the solution is to add user=root to the file /etc/dnsmasq.conf.

I tried modifying the Dockerfile to manually append that string to the file.  
(I'm assuming that the order of lines in /etc/dnsmasq.conf does not matter)

Then dnsmasq still doesn't come up. I get an error about how the port it wants 
is already taken.

So then I tried modifying the Dockerfile to paste in dns.json from the AIO 
installation. (The one with an empty list of hosts)

Now the missing file error is gone, but I still get the other error:

"Error main.cpp:1056: Failed to initialize the Cassandra store with error code 
3."

How can I debug this?

All this time I thought that error was caused by dns issues. Is it because 
dnsmasq still won't start, so dns.json is missing hosts? Or is that unrelated?
"ping cassandra" works from the homestead pod.
Other Remaining questions:

  *   Do I need dnsmasq? The dns.json file is an empty list of 'hosts'. Can I 
just paste that in?
     *   {  "hostnames": [ ]}
     *   Why won't dnsmasq start? ("dnsmasq: failed to create listening socket 
for 127.0.0.1: Address already in use") Does that even matter?
     *   During image build there's an error "dnsmasq: setting capabilities 
failed: Operation not permitted". What's the source of that error?
  *   There are many other red error messages during build. Does that matter?
  *   There are also a bunch of services (mostly clearwater ones) which are 
halted, but are running in the AIO build. Is that an issue?
     *   Services running in AIO:
        *   [ ? ]  clearwater-auto-config-generic
        *   [ + ]  clearwater-cluster-manager
        *   [ + ]  clearwater-config-manager
        *   [ + ]  clearwater-diags-monitor
        *   [ - ]  clearwater-etcd
        *   [ + ]  clearwater-infrastructure
        *   [ ? ]  clearwater-memcached
        *   [ + ]  clearwater-queue-manager
     *   Services running (or not) in docker homestead:
        *   [ ? ]  clearwater-auto-config-docker
        *   [ + ]  clearwater-cluster-manager
        *   [ - ]  clearwater-config-manager
        *   [ - ]  clearwater-diags-monitor
        *   [ + ]  clearwater-etcd
        *   [ + ]  clearwater-infrastructure
        *   [ - ]  clearwater-queue-manager


Thanks,

Matthew Davis
Telstra | CTO | Cloud SDN NFV

From: Richard Whitehouse (projectclearwater.org) 
[mailto:[email protected]]
Sent: Wednesday, 16 May 2018 3:24 AM
To: [email protected]; Davis, Matthew 
<[email protected]>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Matthew,

Sorry to hear you are having problems deploying Clearwater on Docker.

I think that the socket factory error is actually fairly benign.

Does this happen every time you create your deployment?

Can you provide the config map that you are providing when deploying on 
Kubernetes?

Can you connect into one of the pods that is failing to run homestead-prov and 
grab the logs. The relevant ones will be under /var/log/homestead-prov inside 
the container.

We cannot reproduce your issue on our kubernetes setup - can you provide 
details of your kubernetes setup - what version of k8s and what networking 
configuration you are using?

From the diagnostics we have at the moment, the most likely scenario is that 
the Homestead and Homestead-prov containers are unable to contact Cassandra - 
if that's the case they will continually restart to attempt to resolve the 
problem.



Richard


From: Clearwater [mailto:[email protected]] On 
Behalf Of Davis, Matthew
Sent: 14 May 2018 07:05
To: [email protected]
Subject: Re: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi everyone,
I've fixed the issue I had with homestead-prov and submitted the patch as this 
pull request: https://github.com/Metaswitch/clearwater-docker/pull/89

However, I've now run into another issue: homestead-prov is giving me the same 
error as homestead. The error message is not verbose. Does anyone know how to 
make it more verbose? I have no idea what's happening.


```
2018-05-14 05:47:41,369 INFO success: snmpd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:41,370 INFO success: clearwater-infrastructure entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-sig entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-mgmt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,477 INFO exited: snmpd (exit status 0; expected)
2018-05-14 05:47:42,478 CRIT reaped unknown pid 61)
2018-05-14 05:47:44,572 CRIT reaped unknown pid 65)
2018-05-14 05:47:44,576 CRIT reaped unknown pid 66)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 235)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 236)
2018-05-14 05:47:45,902 CRIT reaped unknown pid 257)
2018-05-14 05:47:48,153 CRIT reaped unknown pid 294)
2018-05-14 05:47:48,226 INFO spawned: 'homestead-prov' with pid 297
2018-05-14 05:47:48,237 INFO spawned: 'nginx' with pid 298
2018-05-14 05:47:48,747 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:49,750 INFO spawned: 'homestead-prov' with pid 324
2018-05-14 05:47:49,750 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:50,089 INFO exited: clearwater-infrastructure (exit status 0; expected)
2018-05-14 05:47:50,335 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:52,339 INFO spawned: 'homestead-prov' with pid 335
2018-05-14 05:47:53,324 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:56,330 INFO spawned: 'homestead-prov' with pid 346
2018-05-14 05:47:56,902 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:57,904 INFO gave up: homestead-prov entered FATAL state, too many start retries too quickly
```

Regards,
Matt Davis
Telstra | Graduate Engineer
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
