Hi Richard,

TL;DR: I've made some progress since I sent that email. I've found and fixed 
some dnsmasq-related bugs, but now I'm stuck.
The systemd logs for homestead still say: "Error main.cpp:1056: Failed to 
initialize the Cassandra store with error code 3."
I'm not even sure whether the dnsmasq bugs were related.

First, to answer your questions:

Yes, this happens every time I deploy.

My configmap:

```
$ kubectl describe configmap env-vars
Name:         env-vars
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
ZONE:
----
2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
Events:  <none>
```

I'm deploying on Azure Kubernetes Service (AKS). I've enabled HTTP Application 
Routing (https://docs.microsoft.com/en-us/azure/aks/http-application-routing), 
which means that Azure manages DNS for me. The only unusual step is that Azure 
requires me to add ingresses for this to work. The DNS records point to each 
service (bono, ellis, sprout etc.). I can ping "cassandra" from the homestead pod.
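
For concreteness, this is roughly how I've been checking that wiring (the pod 
name is illustrative):

```
# List the ingresses that HTTP Application Routing turns into DNS records.
kubectl get ingress

# Check resolution from inside the homestead pod (pod name is illustrative).
kubectl exec -it homestead-xxxx -- ping -c 3 cassandra
```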

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Getting the logs is a race against time, since Kubernetes keeps terminating the 
pod while I'm in there.
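
These are the kinds of commands I've been using to snatch them before the pod 
dies (pod name is illustrative):

```
# Logs from the current container, and from the previous (crashed) instance.
kubectl logs homestead-xxxx
kubectl logs --previous homestead-xxxx

# Copy the on-disk log directory out before kubernetes restarts the pod.
kubectl cp homestead-xxxx:/var/log/homestead ./homestead-logs
```
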
After several attempts, I managed to get the homestead logs (which are the same 
as the homestead-prov logs):

```
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:123: Creating Cached Resolver using servers:
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:133:     10.0.0.10
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Error static_dns_cache.cpp:89: DNS config file /etc/clearwater/dns.json missing
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status main.cpp:653: Using local impu store: astaire.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.129 UTC [7f6379c8e7c0] Status http_connection_pool.cpp:37: Connection pool will use calculated response timeout of 550ms
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:35: Configuring HTTP Connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:36:   Connection created for server sprout.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io:9888
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status main.cpp:1013: No HSS configured - using Homestead-prov
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:266: Configuring store connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:267:   Hostname:  cassandra.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:268:   Port:      9160
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:296: Configuring store worker pool
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:297:   Threads:   10
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:298:   Max Queue: 0
15-05-2018 07:05:40.131 UTC [7f6366ffd700] Status alarm.cpp:244: Reraising all alarms with a known state
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Error main.cpp:1056: Failed to initialize the Cassandra store with error code 3.
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Status main.cpp:1057: Homestead is shutting down
```

The full logs are just this chunk repeating over and over until everything 
shuts down.

I think the crucial error message is "DNS config file /etc/clearwater/dns.json 
missing".

In the all-in-one installation, that file is just:

{
  "hostnames": [
  ]
}

Is it OK to manually insert that file via the Dockerfile? Or should there be 
other entries in it for a non-AIO installation?
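
If it is OK, the change I have in mind is just a couple of Dockerfile lines 
like these (assuming an empty hostnames list is valid outside the AIO image):

```
# Hypothetical Dockerfile addition: drop in an empty static DNS config,
# assuming the empty "hostnames" list is acceptable for a non-AIO install.
RUN mkdir -p /etc/clearwater && \
    echo '{ "hostnames": [] }' > /etc/clearwater/dns.json
```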

dns.json is created by the dnsmasq service.

During the container build, there is a red error message saying "dnsmasq: 
setting capabilities failed: Operation not permitted".
(There are a lot of red error messages during the build. Is that an issue?)

I noticed that dnsmasq was not running in the homestead Docker container, yet 
it is running in my working all-in-one installation. (There are also a bunch 
of other services which are halted. Is that an issue?)

I tried starting dnsmasq manually, but it wouldn't start. The fix I found 
online is to add user=root to the file /etc/dnsmasq.conf.

I tried modifying the Dockerfile to append that line to the file. (I'm 
assuming that the order of lines in /etc/dnsmasq.conf does not matter.)
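
Concretely, the line I appended was along these lines:

```
# Append the workaround suggested online. I'm assuming dnsmasq options are
# position-independent, so appending at the end of the file is fine.
RUN echo 'user=root' >> /etc/dnsmasq.conf
```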

Even so, dnsmasq still doesn't come up: I get an error saying the port it 
wants is already taken.
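
I haven't yet identified what's holding the port. From inside the container, 
something like this should show it (assuming net-tools or iproute2 is 
installed):

```
# See which process is already bound to the DNS port (53) that dnsmasq wants.
netstat -tulpn | grep ':53'
# Or, if netstat isn't available:
ss -tulpn | grep ':53'
```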

So then I tried modifying the Dockerfile to paste in dns.json from the AIO 
installation (the one with the empty list of hostnames).

Now the missing file error is gone, but I still get the other error:

"Error main.cpp:1056: Failed to initialize the Cassandra store with error code 
3."

How can I debug this?

All this time I thought that error was caused by DNS issues. Is it because 
dnsmasq still won't start, so dns.json has no hostname entries? Or is that 
unrelated? "ping cassandra" works from the homestead pod.

Other remaining questions:

*  Do I need dnsmasq? The dns.json file is just an empty list of hostnames. 
   Can I just paste that in?

   o  { "hostnames": [] }

   o  Why won't dnsmasq start? ("dnsmasq: failed to create listening socket 
      for 127.0.0.1: Address already in use") Does that even matter?

   o  During image build there's an error "dnsmasq: setting capabilities 
      failed: Operation not permitted". What's the source of that error?

*  There are many other red error messages during build. Does that matter?

*  There are also a bunch of services (mostly Clearwater ones) which are 
   halted in the docker homestead container but running in the AIO build. Is 
   that an issue? (Status lists below; the command that produces them is shown 
   after the list.)

   o  Services in the AIO installation:

      [ ? ]  clearwater-auto-config-generic
      [ + ]  clearwater-cluster-manager
      [ + ]  clearwater-config-manager
      [ + ]  clearwater-diags-monitor
      [ - ]  clearwater-etcd
      [ + ]  clearwater-infrastructure
      [ ? ]  clearwater-memcached
      [ + ]  clearwater-queue-manager

   o  Services running (or not) in the docker homestead container:

      [ ? ]  clearwater-auto-config-docker
      [ + ]  clearwater-cluster-manager
      [ - ]  clearwater-config-manager
      [ - ]  clearwater-diags-monitor
      [ + ]  clearwater-etcd
      [ + ]  clearwater-infrastructure
      [ - ]  clearwater-queue-manager
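
(For reference, the status lists above are in the format produced by the 
sysvinit status listing below: [ + ] running, [ - ] stopped, [ ? ] status 
unknown.)

```
# Command used to list the service states in each environment.
service --status-all
```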


Thanks,

Matthew Davis
Telstra | CTO | Cloud SDN NFV

From: Richard Whitehouse (projectclearwater.org) 
[mailto:[email protected]]
Sent: Wednesday, 16 May 2018 3:24 AM
To: [email protected]; Davis, Matthew 
<[email protected]>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Matthew,

Sorry to hear you are having problems deploying Clearwater on Docker.

I think that the socket factory error is actually fairly benign.

Does this happen every time you create your deployment?

Can you provide the config map that you are using when deploying on 
Kubernetes?

Can you connect into one of the pods that is failing to run homestead-prov and 
grab the logs? The relevant ones will be under /var/log/homestead-prov inside 
the container.

We cannot reproduce your issue on our Kubernetes setup. Can you provide 
details of it: what version of k8s and what networking configuration are you 
using?

From the diagnostics we have at the moment, the most likely scenario is that 
the Homestead and Homestead Prov containers are unable to contact Cassandra - 
if that's the case they will continually restart to attempt to resolve the 
problem.



Richard


From: Clearwater [mailto:[email protected]] On 
Behalf Of Davis, Matthew
Sent: 14 May 2018 07:05
To: [email protected]
Subject: Re: [Project Clearwater] Issues with clearwater-docker homestead and 
homestead-prov under Kubernetes

Hi everyone,
I've fixed the issue I had with homestead-prov and submitted the patch as this 
pull request: https://github.com/Metaswitch/clearwater-docker/pull/89

However, now I've run into another issue: homestead-prov is giving me the same 
error as homestead. The error message is not verbose. Does anyone know how to 
make it more verbose? I have no idea what's happening.


```
2018-05-14 05:47:41,369 INFO success: snmpd entered RUNNING state, process has 
stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:41,370 INFO success: clearwater-infrastructure entered RUNNING 
state, process has stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-sig entered RUNNING state, 
process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-mgmt entered RUNNING 
state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,477 INFO exited: snmpd (exit status 0; expected)
2018-05-14 05:47:42,478 CRIT reaped unknown pid 61)
2018-05-14 05:47:44,572 CRIT reaped unknown pid 65)
2018-05-14 05:47:44,576 CRIT reaped unknown pid 66)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 235)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 236)
2018-05-14 05:47:45,902 CRIT reaped unknown pid 257)
2018-05-14 05:47:48,153 CRIT reaped unknown pid 294)
2018-05-14 05:47:48,226 INFO spawned: 'homestead-prov' with pid 297
2018-05-14 05:47:48,237 INFO spawned: 'nginx' with pid 298
2018-05-14 05:47:48,747 INFO exited: homestead-prov (exit status 0; not 
expected)
2018-05-14 05:47:49,750 INFO spawned: 'homestead-prov' with pid 324
2018-05-14 05:47:49,750 INFO success: nginx entered RUNNING state, process has 
stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:50,089 INFO exited: clearwater-infrastructure (exit status 0; 
expected)
2018-05-14 05:47:50,335 INFO exited: homestead-prov (exit status 0; not 
expected)
2018-05-14 05:47:52,339 INFO spawned: 'homestead-prov' with pid 335
2018-05-14 05:47:53,324 INFO exited: homestead-prov (exit status 0; not 
expected)
2018-05-14 05:47:56,330 INFO spawned: 'homestead-prov' with pid 346
2018-05-14 05:47:56,902 INFO exited: homestead-prov (exit status 0; not 
expected)
2018-05-14 05:47:57,904 INFO gave up: homestead-prov entered FATAL state, too 
many start retries too quickly
```

Regards,
Matt Davis
Telstra | Graduate Engineer
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
