Hi Richard,

TL;DR: I've made some progress since I sent that email. I've found and fixed some dnsmasq-related bugs, but now I'm stuck. The systemd logs for homestead still say: "Error main.cpp:1056: Failed to initialize the Cassandra store with error code 3." I'm not even sure whether the dnsmasq bugs were related.
First, to answer your questions: yes, this happens every time I deploy.

My configmap:

```
$ kubectl describe configmap env-vars
Name:         env-vars
Namespace:    default
Labels:       <none>
Annotations:  <none>

Data
====
ZONE:
----
2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io

Events:  <none>
```

I'm deploying on Azure's Kubernetes Service (AKS). I've enabled HTTP Application Routing <https://docs.microsoft.com/en-us/azure/aks/http-application-routing>, which means that Azure manages DNS for me. The only unusual step is that Azure requires me to add ingresses to make that work. The DNS records point to each service (bono, ellis, sprout etc.). I can ping "cassandra" from the homestead pod.

```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
```

Getting the logs is a race against time, since Kubernetes keeps terminating the pod while I'm in there (see the kubectl note after the log excerpt). After several attempts, I managed to get the homestead logs (same as the homestead-prov logs):

```
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:123: Creating Cached Resolver using servers:
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status dnscachedresolver.cpp:133:   10.0.0.10
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Error static_dns_cache.cpp:89: DNS config file /etc/clearwater/dns.json missing
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.126 UTC [7f6379c8e7c0] Status main.cpp:653: Using local impu store: astaire.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.129 UTC [7f6379c8e7c0] Status http_connection_pool.cpp:37: Connection pool will use calculated response timeout of 550ms
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:35: Configuring HTTP Connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status httpconnection.h:36:   Connection created for server sprout.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io:9888
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status main.cpp:1013: No HSS configured - using Homestead-prov
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status a_record_resolver.cpp:29: Created ARecordResolver
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:266: Configuring store connection
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:267:   Hostname: cassandra.2991678f-f9dd-4098-8fe8-1395c572a307.westeurope.aksapp.io
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:268:   Port: 9160
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:296: Configuring store worker pool
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:297:   Threads: 10
15-05-2018 07:05:40.130 UTC [7f6379c8e7c0] Status cassandra_store.cpp:298:   Max Queue: 0
15-05-2018 07:05:40.131 UTC [7f6366ffd700] Status alarm.cpp:244: Reraising all alarms with a known state
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Error main.cpp:1056: Failed to initialize the Cassandra store with error code 3.
15-05-2018 07:05:40.636 UTC [7f6379c8e7c0] Status main.cpp:1057: Homestead is shutting down
```

The full logs are just this chunk repeating over and over until everything shuts down.
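As for the race against time mentioned above: a couple of kubectl commands might avoid it entirely, since they don't need an interactive session inside the pod. A sketch — the pod name is a placeholder, and kubectl cp requires tar inside the container:

```
# stdout of the already-crashed container instance (survives one restart);
# for Clearwater this is mostly supervisord output
kubectl logs <homestead-pod> --previous

# Copy the file-based logs out in one shot, without racing the restart
# (kubectl cp needs tar inside the container; Ubuntu-based images have it)
kubectl cp <homestead-pod>:/var/log/homestead ./homestead-logs
```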
I think the crucial error message is "DNS config file /etc/clearwater/dns.json missing". In the all-in-one installation, that file is just { "hostnames": [ ] }. Is it OK to manually insert that using the Dockerfile? Or should there be other entries in there for a non-AIO installation? (There's a sketch of the Dockerfile changes below, after my questions.)

dns.json is created by the dnsmasq service. During the container build, there is a red error message saying "dnsmasq: setting capabilities failed: Operation not permitted". (There are a lot of red error messages during the build. Is that an issue?)

I noticed that dnsmasq was not running in the homestead container, yet it is running in my working all-in-one installation. (There are also a bunch of other services which are halted. Is that an issue?) I tried starting dnsmasq manually, and it wouldn't start. I looked online and the suggested solution is to add user=root to /etc/dnsmasq.conf. I tried modifying the Dockerfile to append that string to the file (I'm assuming the order of lines in /etc/dnsmasq.conf doesn't matter). dnsmasq still doesn't come up: I get an error about the port it wants being already taken.

So then I tried modifying the Dockerfile to paste in the dns.json from the AIO installation (the one with an empty list of hosts). Now the missing-file error is gone, but I still get the other error: "Error main.cpp:1056: Failed to initialize the Cassandra store with error code 3."

How can I debug this? All this time I thought that error was caused by DNS issues. Is it because dnsmasq still won't start, so dns.json is missing hosts? Or is that unrelated? "ping cassandra" works from the homestead pod. (There's a connectivity-check sketch below as well.)

Other remaining questions:

* Do I need dnsmasq? The dns.json file it creates is just an empty list of hostnames. Can I just paste that in?
  o { "hostnames": [ ] }
  o Why won't dnsmasq start? ("dnsmasq: failed to create listening socket for 127.0.0.1: Address already in use") Does that even matter?
  o During the image build there's an error: "dnsmasq: setting capabilities failed: Operation not permitted". What's the source of that error?
* There are many other red error messages during the build. Does that matter?
* There are also a bunch of services (mostly Clearwater ones) which are halted in the Docker build, but are running in the AIO build. Is that an issue?
  o Services running in the AIO build:
    [ ? ]  clearwater-auto-config-generic
    [ + ]  clearwater-cluster-manager
    [ + ]  clearwater-config-manager
    [ + ]  clearwater-diags-monitor
    [ - ]  clearwater-etcd
    [ + ]  clearwater-infrastructure
    [ ? ]  clearwater-memcached
    [ + ]  clearwater-queue-manager
  o Services running (or not) in the Docker homestead container:
    [ ? ]  clearwater-auto-config-docker
    [ + ]  clearwater-cluster-manager
    [ - ]  clearwater-config-manager
    [ - ]  clearwater-diags-monitor
    [ + ]  clearwater-etcd
    [ + ]  clearwater-infrastructure
    [ - ]  clearwater-queue-manager
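For concreteness, a minimal sketch of the Dockerfile changes described above (not the exact diff; whether an empty hostnames list is acceptable outside the AIO image is exactly the open question):

```
# Provide /etc/clearwater/dns.json so static_dns_cache.cpp stops complaining.
# Assumption: an empty hostnames list is acceptable for a non-AIO install.
RUN echo '{ "hostnames": [ ] }' > /etc/clearwater/dns.json

# Append user=root so dnsmasq can start without setting capabilities
# (assuming the order of lines in dnsmasq.conf doesn't matter).
RUN echo 'user=root' >> /etc/dnsmasq.conf
```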
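And a sketch of checks that might separate the DNS question from a transport problem: ping succeeding only proves name resolution and ICMP, not that Cassandra's Thrift port (9160, per the log above) accepts TCP connections, and the "Address already in use" error means something already holds the socket dnsmasq wants. This assumes nc and netstat are available inside the container:

```
# Run from inside the homestead container.

# Can we actually open a TCP connection to Cassandra's Thrift port?
# (ping only proves DNS resolution and ICMP reachability)
nc -vz cassandra 9160

# Which process already holds the listening socket dnsmasq wants on port 53?
netstat -tulpn | grep ':53 '
```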
Thanks,
Matthew Davis
Telstra | CTO | Cloud SDN NFV

From: Richard Whitehouse (projectclearwater.org) [mailto:[email protected]]
Sent: Wednesday, 16 May 2018 3:24 AM
To: [email protected]; Davis, Matthew <[email protected]>
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and homestead-prov under Kubernetes

Matthew,

Sorry to hear you are having problems deploying Clearwater on Docker. I think that the socket factory error is actually fairly benign.

Does this happen every time you create your deployment?

Can you provide the config map that you are using when deploying on Kubernetes?

Can you connect into one of the pods that is failing to run homestead-prov and grab the logs? The relevant ones will be under /var/log/homestead-prov inside the container.

We cannot reproduce your issue on our Kubernetes setup - can you provide details of yours: what version of k8s, and what networking configuration are you using?

From the diagnostics we have at the moment, the most likely scenario is that the Homestead and Homestead Prov containers are unable to contact Cassandra - if that's the case, they will continually restart to attempt to resolve the problem.

Richard

From: Clearwater [mailto:[email protected]] On Behalf Of Davis, Matthew
Sent: 14 May 2018 07:05
To: [email protected]
Subject: Re: [Project Clearwater] Issues with clearwater-docker homestead and homestead-prov under Kubernetes

Hi everyone,

I've fixed the issue I had with homestead-prov and submitted the patch as this pull request: https://github.com/Metaswitch/clearwater-docker/pull/89

However, now I've run into another issue: homestead-prov is giving me the same error as homestead. The error message is not verbose. Does anyone know how to make it more verbose? I have no idea what's happening.

```
2018-05-14 05:47:41,369 INFO success: snmpd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:41,370 INFO success: clearwater-infrastructure entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-sig entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,476 INFO success: socket-factory-mgmt entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:42,477 INFO exited: snmpd (exit status 0; expected)
2018-05-14 05:47:42,478 CRIT reaped unknown pid 61)
2018-05-14 05:47:44,572 CRIT reaped unknown pid 65)
2018-05-14 05:47:44,576 CRIT reaped unknown pid 66)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 235)
2018-05-14 05:47:45,883 CRIT reaped unknown pid 236)
2018-05-14 05:47:45,902 CRIT reaped unknown pid 257)
2018-05-14 05:47:48,153 CRIT reaped unknown pid 294)
2018-05-14 05:47:48,226 INFO spawned: 'homestead-prov' with pid 297
2018-05-14 05:47:48,237 INFO spawned: 'nginx' with pid 298
2018-05-14 05:47:48,747 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:49,750 INFO spawned: 'homestead-prov' with pid 324
2018-05-14 05:47:49,750 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-05-14 05:47:50,089 INFO exited: clearwater-infrastructure (exit status 0; expected)
2018-05-14 05:47:50,335 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:52,339 INFO spawned: 'homestead-prov' with pid 335
2018-05-14 05:47:53,324 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:56,330 INFO spawned: 'homestead-prov' with pid 346
2018-05-14 05:47:56,902 INFO exited: homestead-prov (exit status 0; not expected)
2018-05-14 05:47:57,904 INFO gave up: homestead-prov entered FATAL state, too many start retries too quickly
```

Regards,
Matt Davis
Telstra | Graduate Engineer
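On the verbosity question above: if homestead-prov honours Clearwater's shared user_settings mechanism (an assumption - other Clearwater components read log_level from there, per the Clearwater configuration docs), raising the log level might produce more detail:

```
# Assumption: homestead-prov reads log_level from Clearwater's user_settings,
# like other Clearwater components; 5 is believed to be the most verbose level.
echo 'log_level=5' >> /etc/clearwater/user_settings
# Then restart homestead-prov (supervisord respawns it) so it rereads the setting.
```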
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org
