Hey Matthew,
Sorry to hear it isn't working yet. Looking at the output of rake test, you're
hitting a 403 Forbidden in RestClient, so the failure is in the HTTP calls the
test framework makes.
I think your issue may just be a mismatch between the value of 'SIGNUP_CODE'
that you're passing to the rake test command and what's configured on your
deployment. This should default to 'secret', and I don't see anything in your
configmap that would change it, so can you double check that the test command
includes SIGNUP_CODE='secret'? If you have made any changes to the signup key
then it will obviously need to match those. If you want to double check the
value, take a look in the /etc/clearwater/shared_config file in one of the pods.
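For example, something like this (just a sketch - I've used the ellis pod here,
but any of the clearwater pods should have the same file):
```
# Check the signup key the deployment is actually using.
ELLIS=$(kubectl get pods | grep ellis | awk '{ print $1 }')
kubectl exec -it $ELLIS -- grep -i signup /etc/clearwater/shared_config
```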
I have double checked a deployment on our rig, copying your provided yaml files
over directly to make sure there isn't anything odd in them, and the live tests
ran fine following the deployment steps you're using. I did have some trouble
getting your script to work as copied over, but I think that's just Outlook
mangling the quoting; running the commands manually, it all worked as expected.
The only changes I made were:
* Changed the PUBLIC_IP to match my rig IP.
* Changed the image pull source to our internal one.
* Changed the zone in the config map to my own one, 'ajl.svc.cw-k8s.test'.
With this all deployed, the following test command passed with no issue:
rake test[ajl.svc.cw-k8s.test] PROXY=10.230.16.1 PROXY_PORT=30060
SIGNUP_CODE='secret' ELLIS=10.230.16.1:30080 TESTS="Basic call - mainline"
If the issue isn't the signup key, can we try getting some more diags that we
can take a look into? In particular, I think we would benefit from:
* A packet capture on the node you are running the live tests on, taken while
you hit the errors below.
* The bono logs, at debug level, from the same time. To set up debug logging,
add 'log_level=5' to /etc/clearwater/user_settings (creating the file if
needed) and restart the bono service; there's a sketch of the commands just
after this list.
* The ellis logs from the same time.
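For the bono debug logging, something like this should do it (a sketch; the
pod-name lookup follows the style of your script, and '-c bono' targets the
main container since the pod also runs the log-tailing sidecar):
```
# Enable debug logging in the bono container and restart the service.
BONO=$(kubectl get pods | grep bono | awk '{ print $1 }')
kubectl exec -it $BONO -c bono -- /bin/bash -c \
  "echo 'log_level=5' >> /etc/clearwater/user_settings && service bono restart"
# Debug-level logs will then be written under /var/log/bono in the pod.
```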
Running tcpdump on the test node should mean we see the full set of flows, and
you can likely read through the capture yourself to spot any further issues
hiding behind this one.
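Something along these lines on the test node should be enough (the filename is
just an example, and you can trim the filter if the capture gets too big):
```
# Capture all traffic to/from the cluster node while the failing test runs.
sudo tcpdump -i any -s 0 -w clearwater-livetest.pcap host 10.3.1.76
# Stop with Ctrl-C once the test has failed, then send over the .pcap.
```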
Any other diagnostics you can gather would obviously also be useful, but with
the above, assuming traffic is reaching the pods, we should be able to work out
the issue.
On your connectivity tests, you won't be able to connect to the bono service
using 'nc localhost 30060', because that attempts to connect to the localhost
IP. We have set the bono service up to listen on the 'PUBLIC_IP', i.e. the host
IP. If you run 'nc 10.3.1.76 30060 -v' (or whichever host IP you have
configured it to listen on), you should see a successful connection.
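For example, from the test node (the exact wording of the success message
depends on which nc variant you have installed):
```
# Substitute whichever host IP you have configured bono to listen on.
nc -v 10.3.1.76 30060
```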
The output you are seeing on restarting bono is also benign. It's simply an
artefact of some behaviour we want to have in VMs, but that isn't needed in
these containers, so you can safely ignore it.
Good luck, and let us know where you get with debugging.
Cheers,
Adam
From: Davis, Matthew [mailto:[email protected]]
Sent: 30 May 2018 08:27
To: Adam Lindley <[email protected]>;
[email protected]
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and
homestead-prov under Kubernetes
Hi Adam,
# Openstack Install
I only mentioned the Helm charts in case the almost-empty charts on my machine
were the source of the error. I personally have no experience with Helm, so I
can't help you with any development there.
I applied the latest change to the bono service port number, but it still
doesn't work.
How can I check whether the rake tests are failing on the bono side, as opposed
to the ellis side? Maybe the reason tcpdump shows nothing in bono during the
rake tests is that the tests failed to create a user in ellis and never got to
the bono part?
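One rough way I can think of to check that is to poke ellis directly from the
test node and then look at its logs around a failed run (a sketch only; I'm
assuming ellis writes its logs under /var/log/ellis, mirroring bono's
/var/log/bono):
```
# Is ellis reachable on its NodePort at all?
curl -v http://10.3.1.76:30080/
# Grab the ellis pod name and look at its recent logs after a failed rake run.
ELLIS=$(kubectl get pods | grep ellis | awk '{ print $1 }')
kubectl exec -it $ELLIS -- ls /var/log/ellis
kubectl exec -it $ELLIS -- tail -n 100 /var/log/ellis/<most-recent-log-file>
```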
Rake output:
```
Basic Call - Mainline (TCP) - Failed
RestClient::Forbidden thrown:
- 403 Forbidden
-
/home/ubuntu/.rvm/gems/ruby-1.9.3-p551/gems/rest-client-1.8.0/lib/restclient/abstract_response.rb:74:in
`return!'
-
/home/ubuntu/.rvm/gems/ruby-1.9.3-p551/gems/rest-client-1.8.0/lib/restclient/request.rb:495:in
`process_result'
...
```
If I go inside the ellis container and run `nc localhost 80 -v`, it establishes
a connection.
If I go inside the bono container and run `nc localhost 5060 -v` or `nc
localhost 30060 -v`, it fails to connect. So from within the bono pod I cannot
connect to localhost. To me that suggests the problem is caused by something
inside bono, not the networking between pods. What happens when you try `nc
localhost 30060 -v` in your deployment?
The logs inside bono are an echo of an error message from sprout. Does that
matter?
```
30-05-2018 06:46:59.432 UTC [7f45527e4700] Status sip_connection_pool.cpp:428:
Recycle TCP connection slot 4
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:244: Reraising all
alarms with a known state
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:37: sprout issued
1012.3 alarm
30-05-2018 06:47:06.222 UTC [7f4575ae0700] Status alarm.cpp:37: sprout issued
1013.3 alarm
```
Those timestamps don't correspond to the rake tests. They just happen every 30
seconds.
When I restart the bono service it says: `63: ulimit: open files: cannot modify
limit: Invalid argument`
Does that matter? (I've seen that error message everywhere; I have no idea what
it means.)
I've appended the yaml files to the end of this email.
# Azure Install
I had a chat with Microsoft, and it seems your hunch was correct: HTTP
Application Routing only works on ports 80 and 443. Furthermore, I cannot
simply route SIP calls through port 443, because the routing does some
HTTP-specific packet inspection. So I'll have to give up on that approach and
go for a more vanilla, manually configured NodePort approach (either still on
AKS but without HTTP Application Routing, or on Openstack). So I'm even more
keen to solve the aforementioned issues.
# Yamls and stuff
I'm pasting them again just in case I've forgotten something. 10.3.1.76 is the
IP address of my cluster.
Here's a script I'm using to tear down and rebuild everything (just in case
`kubectl apply -f something.yaml` doesn't actually propagate the change fully).
The while loops in this script just wait until the previous step has finished.
```
set -x
cd clearwater-docker-master/kubernetes
kubectl delete -f ./
kubectl delete configmap env-vars
set -e
echo 'waiting until old pods are all deleted'
while [ $(kubectl get pods | grep ^NAME -v | wc -l ) -ne 0 ]
do
sleep 5
done
echo "creating new pods"
kubectl create configmap env-vars --from-literal=ZONE=default.svc.cluster.local
kubectl apply -f ./
while [ $(kubectl get pods | grep "2/2" | grep bono | wc -l) -neq 1 ]
do
sleep 5
done
BONO=$(kubectl get pods | grep "2/2" | grep bono | awk '{ print $1 }')
echo "Bono is up as $BONO"
kubectl exec -it $BONO -- apt-get -y install vim
kubectl exec -it $BONO -- sed -i -e 's/--pcscf=5060,5058/--pcscf=30060,5058/g' /etc/init.d/bono
kubectl exec -it $BONO -- service bono restart
while [ $(kubectl get pods | grep "0/" | wc -l) -neq 0 ]
do
sleep 5
done
echo "All pods are up now"
kubectl get pods
echo "Done"
```
`kubectl get services`
```
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                         AGE
astaire          ClusterIP   None           <none>        11311/TCP                                       1h
bono             NodePort    10.0.168.197   <none>        3478:32214/TCP,30060:30060/TCP,5062:30144/TCP   1h
cassandra        ClusterIP   None           <none>        7001/TCP,7000/TCP,9042/TCP,9160/TCP             1h
chronos          ClusterIP   None           <none>        7253/TCP                                        1h
ellis            NodePort    10.0.53.199    <none>        80:30080/TCP                                    1h
etcd             ClusterIP   None           <none>        2379/TCP,2380/TCP,4001/TCP                      1h
homer            ClusterIP   None           <none>        7888/TCP                                        1h
homestead        ClusterIP   None           <none>        8888/TCP                                        1h
homestead-prov   ClusterIP   None           <none>        8889/TCP                                        1h
kubernetes       ClusterIP   10.0.0.1       <none>        443/TCP                                         5d
ralf             ClusterIP   None           <none>        10888/TCP                                       1h
sprout           ClusterIP   None           <none>        5052/TCP,5054/TCP                               1h
```
bono-depl.yaml
```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: bono
spec:
  replicas: 1
  selector:
    matchLabels:
      service: bono
  template:
    metadata:
      labels:
        service: bono
        snmp: enabled
    spec:
      containers:
      - image: "mlda065/bono:latest"
        imagePullPolicy: Always
        name: bono
        ports:
        - containerPort: 22
        - containerPort: 3478
        - containerPort: 5060
        - containerPort: 5062
        - containerPort: 5060
          protocol: "UDP"
        - containerPort: 5062
          protocol: "UDP"
        envFrom:
        - configMapRef:
            name: env-vars
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PUBLIC_IP
          value: 10.3.1.76
        livenessProbe:
          exec:
            command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "3478 5062"]
          initialDelaySeconds: 30
        readinessProbe:
          exec:
            command: ["/bin/bash", "/usr/share/kubernetes/liveness.sh", "3478 5062"]
        volumeMounts:
        - name: bonologs
          mountPath: /var/log/bono
      - image: busybox
        name: tailer
        command: [ "tail", "-F", "/var/log/bono/bono_current.txt" ]
        volumeMounts:
        - name: bonologs
          mountPath: /var/log/bono
      volumes:
      - name: bonologs
        emptyDir: {}
      imagePullSecrets:
      - name: ~
      restartPolicy: Always
```
bono-svc.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: bono
spec:
  type: NodePort
  ports:
  - name: "3478"
    port: 3478
  - name: "5060"
    port: 30060
    nodePort: 30060
  - name: "5062"
    port: 5062
  selector:
    service: bono
```
ellis-depl.yaml
```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ellis
spec:
  replicas: 1
  template:
    metadata:
      labels:
        service: ellis
    spec:
      containers:
      - image: "mlda065/ellis:latest"
        imagePullPolicy: Always
        name: ellis
        ports:
        - containerPort: 22
        - containerPort: 80
        envFrom:
        - configMapRef:
            name: env-vars
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
        readinessProbe:
          tcpSocket:
            port: 80
      imagePullSecrets:
      - name: ~
      restartPolicy: Always
```
ellis-svc.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: ellis
spec:
  type: NodePort
  ports:
  - name: "http"
    port: 80
    nodePort: 30080
  selector:
    service: ellis
```
`kubectl describe configmap env-vars`
```
Name: env-vars
Namespace: default
Labels: <none>
Annotations:
kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","data":{"ZONE":"default.svc.cluster.local"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"env-vars","namespace":"default"}...
Data
====
ZONE:
----
default.svc.cluster.local
Events: <none>
```
Thanks,
Matt
Telstra Graduate Engineer
CTO | Cloud SDN NFV
From: Adam Lindley [mailto:[email protected]]
Sent: Friday, 25 May 2018 6:42 PM
To: Davis, Matthew <[email protected]>; [email protected]
Subject: RE: [Project Clearwater] Issues with clearwater-docker homestead and
homestead-prov under Kubernetes
Hi Matthew,
Our Helm support is a recent addition, and came from another external
contributor. See the Pull Request at
https://github.com/Metaswitch/clearwater-docker/pull/85 for the details :)
As it stands at the moment, the chart is good enough for deploying and
re-creating a full standard deployment through Helm, but I don't believe it
handles as many of the complexities of upgrading a clearwater deployment as it
potentially could.
We haven't yet done any significant work in setting up Helm charts, or
integrating with them in a more detailed manner, so if that's something you're
interested in as well, we'd love to work with you to get some more enhancements
in, especially if you have other expert contacts who know this area well.
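(For anyone following along: with Helm 2 syntax, deploying from the chart looks
roughly like the below. The chart directory and release name here are
placeholders, not the real layout - see the PR above for that.)
```
# Illustrative only -- substitute the actual chart directory from the repo.
git clone https://github.com/Metaswitch/clearwater-docker.git
cd clearwater-docker
helm install --name clearwater ./<path-to-chart>
```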
(I'm removing some of the thread in the email below, to keep us below the list
limits. The online archives will keep all the info though)
Cheers,
Adam
_______________________________________________
Clearwater mailing list
[email protected]
http://lists.projectclearwater.org/mailman/listinfo/clearwater_lists.projectclearwater.org