Hi All,
I recently deployed the latest ONAP Casablanca git branch and decided to try
deploying it into AWS using Rancher 2.2 and Amazon’s EKS offering (Elastic
Kubernetes Service). In case anyone else tries something similar, here are some
observations and lessons learned that may save time:
These comments apply to Rancher 2.X with EKS in particular. Rancher 1.6.X as
documented in the ONAP docs works great for me, and Rancher 2.X with RKE may
work great as well, although I haven’t tried it yet.
Regarding Rancher 2.X with EKS:
* I tried each of Rancher 2.X stable, latest and master (2.1.X, 2.1.5,
2.2.1), and these comments apply to all of these versions in general.
* Rancher 2.X, to date, doesn’t save or give access to the SSH key for the EKS
worker nodes it spins up, and a specific SSH key cannot be specified, so it is
not possible to SSH into the EKS nodes if any troubleshooting needs to be done.
* Resolution: After Rancher spun up the EKS control plane cluster and
worker node cluster, I deleted the worker node cluster and re-created my own
with an SSH key that I had access to.
* Rancher 2.X uses an older EKS worker node CloudFormation template, and
there is no place to specify a CloudFormation template as an override. There
is an AMI override in the Rancher interface, but the EKS docs specifically say
that the latest AMI needs to be paired with the latest CloudFormation template,
so without a template override as well, the AMI override alone isn’t very
useful.
* Resolution: Same as above. I spun up my own EKS worker nodes using the
latest CloudFormation template and AMI.
* Rancher 2.X doesn’t allow worker node deployment to all regions where AWS
supports EKS. For instance, my basic infrastructure (VPC, subnets, bastion
host, registry, etc.) started in us-east-2, where AWS supports EKS, but
Rancher 2.X didn’t offer us-east-2 as an option for cluster/worker
deployment. I therefore deployed EKS via Rancher 2.X in us-east-1, which caused
additional work with multi-region deployment, VPC peering, etc.
* Resolution: To keep this as a simple PoC deployment, I migrated my
basic infrastructure to us-east-1 to avoid multi-region, multi-VPC peering,
routing, etc.
Regarding ONAP and EFS or NFS/EBS persistent volumes:
* I spent a couple of days working with an EFS-mounted /dockerdata-nfs
directory on each of the K8s worker nodes, which run Amazon Linux 2 by default
in the EKS cluster. I ran into a lot of random problems with various ONAP
database sub-components, and they were not readily repeatable.
* Example errors are included below for reference in case anyone else
runs into the same problem and is searching for a resolution. A random error
might cause a component deploy to fail, and after an uninstall and re-install
the problem would be gone, but then other similar random errors would show up
in other components or pods where they hadn’t appeared before.
* Rancher 2.X automatically sets a storage class of “gp2” as default. This
created initial problems with the creation of any persistent volumes until I
unset the default so there was no default storageclass:
kubectl patch storageclass gp2 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
* Persistent volumes were then created as expected and I didn’t have issues
until I got to the following SDNC components, whose PVCs refused to bind to
their respective PVs, resulting in pod failures:
$ kubectl get pvc -n onap | grep -vi bound
NAME                                 STATUS    STORAGECLASS   AGE
onap-sdnc-controller-blueprints-db   Pending                  13m
onap-sdnc-nengdb                     Pending                  13m
---
$ kubectl -n onap describe pvc onap-sdnc-controller-blueprints-db
Name:          onap-sdnc-controller-blueprints-db
Namespace:     onap
StorageClass:
Status:        Pending
Volume:
Labels:        app=controller-blueprints-db
               chart=mariadb-galera-3.0.0
               heritage=Tiller
               release=onap-sdnc
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason         Age                From                         Message
  ----    ------         ----               ----                         -------
  Normal  FailedBinding  3m (x83 over 24m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set
* The resolution for me was to switch the backing store for /dockerdata-nfs
from EFS to NFS over EBS, at which point:
* All the random failures stopped.
* While the PVCs for the SDNC pods above still show as “Pending” with NFS
over EBS, there are no longer any FailedBinding events and the pods start
successfully (the same is true in a bare-metal install I have using RAID1 block
storage).
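
For anyone repeating the storage class step above, a quick check that no class is still flagged as default might look like the sketch below. It assumes the usual `kubectl get storageclass` table layout, in which the default class is printed as `gp2 (default)`. The function name `default_storage_classes` is hypothetical and not part of kubectl.

```shell
#!/bin/sh
# Reads `kubectl get storageclass` output on stdin and prints the name of
# every class flagged "(default)". Prints nothing if no default is set.
# Assumption: the default marker appears as a separate second field,
# e.g. "gp2 (default)   kubernetes.io/aws-ebs   5d".
default_storage_classes() {
  # Skip the header row, then match rows whose second field is the marker.
  awk 'NR > 1 && $2 == "(default)" { print $1 }'
}

# Against a live cluster (needs kubectl and cluster access):
#   kubectl get storageclass | default_storage_classes
```

Empty output after the patch would suggest that gp2 is no longer the default.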
It looks like a number of ONAP users are using EFS on AWS for persistent volume
stores without these issues, so I’m not sure why I was having so much
difficulty with EFS in particular, or why switching to NFS over EBS resolved the
issues immediately. Two possibilities come to mind. First, the OS used on the
EKS K8s worker nodes is Amazon Linux 2 by default, whereas others may be using
Ubuntu or something else. Second, EFS may not have been performing as well for
me as for others, although there were no indications of any AWS EFS outages and
command line tests of the EFS filesystem didn’t indicate problems. In the
past, I’ve also seen EFS unable to support certain disk-intensive
applications, such as an ELK stack.
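
A command line check of the kind mentioned above could be sketched as follows. This is not the test that was actually run. It is an illustrative probe, assuming a Linux node with GNU or busybox `dd`, that writes random data with an fsync, compares it on read-back, and then tries to take an exclusive lock, since the errors listed at the end of this message involve zeroed pages and a failed lock. The function name `probe_mount` is hypothetical.

```shell
#!/bin/sh
# probe_mount DIR: basic write/read-back/lock probe for a mounted directory.
# Prints "ok" or a short failure reason; returns non-zero on failure.
# Caveat: the read-back may be served from the client page cache, so a pass
# does not prove the server persisted the data correctly.
probe_mount() {
  dir=$1
  src=$(mktemp) || return 1
  dst="$dir/probe.$$"
  result=ok

  # 1 MiB of random (non-zero) data as the reference copy.
  head -c 1048576 /dev/urandom > "$src"

  # conv=fsync makes dd flush the output file before it exits.
  if ! dd if="$src" of="$dst" bs=64k conv=fsync 2>/dev/null; then
    result="write failed"
  elif ! cmp -s "$src" "$dst"; then
    result="readback mismatch"
  # The lock step is skipped if flock (util-linux) is not installed.
  elif command -v flock >/dev/null 2>&1 && ! flock -n -x "$dst" true; then
    result="lock failed"
  fi

  rm -f "$src" "$dst"
  echo "$result"
  [ "$result" = ok ]
}
```

Usage would be along the lines of `probe_mount /dockerdata-nfs` on each worker node. A single pass is weak evidence either way, since the failures described here were intermittent.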
I still have some work to do figuring out load balancing and ingress for the
cluster, but that’s for another day. 😊
Thanks,
John
---------------
Example random errors in various DB related pods/containers while using EFS
which disappeared when switching to NFS over EBS:
---
InnoDB: Doublewrite does not have page_no=0 of space: 0
InnoDB: space header page consists of zero bytes in data file ./ibdata1
InnoDB: Could not open or create the system tablespace. If you tried to add new
data files to the system tablespace, and it failed here, you should now edit
innodb_data_file_path in my.cnf back to what it was, and remove the new ibdata
files InnoDB created in this failed attempt. InnoDB only wrote those files full
of zeros, but did not yet use them in any way. But be careful: do not remove
old data files which contain your precious data!
Plugin 'InnoDB' init function returned error.
Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Plugin 'FEEDBACK' is disabled.
Could not open mysql.plugin table. Some plugins may be not loaded
Unknown/unsupported storage engine: innodb
Aborting
---
2019-01-19 2:48:38 0 [Note] mysqld (mysqld 10.3.10-MariaDB-1:10.3.10+maria~bionic) starting as process 1 ...
2019-01-19 2:48:38 0 [Note] InnoDB: Using Linux native AIO
2019-01-19 2:48:38 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-01-19 2:48:38 0 [Note] InnoDB: Uses event mutexes
2019-01-19 2:48:38 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2019-01-19 2:48:38 0 [Note] InnoDB: Number of pools: 1
2019-01-19 2:48:38 0 [Note] InnoDB: Using SSE2 crc32 instructions
2019-01-19 2:48:38 0 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M
2019-01-19 2:48:38 0 [Note] InnoDB: Completed initialization of buffer pool
2019-01-19 2:48:38 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-01-19 2:48:38 0 [ERROR] InnoDB: Header page consists of zero bytes in datafile: ./ibdata1, Space ID:0, Flags: 0. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2019-01-19 2:48:38 0 [ERROR] InnoDB: Corrupted page [page id: space=0, page number=0] of datafile './ibdata1' could not be found in the doublewrite buffer.
2019-01-19 2:48:38 0 [ERROR] InnoDB: Plugin initialization aborted with error Data structure corruption
2019-01-19 2:48:38 0 [Note] InnoDB: Starting shutdown...
---
2019-01-19 22:25:04 139725913889536 [ERROR] InnoDB: Unable to lock ./nengdb/NELGEN_MESSAGE.ibd, error: 122
2019-01-19 22:25:04 7f1479769b00 InnoDB: Operating system error number 122 in a file operation.
InnoDB: Error number 122 means 'Disk quota exceeded'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
2019-01-19 22:25:04 139725913889536 [ERROR] InnoDB: Cannot create file './nengdb/NELGEN_MESSAGE.ibd'
---
The Amazon EFS troubleshooting document mentions some errors similar to these:
https://docs.aws.amazon.com/efs/latest/ug/troubleshooting.html
John