Hi All,

I recently did a deployment of the latest ONAP Casablanca git branch and 
decided to try deploying into AWS using Rancher 2.2 and Amazon’s EKS offering 
(Elastic Kubernetes Service).  In case anyone else tries something similar, 
here are some observations and lessons learned that may save time:

These comments are with respect to Rancher 2.X and EKS in particular.  Rancher 
1.6.X as documented in the ONAP docs works great for me, and Rancher 2.X with 
RKE may work well too, although I haven’t tried it yet.

Regarding Rancher 2.X with EKS:

  *   I tried with each of Rancher 2.X stable, latest and master (2.1.X, 2.1.5, 
2.2.1) and these comments apply to all of these versions in general
  *   Rancher 2.X, to date, doesn’t save or give access to the SSH key for EKS 
worker nodes it spins up and a specific SSH key cannot be specified, so no 
SSHing into the EKS nodes is possible if any troubleshooting needs to be done.
     *   Resolution: After Rancher spun up the EKS control plane cluster and 
worker node cluster, I deleted the worker node cluster and re-created my own 
with an SSH key specified that I had access to
  *   Rancher 2.X uses an older EKS worker node CloudFormation template and 
there is no place to specify a CloudFormation template as an override.  There 
is an AMI override in the Rancher interface, but the EKS docs specifically say 
that the latest AMI needs to be paired with the latest CloudFormation template, 
so without a template override as well, the AMI override alone isn’t very 
useful.
     *   Resolution: Same as above.  I spun up my own EKS worker nodes using 
the latest CloudFormation template and AMI.
  *   Rancher 2.X doesn’t allow worker node deployment to every region where 
AWS supports EKS.  For instance, my basic infrastructure (VPC, subnets, 
bastion host, registry, etc.) started in us-east-2, where AWS supports EKS, 
but Rancher 2.X didn’t offer us-east-2 as an option for cluster/worker 
deployment.  I therefore deployed EKS via Rancher 2.X in us-east-1, which 
caused additional work with multi-region deployment, VPC peering, etc.
     *   Resolution: to keep this as a simple PoC deployment, I migrated my 
basic infrastructure to us-east-1 to avoid multi-region, multi-vpc peering, 
routing, etc.

Regarding ONAP and EFS or NFS/EBS persistent volumes:

  *   I spent a couple of days working with an EFS-mounted /dockerdata-nfs 
directory on each of the K8s worker nodes, which run Amazon Linux 2 by default 
in the EKS cluster.  I ran into a lot of random problems with various ONAP 
database sub-components, and they were not readily repeatable.
     *   Example errors are included below for reference in case anyone else 
runs into the same problem and is searching for a resolution.  A random error 
might cause a component deploy to fail, and after an uninstall and re-install 
the problem would be gone, but then other similar random errors would show up 
in other components or pods where they hadn’t appeared before.
  *   Rancher 2.X automatically sets a storage class of “gp2” as default.  This 
created initial problems with the creation of any persistent volumes until I 
unset the default so there was no default storageclass:

kubectl patch storageclass gp2 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'


  *   Persistent volumes were then created as expected and I didn’t have issues 
until I got to the following SDNC components, whose PVCs refused to bind to 
their respective PVs, resulting in pod failures:

$ kubectl get pvc -n onap | grep -vi bound
NAME                                 STATUS    STORAGECLASS   AGE
onap-sdnc-controller-blueprints-db   Pending                  13m
onap-sdnc-nengdb                     Pending                  13m

---

$ kubectl -n onap describe pvc onap-sdnc-controller-blueprints-db
Name:          onap-sdnc-controller-blueprints-db
Namespace:     onap
StorageClass:
Status:        Pending
Volume:
Labels:        app=controller-blueprints-db
               chart=mariadb-galera-3.0.0
               heritage=Tiller
               release=onap-sdnc
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason         Age                From                         Message
  ----    ------         ----               ----                         -------
  Normal  FailedBinding  3m (x83 over 24m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set



  *   The resolution for me was to switch from EFS backing the /dockerdata-nfs 
store to NFS over EBS, at which point:
     *   All the random failures stopped
     *   While the PVCs for the SDNC pods above still show as “Pending” with 
NFS over EBS, there are no longer any FailedBinding events and the pods start 
successfully (the same is true in a metal install I have using RAID1 block 
storage).
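
The checks described in the bullets above can be repeated with something like 
the following.  This is only a sketch: it assumes a kubectl context pointing 
at the EKS cluster and the "onap" namespace used above, and the helper names 
non_bound_pvcs and check_onap_storage are made up for illustration.  The 
filter matches on the STATUS column rather than grepping for "bound", so it 
prints only the claim names.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pvc` output on stdin and print the
# names of claims whose STATUS column (field 2) is not "Bound".
non_bound_pvcs() {
  awk 'NR > 1 && $2 != "Bound" { print $1 }'
}

# Hypothetical wrapper around the three checks; assumes kubectl is configured
# for the cluster and that ONAP is deployed in the "onap" namespace.
check_onap_storage() {
  # The default class, if any, is marked "(default)" next to its name.
  kubectl get storageclass
  # Names of PVCs that are not bound.
  kubectl get pvc -n onap | non_bound_pvcs
  # Binding failures recorded as events in the namespace.
  kubectl get events -n onap --field-selector reason=FailedBinding
}
```

After sourcing the file, running check_onap_storage prints the three results 
in order.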

It looks like a number of ONAP users are using EFS on AWS for persistent volume 
stores without these issues, so I’m not sure why I had so much difficulty with 
EFS in particular, or why switching to NFS over EBS resolved the issues 
immediately.  Two possibilities come to mind.  First, the OS used on the EKS 
K8s worker nodes is Amazon Linux 2 by default, whereas others may be using 
Ubuntu or something else.  Second, EFS may not have been performing as well 
for me as for others, although there were no indications of any AWS EFS 
outages and command line tests of the EFS filesystem didn’t indicate problems.  
In the past, I’ve also experienced EFS being unable to support certain 
disk-intensive applications, such as an ELK stack.
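
A basic command line test of a shared filesystem might look like the sketch 
below.  It is an illustration, not the exact commands used here: the function 
name fs_write_check is hypothetical, and it assumes GNU dd (for conv=fsync), 
mktemp and cmp are available.  It writes a file of random data, flushes it to 
storage, and compares what is read back.  That exercises plain sequential I/O 
only and does not reproduce InnoDB's file locking or small synchronous writes, 
so a clean result does not rule out database errors.

```shell
#!/bin/sh
# Hypothetical check: write 16 MiB into DIR, flush it, and read it back.
# Returns 0 if the data read back matches what was written.
fs_write_check() {
  dir=$1
  src=$(mktemp) || return 1
  dst="$dir/fs_write_check.$$"
  # Generate the reference data on local disk.
  dd if=/dev/urandom of="$src" bs=1M count=16 2>/dev/null || { rm -f "$src"; return 1; }
  # Copy it into the target directory; conv=fsync (GNU dd) flushes the file
  # to storage before dd exits.
  dd if="$src" of="$dst" bs=1M conv=fsync 2>/dev/null
  rc=$?
  # Compare the copy with the reference (the read may be served from the
  # client cache rather than from the server).
  [ "$rc" -eq 0 ] && cmp -s "$src" "$dst"
  rc=$?
  rm -f "$src" "$dst"
  return "$rc"
}
```

For example: fs_write_check /dockerdata-nfs && echo "write/read OK"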

I still have some work to do figuring out load balancing and ingress to the 
cluster, but that’s for another day 😊.

Thanks,

John

---------------

Example random errors in various DB related pods/containers while using EFS 
which disappeared when switching to NFS over EBS:

---

InnoDB: Doublewrite does not have page_no=0 of space: 0
InnoDB: space header page consists of zero bytes in data file ./ibdata1
InnoDB: Could not open or create the system tablespace. If you tried to add new data files to the system tablespace, and it failed here, you should now edit innodb_data_file_path in my.cnf back to what it was, and remove the new ibdata files InnoDB created in this failed attempt. InnoDB only wrote those files full of zeros, but did not yet use them in any way. But be careful: do not remove old data files which contain your precious data!
Plugin 'InnoDB' init function returned error.
Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Plugin 'FEEDBACK' is disabled.
Could not open mysql.plugin table. Some plugins may be not loaded
Unknown/unsupported storage engine: innodb
Aborting

---

2019-01-19  2:48:38 0 [Note] mysqld (mysqld 10.3.10-MariaDB-1:10.3.10+maria~bionic) starting as process 1 ...
2019-01-19  2:48:38 0 [Note] InnoDB: Using Linux native AIO
2019-01-19  2:48:38 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-01-19  2:48:38 0 [Note] InnoDB: Uses event mutexes
2019-01-19  2:48:38 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2019-01-19  2:48:38 0 [Note] InnoDB: Number of pools: 1
2019-01-19  2:48:38 0 [Note] InnoDB: Using SSE2 crc32 instructions
2019-01-19  2:48:38 0 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M
2019-01-19  2:48:38 0 [Note] InnoDB: Completed initialization of buffer pool
2019-01-19  2:48:38 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2019-01-19  2:48:38 0 [ERROR] InnoDB: Header page consists of zero bytes in datafile: ./ibdata1, Space ID:0, Flags: 0. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2019-01-19  2:48:38 0 [ERROR] InnoDB: Corrupted page [page id: space=0, page number=0] of datafile './ibdata1' could not be found in the doublewrite buffer.
2019-01-19  2:48:38 0 [ERROR] InnoDB: Plugin initialization aborted with error Data structure corruption
2019-01-19  2:48:38 0 [Note] InnoDB: Starting shutdown...

---


2019-01-19 22:25:04 139725913889536 [ERROR] InnoDB: Unable to lock ./nengdb/NELGEN_MESSAGE.ibd, error: 122
2019-01-19 22:25:04 7f1479769b00  InnoDB: Operating system error number 122 in a file operation.
InnoDB: Error number 122 means 'Disk quota exceeded'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
2019-01-19 22:25:04 139725913889536 [ERROR] InnoDB: Cannot create file './nengdb/NELGEN_MESSAGE.ibd'


---

The Amazon EFS troubleshooting document mentions some errors similar to these: 
https://docs.aws.amazon.com/efs/latest/ug/troubleshooting.html

John


View/Reply Online (#15061): https://lists.onap.org/g/onap-discuss/message/15061