Guys,

Update: the tlab and AWS CD systems were in agreement today. Note that the AWS system is now running 4 x 64G VMs (32 cores and 256G RAM); I am also bringing up a second, 9 x 16G VM mirror system.

For those requiring AWS: the only cloud-native part of the cluster is the EFS wrapper on the NFS file share (it will be hosted outside of all the VMs as a service). The details and script for configuring it are at the links below. You can run the EFS script after the normal cloud-agnostic oom_rancher_install.sh script.
https://jira.onap.org/browse/LOG-325
https://wiki.onap.org/display/DW/Cloud+Native+Deployment#CloudNativeDeployment-EFS/NFSProvisioningScriptforAWS

There were only 2 failures in a good build earlier today on both systems, one of them likely due to intermittent timing. Both systems came in at 41/43, at the same time and on the same 2 pods, which is rare.

4 x 64G - AWS
http://jenkins.onap.info/job/oom-cd-master/2971/consoleFull
16:00:18 43 critical tests, 41 passed, 2 failed

9 x 16G - TLAB
https://jenkins.onap.org/view/External%20Labs/job/lab-tlab-beijing-oom-deploy/321/console
21:37:46 14:19:40-0700 43 critical tests, 41 passed, 2 failed

Kibana CD Health Check dashboard:
http://kibana.onap.info:5601/app/kibana#/dashboard/AWAtvpS63NTXK5mX2kuS?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2018-05-19T14:46:38.216Z',mode:absolute,to:'2018-05-19T17:22:18.756Z'))&_a=(description:'',filters:!(),options:(darkTheme:!f),panels:!((col:1,id:AWAts77k3NTXK5mX2kuM,panelIndex:1,row:1,size_x:8,size_y:3,type:visualization),(col:9,id:AWAtuTVI3NTXK5mX2kuP,panelIndex:2,row:1,size_x:4,size_y:3,type:visualization),(col:1,id:AWAtuBTY3NTXK5mX2kuO,panelIndex:3,row:7,size_x:6,size_y:3,type:visualization),(col:1,id:AWAttmqB3NTXK5mX2kuN,panelIndex:4,row:4,size_x:6,size_y:3,type:visualization),(col:7,id:AWAtvHtY3NTXK5mX2kuR,panelIndex:6,row:4,size_x:6,size_y:6,type:visualization)),query:(match_all:()),timeRestore:!f,title:'CD%20Health%20Check',uiState:(),viewMode:view)

ONAP-wide resource allocations:

The healthcheck, even on this system (and the tlab system), is very sensitive to un-optimized rogue containers (logstash being one), the order of pods, and readiness/liveness timing. As we fine-tune and prioritize container resources we should get better results.

I am scheduling a performance meeting for 1130 Thu to go over a couple of the containers causing issues (the ELK stack under a 30 logs/sec idle load), such as the high indexing behavior on the logstash DaemonSet and whether/how elasticsearch should also be a DaemonSet. Mike and Mandeep have work scheduled to do most of this for all of OOM in general. The meeting will be public and will start with the logging pods.

A lot of this will be things like:
- ReplicaSet sizes
- which pods to switch to DaemonSets (1 container per VM)
- which pods should use autoscalers
- CPU limits (in cores, sorry, no %)
- RAM limits
- collocation rules (which pods get affected by others on the same VM)
- the cluster VM granularity sweet spot (32/16/8G VMs, and the tradeoff of limiting VM-local effects at the cost of reduced collocation)

For example, a rogue container hogging 6 of 32 cores on a 128G VM in a cluster has less impact than one on an 8-core VM, but an 8G VM may only have enough room for 1 or 2 pods that need huge heaps.

However, all of these optimizations should be done together with all the PTLs, because if we arbitrarily set priorities on RAM/CPU limits for some pods, others will be effectively downgraded. We need a hierarchy, which Mike mentioned.

I recommend that the windriver/tlab system report the pod list just before the final healthcheck, as below, so we can separate healthchecks failing due to failed pods from healthchecks failing on running pods, as well as healthchecks passing on failing pods (false positives). If we use -o wide we can also determine the deployment distribution architecture of that particular install: which cluster VM each pod is running on, especially the non-DaemonSet ReplicaSet ones.

Example with a couple of running pods (the list is 125 of 150+):

16:00:01 List of ONAP Modules
16:00:01 NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
16:00:01 onap dep-config-binding-service-68b4695cb4-l4tst 2/2 Running 0 3h 10.42.233.237 ip-10-0-0-80.us-east-2.compute.internal
16:00:01 onap dep-dcae-tca-analytics-68d749cb4c-7mzjg 2/2 Running 0 3h 10.42.174.250 ip-10-0-0-111.us-east-2.compute.internal
…..

For robot logs, I think it will be beneficial to add a filebeat sidecar to the robot pod, so we can also query the ELK stack on 30253 for any robot healthcheck and ETE logs.
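
As a rough illustration of how the pod list recommended above could be post-processed, here is a minimal Python sketch. It assumes the 8-column `kubectl get pods --all-namespaces -o wide` layout shown in the example (optionally prefixed with a Jenkins console timestamp). All function names are made up for this sketch, the third sample pod is hypothetical, and treating "Completed" pods as healthy is my assumption, not an agreed rule.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\s+")  # Jenkins console prefix
READY = re.compile(r"^(\d+)/(\d+)$")              # e.g. "2/2"


class Pod(NamedTuple):
    namespace: str
    name: str
    ready: int
    total: int
    status: str
    restarts: int
    node: str

    @property
    def healthy(self) -> bool:
        # Assumption: finished jobs ("Completed") are not counted as failures.
        if self.status == "Completed":
            return True
        return self.status == "Running" and self.ready == self.total


def parse_pod_line(line: str) -> Optional[Pod]:
    """Parse one '-o wide' row; return None for headers and other text."""
    fields = TIMESTAMP.sub("", line.strip()).split()
    if len(fields) < 8:
        return None
    ready = READY.match(fields[2])
    if ready is None or not fields[4].isdigit():
        return None
    return Pod(fields[0], fields[1], int(ready.group(1)), int(ready.group(2)),
               fields[3], int(fields[4]), fields[7])


def parse_pod_list(text: str) -> List[Pod]:
    pods = (parse_pod_line(line) for line in text.splitlines())
    return [pod for pod in pods if pod is not None]


def pods_per_node(pods: List[Pod]) -> Counter:
    """Deployment distribution: how many pods landed on each cluster VM."""
    return Counter(pod.node for pod in pods)


def classify(hc_passed: bool, pods_healthy: bool) -> str:
    """Split healthcheck results the way the recommendation describes."""
    if hc_passed:
        return "pass" if pods_healthy else "false positive (HC passed, pod failing)"
    return "HC failed on running pods" if pods_healthy else "HC failed due to failed pods"


# First two pod rows are from the example above; the last row is hypothetical.
SAMPLE = """\
16:00:01 List of ONAP Modules
16:00:01 NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
16:00:01 onap dep-config-binding-service-68b4695cb4-l4tst 2/2 Running 0 3h 10.42.233.237 ip-10-0-0-80.us-east-2.compute.internal
16:00:01 onap dep-dcae-tca-analytics-68d749cb4c-7mzjg 2/2 Running 0 3h 10.42.174.250 ip-10-0-0-111.us-east-2.compute.internal
16:00:01 onap example-failing-pod-0 0/1 CrashLoopBackOff 12 3h 10.42.0.1 ip-10-0-0-80.us-east-2.compute.internal
"""

if __name__ == "__main__":
    pods = parse_pod_list(SAMPLE)
    for pod in pods:
        if not pod.healthy:
            print("not healthy:", pod.name, pod.status, f"{pod.ready}/{pod.total}")
    for node, count in pods_per_node(pods).items():
        print(node, count)
```

Mapping each healthcheck test to the pods it depends on is left out here, since that mapping is not defined anywhere in this thread; `classify` only shows the four buckets once that mapping exists.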

For the robot logs, see https://jira.onap.org/browse/LOG-414 (this was mentioned in a previous request: https://lists.onap.org/pipermail/onap-discuss/2018-April/009199.html).

thank you
/michael