Guys,
   Update: The tlab and the AWS CD systems were in agreement today.
   Note the AWS system is now running 4 x 64G VMs (32 cores and 256G RAM) – I 
am bringing up a 2nd 9 x 16G VM mirror system as well.
   For those requiring AWS: the details/script for configuring the only cloud-native 
part of the cluster – the EFS wrapper on the NFS file share (hosted outside of 
all the VMs as a service) – are detailed below. You can run the EFS script 
after the normal cloud-agnostic oom_rancher_install.sh script in
https://jira.onap.org/browse/LOG-325
https://wiki.onap.org/display/DW/Cloud+Native+Deployment#CloudNativeDeployment-EFS/NFSProvisioningScriptforAWS
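The real provisioning script is on the wiki page above; as a rough sketch of what the EFS-over-NFS mount boils down to (the filesystem ID, region, and mount point below are placeholders, not the actual cluster values), the helper just assembles an NFSv4.1 mount of the EFS endpoint using the AWS-recommended options:

```shell
# Sketch only - see the wiki link above for the actual script.
build_efs_mount_cmd() {
  # $1 = EFS filesystem ID (placeholder), $2 = AWS region, $3 = mount point
  # NFS options below are the AWS-recommended NFSv4.1 settings for EFS
  opts="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2"
  echo "sudo mount -t nfs4 -o ${opts} ${1}.efs.${2}.amazonaws.com:/ ${3}"
}

# Example - fs-12345678 is a hypothetical filesystem ID:
build_efs_mount_cmd fs-12345678 us-east-2 /dockerdata-nfs
```

Because EFS speaks NFSv4.1, the share survives any individual VM in the cluster going down, which is the point of keeping it outside the VMs as a service.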

    There were only 2 failures in a good build earlier today on both systems – 
one likely due to intermittent timing at 41/43. Both systems failed at the 
same time and on the same 2 pods – which is rare.
4 x 64G – AWS
http://jenkins.onap.info/job/oom-cd-master/2971/consoleFull
16:00:18 43 critical tests, 41 passed, 2 failed
9 x 16G - TLAB
https://jenkins.onap.org/view/External%20Labs/job/lab-tlab-beijing-oom-deploy/321/console
21:37:46 14:19:40-0700  43 critical tests, 41 passed, 2 failed

http://kibana.onap.info:5601/app/kibana#/dashboard/AWAtvpS63NTXK5mX2kuS?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2018-05-19T14:46:38.216Z',mode:absolute,to:'2018-05-19T17:22:18.756Z'))&_a=(description:'',filters:!(),options:(darkTheme:!f),panels:!((col:1,id:AWAts77k3NTXK5mX2kuM,panelIndex:1,row:1,size_x:8,size_y:3,type:visualization),(col:9,id:AWAtuTVI3NTXK5mX2kuP,panelIndex:2,row:1,size_x:4,size_y:3,type:visualization),(col:1,id:AWAtuBTY3NTXK5mX2kuO,panelIndex:3,row:7,size_x:6,size_y:3,type:visualization),(col:1,id:AWAttmqB3NTXK5mX2kuN,panelIndex:4,row:4,size_x:6,size_y:3,type:visualization),(col:7,id:AWAtvHtY3NTXK5mX2kuR,panelIndex:6,row:4,size_x:6,size_y:6,type:visualization)),query:(match_all:()),timeRestore:!f,title:'CD%20Health%20Check',uiState:(),viewMode:view)

   ONAP wide Resource allocations:
   The healthcheck, even on this system (and the tlab system), is very 
sensitive to un-optimized rogue containers (logstash being one), the order of 
pods, and readiness/liveness timing. As we fine-tune and prioritize container 
resources we should get better. I am scheduling a performance meeting for 1130 
Thu to go over a couple of the containers causing issues (the ELK stack under 
a 30 logs/sec idle load) – for example the high indexing behavior on the 
logstash DaemonSet, and whether/how elasticsearch should also be a DaemonSet. 
Mike and Mandeep have work scheduled to do most of this for all of OOM in 
general; the meeting will be public and will start with the logging pods.
    A lot of this will be things like ReplicaSet sizes: which deployments to 
switch to DaemonSets (1 container per VM), which to put under autoscalers, 
CPU limits (cores – sorry, no %), RAM limits, colocation rules (which pods 
get affected by others on the same VM), and the cluster VM granularity 
sweet spot (32/16/8G VMs and the tradeoff between limiting VM-local effects 
and reduced colocation). For example, a rogue container hogging 6 of 32 cores 
on a 128G VM in a cluster is less disruptive than one on an 8-core VM, but an 
8G VM may only have enough room for 1 or 2 pods that need huge heaps. However, 
all of these optimizations should be done together with all the PTLs, because 
if we arbitrarily set priorities on RAM/CPU limits for some pods, others will 
be effectively downgraded – we need a hierarchy, which Mike mentioned.
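As a concrete illustration of the per-pod knobs on the table (the numbers here are illustrative placeholders, not tuned recommendations – actual values need to be agreed with the PTLs), the limits/requests stanza in a chart would look something like:

```yaml
# Illustrative only - values to be set per pod together with the PTLs.
resources:
  limits:
    cpu: 2          # whole/fractional cores, not a percentage
    memory: 4Gi
  requests:
    cpu: 500m       # 0.5 core guaranteed at scheduling time
    memory: 1Gi
```

The colocation rules mentioned above would map onto Kubernetes pod affinity/anti-affinity in the same charts.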

     I recommend the windriver/tlab system report the pod list just before the 
final healthcheck, like below, so we can split out healthchecks failing due to 
failed pods from healthchecks failing on running pods – as well as 
healthchecks passing on failing pods (false positives).
     If we use -o wide we can also determine the deployment distribution 
architecture of that particular install – which cluster VM each pod, 
especially the non-DaemonSet ReplicaSet ones, is running on.
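To make that correlation automatic in CD, a small filter over the `kubectl get pods --all-namespaces -o wide` listing can pull out just the pods that are not fully ready or not Running (pod names and statuses in the sample input below are illustrative, and the filter assumes the plain kubectl column layout without log timestamp prefixes):

```shell
# Filter a "kubectl get pods --all-namespaces -o wide" listing down to pods
# that are not fully ready or not Running, so healthcheck failures can be
# correlated with failed pods and their cluster VM.
failing_pods() {
  # Skip the header; READY is col 3 (x/y), STATUS is col 4, NODE is col 8.
  awk 'NR>1 { split($3,r,"/"); if (r[1]!=r[2] || $4!="Running") print $2" "$4" on "$8 }'
}

# Example with a captured listing (in CD this would be piped straight from
# kubectl); only the CrashLoopBackOff pod is printed:
failing_pods <<'EOF'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
onap dep-dcae-tca-analytics-68d749cb4c-7mzjg 2/2 Running 0 3h 10.42.174.250 ip-10-0-0-111.us-east-2.compute.internal
onap sample-failed-pod-5c78f87f4c-vv2cz 0/1 CrashLoopBackOff 12 3h 10.42.21.18 ip-10-0-0-80.us-east-2.compute.internal
EOF
```

Printing this just before the final healthcheck gives us both halves: which healthchecks failed on failed pods, and which passed or failed on running ones.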

Example: a couple of running pods – the full list is 125 of 150+
16:00:01 List of ONAP Modules
16:00:01 NAMESPACE     NAME                                            READY    
 STATUS             RESTARTS   AGE       IP              NODE
16:00:01 onap          dep-config-binding-service-68b4695cb4-l4tst     2/2      
 Running            0          3h        10.42.233.237   
ip-10-0-0-80.us-east-2.compute.internal
16:00:01 onap          dep-dcae-tca-analytics-68d749cb4c-7mzjg         2/2      
 Running            0          3h        10.42.174.250   
ip-10-0-0-111.us-east-2.compute.internal
…..

For robot logs – I think it will be beneficial to add a filebeat sidecar to 
the robot pod, so we can query the ELK stack on 30253 for any robot 
healthcheck and ete logs as well.
https://jira.onap.org/browse/LOG-414
This was mentioned in a previous request:
https://lists.onap.org/pipermail/onap-discuss/2018-April/009199.html
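The sidecar would follow the same filebeat pattern the other logging-enabled ONAP pods use; as a rough sketch only (the image tag, volume names, and mount paths below are assumptions, not the final LOG-414 change):

```yaml
# Sketch of a filebeat sidecar for the robot pod - names/paths are assumed.
- name: filebeat-onap
  image: docker.elastic.co/beats/filebeat:5.5.0
  volumeMounts:
    - name: robot-logs          # emptyDir shared with the robot container
      mountPath: /var/log/onap
    - name: filebeat-conf       # ConfigMap carrying filebeat.yml
      mountPath: /usr/share/filebeat/filebeat.yml
      subPath: filebeat.yml
```

Filebeat would then ship the healthcheck/ete logs into the same ELK stack we already expose on 30253.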

thank you
/michael


_______________________________________________
onap-discuss mailing list
onap-discuss@lists.onap.org
https://lists.onap.org/mailman/listinfo/onap-discuss
