If you can collect one-minute CP MONITOR data, I would be glad to analyze
it for you. I have the best CP performance analysis tools at my disposal.
On 2/6/2021 3:44 AM, Mariusz Walczak wrote:
Hello,
I hope someone can share their experience or shed some light on the
problem. I will refer to elasticsearch as "ES" in this email.
We are running 2 OpenShift clusters on 1 z/VM LPAR (16 logical CPUs, SMT-2).
Cluster 1 (development workload):
3x Master node (each zLinux 8 vCPU) (VSWITCH-1 VLAN 1)
3x Worker node (each 10 vCPU) (VSWITCH-1 VLAN 1)
1x Infra node (6 vCPU) (VSWITCH-1 VLAN 1) ("ES" ON) (high CPU use)
Cluster 2 (no workload, just "ES" ON):
3x Master node (each zLinux 4 vCPU) (VSWITCH-2 VLAN 2)
4x Worker node (each 4 vCPU) (VSWITCH-2 VLAN 2)
2x Infra node (6 vCPU) (VSWITCH-2 VLAN 2) ("ES" ON on each) (high CPU use)
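
As a quick sanity check on the layout above, the configured vCPUs can be
tallied against the 16 logical CPUs of the LPAR. This is only a sketch: the
node counts and vCPU sizes are taken from the lists above, the names
cluster1, cluster2 and total_vcpus are made up for illustration, and other
guests on the LPAR (for example the DB) are not counted. The ratio is not a
diagnosis by itself, only context for the steal numbers.

```python
# (number of guests, vCPUs per guest), copied from the cluster lists above
cluster1 = [(3, 8), (3, 10), (1, 6)]   # masters, workers, infra
cluster2 = [(3, 4), (4, 4), (2, 6)]    # masters, workers, infra

LOGICAL_CPUS = 16  # as reported for the LPAR


def total_vcpus(nodes):
    """Sum the vCPUs defined for a list of (count, vcpus_per_guest) pairs."""
    return sum(count * vcpus for count, vcpus in nodes)


def overcommit_ratio(vcpus, logical_cpus=LOGICAL_CPUS):
    """Configured virtual CPUs per logical CPU."""
    return vcpus / logical_cpus


all_vcpus = total_vcpus(cluster1) + total_vcpus(cluster2)
print(f"vCPUs defined: {all_vcpus}, "
      f"ratio to logical CPUs: {overcommit_ratio(all_vcpus):.2f}:1")
```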
Problem:
With "ES" OFF on both clusters, the batch time of APP1 is ~600 seconds.
With "ES" ON on both clusters, batch time is ~1200 seconds.
Symptoms:
- high CPU steal on zLinux nodes (top) with elasticsearch active
- bad network response (git clone, downloading images)
- CPU steal drops if we shut down elasticsearch
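
To put a number on the steal seen in top, it can be sampled directly from
/proc/stat on a node. A minimal sketch, assuming the standard Linux layout of
the aggregate "cpu" line (user nice system idle iowait irq softirq steal
guest guest_nice); the function names are hypothetical.

```python
import time


def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    # steal is the 8th counter; very old kernels may not report it
    steal = values[7] if len(values) > 7 else 0
    # guest and guest_nice are already counted in user and nice, so only
    # the first eight counters are summed
    total = sum(values[:8])
    return steal, total


def steal_percent(before, after):
    """Steal share of all CPU time between two (steal, total) samples."""
    d_steal = after[0] - before[0]
    d_total = after[1] - before[1]
    return 100.0 * d_steal / d_total if d_total > 0 else 0.0


def measure_steal(interval=5.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return steal_percent(first, second)
```

Calling measure_steal(5) on an Infra node with "ES" ON and again with "ES"
OFF would give two figures that can be compared directly.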
With "ES" ON: z/VM perfkit shows LPAR CPU at ~60%. CEC IFL usage is 40%.
Where do you expect the bottleneck to be, and what is causing the high CPU
steal on the zLinux nodes?
Some more info: there is a Fluentd pod running on every cluster node, and it
is constantly sending log data (quite large amounts) to the Infra node
(elasticsearch).
IBM gave a tip that CPU steal is accounted to a zLinux guest while the
VSWITCH is processing network requests for that guest. If so, how can we
solve this?
- run the guests on a directly attached OSA?
- split the nodes across different VSWITCHes? (currently all nodes + DB are
running on 1 VSWITCH, same VLAN)
- ?
The application we are using for testing is split into 6 microservices
(processes). PROC1 reads a file and inserts the data into the DB. The CPU
time (top) accounted to PROC1 while a file is processed is 10 sec, but the
real time we have to wait for this work to finish is 60-300 seconds
(depending on whether ES is running).
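
The gap between those two numbers can be expressed as the fraction of
elapsed time PROC1 actually spends on a CPU. A small sketch using only the
figures above (10 sec CPU time, 60 to 300 sec elapsed); the function name is
hypothetical, and the remainder is simply "not running", which may be steal,
network, DB or I/O wait and cannot be split further from these numbers.

```python
def on_cpu_fraction(cpu_seconds, elapsed_seconds):
    """Fraction of wall-clock time a process was accounted CPU time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds / elapsed_seconds


# Best and worst case reported for PROC1
for elapsed in (60, 300):
    frac = on_cpu_fraction(10, elapsed)
    print(f"elapsed {elapsed:3d} s: on CPU {frac:.1%}, "
          f"waiting {1 - frac:.1%}")
```

In both cases PROC1 is on a CPU for well under a fifth of the elapsed time
(roughly 17% and 3%).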
So far everyone is just saying "you need more IFLs". But why do I need more
IFLs if I'm using only 40% of the CEC IFL capacity?
Thank you,
Mariusz
----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www2.marist.edu/htbin/wlvindex?LINUX-390