Re: Elasticsearch and Openshift on zVM - Suffering from CPU steal ?

barton Sat, 06 Feb 2021 05:22:42 -0800

If you collect one minute of CP MONITOR data, I would love to analyze it for you. I have the best CP performance analysis tools at my disposal...


On 2/6/2021 3:44 AM, Mariusz Walczak wrote:

Hello,




I hope someone can share their experience or shed some light on the problem. I
will refer to elasticsearch as "ES" in this email.

We are running 2 Openshift clusters on 1 zVM LPAR (16 logical CPUs, SMT2).



Cluster 1 (development workload):

3x Master node (each zLinux 8 vCPU) (VSWITCH-1 VLAN 1)

3x Worker node (each 10 vCPU) (VSWITCH-1 VLAN 1)

1x Infra node (6 vCPU) (VSWITCH-1 VLAN 1) ("ES" ON) (high CPU use)



Cluster 2 (no workload, just "ES" ON):

3x Master node (each zLinux 4 vCPU) (VSWITCH-2 VLAN 2)

4x Worker node (each 4 vCPU) (VSWITCH-2 VLAN 2)

2x Infra node (6 vCPU) (VSWITCH-2 VLAN 2) ("ES" ON on each) (high CPU use)
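As a rough cross-check of the node listings above, the defined vCPUs can be added up and compared with the 16 logical CPUs of the LPAR. This is only a sketch: it assumes the "16 logical CPU" figure is what z/VM dispatches on, and the dictionary and function names are illustrative. A high ratio does not prove contention on its own, since idle vCPUs cost little, but it shows how many virtual CPUs could compete for each logical CPU when the guests are busy.

```python
# vCPU layout copied from the cluster listings above: role -> (node count, vCPUs per node).
CLUSTER_1 = {"master": (3, 8), "worker": (3, 10), "infra": (1, 6)}
CLUSTER_2 = {"master": (3, 4), "worker": (4, 4), "infra": (2, 6)}

# Assumption: the 16 logical CPUs quoted for the LPAR are the dispatch targets.
LOGICAL_CPUS = 16


def total_vcpus(cluster):
    """Sum the vCPUs defined across all nodes of one cluster."""
    return sum(count * vcpus for count, vcpus in cluster.values())


def overcommit_ratio(clusters, logical_cpus):
    """Defined vCPUs per logical CPU across all clusters."""
    return sum(total_vcpus(c) for c in clusters) / logical_cpus


if __name__ == "__main__":
    print("cluster 1 vCPUs:", total_vcpus(CLUSTER_1))
    print("cluster 2 vCPUs:", total_vcpus(CLUSTER_2))
    print("vCPU : logical CPU =", overcommit_ratio([CLUSTER_1, CLUSTER_2], LOGICAL_CPUS))
```

With the numbers from this email, that is 60 + 40 = 100 vCPUs on 16 logical CPUs, before counting the DB guest or any other guests in the LPAR.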



Problem:

With "ES" OFF on both clusters, the batch time of APP1 is ~600 seconds.

With "ES" ON on both clusters, batch time is ~1200 seconds.



Symptoms:

- high cpu steal on zLinux nodes (TOP) with elasticsearch active

- bad network response (git clone, downloading images)

- CPU steal drops if we shut down elasticsearch
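To put a number on the steal seen in top, the same counter can be read from the aggregate "cpu" line of /proc/stat, where steal is the eighth value after the label. The sketch below is a minimal illustration, not a replacement for CP MONITOR data: the function names are made up for this example, and it assumes a kernel that reports the steal field.

```python
def parse_cpu_line(line):
    """Return the numeric counters from an aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Steal share (in percent) of all CPU time between two samples.

    Counter order: user nice system idle iowait irq softirq steal [guest guest_nice].
    Guest time is already included in user/nice, so only the first 8 are summed.
    """
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line from a Linux guest (first line of the file)."""
    with open(path) as handle:
        return handle.readline()
```

On a node, call read_cpu_line() twice a few seconds apart and pass both lines to steal_percent(); comparing the result with "ES" ON and OFF gives the same picture as top, but as a value that can be logged over the whole batch run.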


With "ES" ON: zVM perfkit shows LPAR CPU at ~60%. CEC IFL usage is 40%.

Where would you expect the bottleneck to be, and what is causing the high CPU
steal on zLinux nodes?

Some more info: there is a Fluentd pod running on every cluster node, and it
is constantly sending log data (quite large amounts) to the Infra node
(elasticsearch).



IBM gave a tip that CPU steal is accounted to a zLinux guest when the VSWITCH
is processing network requests for that guest. If so, how can we solve this?

- run guests on Direct Attached OSA?

- split nodes across different VSWITCHes? (currently all nodes + DB are
running on 1 VSWITCH, same VLAN)

- ?


The application we are using for testing is split into 6 microservices
(processes). PROC1 reads a file and inserts the data into the DB. I saw that
the CPU time (top) accounted to PROC1 while a file is processed is 10 sec, but
the real time we have to wait for this work to finish is 60-300 seconds
(depending on whether ES is running).
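The gap between those two figures can be expressed as the share of elapsed time PROC1 spends off the CPU. The helper below is a small sketch with a made-up name; it cannot tell steal apart from waiting on the DB, the network, or disk I/O, it only shows how large the non-CPU share is.

```python
def off_cpu_fraction(cpu_seconds, wall_seconds):
    """Fraction of elapsed time in which the process was not accumulating CPU time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    # Clamp at 0: a multi-threaded process can accumulate more CPU time than wall time.
    return max(0.0, 1.0 - cpu_seconds / wall_seconds)


if __name__ == "__main__":
    # 10 s of CPU time against the 60 s and 300 s elapsed times quoted above.
    for wall in (60, 300):
        print(f"{wall} s elapsed: {off_cpu_fraction(10, wall):.1%} off CPU")
```

For 10 seconds of CPU time, that is roughly 83% of a 60-second run and roughly 97% of a 300-second run spent waiting rather than computing.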



So far everyone is just saying "you need more IFLs". But why do I need more
IFLs if I'm using 40% of CEC IFL capacity?





Thank you,

Mariusz

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www2.marist.edu/htbin/wlvindex?LINUX-390

