> > Using iohist we found out that GPFS is overloading one dm-device (it
> > took about 500 ms to finish I/Os). We replaced the "problematic"
> > dm-device (as we have enough drives to play with) with a new one, but
> > the overloading issue just jumped to another dm-device. We believe
> > that this behaviour is caused by GPFS, but we are unable to locate
> > the root cause of it.
>
> Hello,
> this behaviour could be caused by an asymmetry in the data paths of your
> storage; a relatively small imbalance can make the request queue of a
> slightly slower disk grow seemingly disproportionately.

This problem is a real "blast from the past" for me. I saw similar behaviour a 
LONG time ago, and I think you may well have a "preferred paths" issue between 
your NSD servers and your target drives. If you are running Scale against a 
storage system which has multiple paths to the device, and multiple Scale NSD 
servers can see the same LUN (which is the right thing to do for 
availability), then in some cases you can get exactly this sort of behaviour.

I am guessing you are running a "LAN-free" architecture, with many servers 
doing direct I/O to the NSDs/LUNs, rather than Scale Client -> Scale NSD 
server -> NSD/LUN.

I'll bet you see low I/O rates and long latencies to the "problem" 
NSD/LUN/drive.

The 500 ms I/O delay can come from the target NSD/LUN being switched from 
being "owned" by one of the controllers in the storage system to the other.

I can't see how Scale can do anything to make a device take 500 ms to complete 
an I/O when tracked by iohist at the OS level. You are clearly not able to 
drive a lot of throughput to the devices, so it can't be that device 
overloading is causing a long queue on the device. Something else is 
happening: not at Scale, not at the device, but somewhere in whatever network 
or SAN sits between the Scale NSD server and the NSD device. Something is 
trying to do recovery.
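
One way to check where the delay really sits (this is just a sketch, and the 
device names are examples, not yours): run Scale's iohist and an OS-level view 
side by side on the affected NSD server while the slow I/Os are happening.

    # On the NSD server, while the problem is occurring:
    mmdiag --iohist     # Scale's view: per-I/O service times for each NSD
    iostat -x 2         # OS view: watch the await column for the dm device
    multipath -ll       # which path group / controller is active for each LUN

If iostat shows roughly the same 500 ms await on the dm device that iohist 
shows, the delay really is below Scale, somewhere in the multipath/SAN/storage 
layer.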

Say one of your Scale NSD servers sends an I/O to a target NSD and it goes to 
Storage controller A. Then another Scale NSD server sends an I/O to the same 
target NSD and, instead of going via a path that leads to Storage controller 
A, it goes to Storage controller B. At that point the storage system says "Oh, 
it looks like future I/O will be coming into Storage controller B, let's 
switch the internal ownership to B. OK, we need to flush write caches and do 
some other things. That will take about 500 ms."

Then an I/O goes to Storage controller A, and you get another switch of the 
LUN back from B to A. Another 500 ms.

The drive is being "ping-ponged" from one Storage System controller to the 
other, because I/Os for that drive keep arriving at one controller or the 
other more or less at random.
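
You can often see this from the host side. As a rough check (not 
array-specific, and "mpathX" is just an example name), watch which path group 
dm-multipath currently has active for the suspect LUN, on more than one NSD 
server at the same time:

    # Use the multipath device that sits behind the slow NSD
    watch -n 1 'multipath -ll mpathX'

If ownership is bouncing between controllers, you may see the path group 
marked status=active (or the path priorities) keep changing while the slow 
I/Os are happening.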

You need to make sure that all NSD servers access each LUN using the same 
default path, to the same Storage System controller. There is a way to do 
this: choose a "preferred path" which is always used unless that path is down. 
Could it be that some servers can't use the "preferred path"?
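
Exactly how you pin the preferred path depends on your array and your 
multipath driver, so treat the following as a sketch of the dm-multipath way 
of doing it; the vendor/product strings are placeholders, and your storage 
vendor's documented multipath.conf settings should take precedence.

    # /etc/multipath.conf fragment - illustrative only
    devices {
        device {
            vendor                "EXAMPLE"        # placeholder vendor string
            product               "EXAMPLE-LUN"    # placeholder product string
            path_grouping_policy  "group_by_prio"  # group paths by (ALUA) priority
            prio                  "alua"           # prefer paths to the owning controller
            failback              "immediate"      # return to the preferred group when it recovers
        }
    }

The point is that every NSD server ends up preferring paths to the same 
controller for a given LUN. mmlsnsd -m will show you which local dm device 
each NSD maps to on each server, so you can check the multipath state of the 
same LUN everywhere.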

This will probably only happen if you have something running on the actual 
Scale NSD servers that is accessing the filesystem; otherwise Scale clients 
will always go across the Scale network to the current primary NSD server to 
get to an NSD.
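
If you want to confirm which nodes are actually doing local (SAN) I/O to the 
NSDs rather than going through an NSD server, something like this should show 
it (output details vary a bit between Scale releases):

    # Shows, per disk of the filesystem, which node the I/O is performed on;
    # "localhost" means direct SAN access from the node you run it on
    mmlsdisk <filesystem> -m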

Or there is some other problem causing a "ping pong" effect. But it sounds like 
a "ping pong" to me, especially because when you replaced the dm device the 
problem moved elsewhere.

Regards,

Indulis Bernsteins
Storage Architect, IBM Worldwide Technical Sales
