Thanks very much for the suggestions. dmesg output is here:
http://pastebin.com/jCafCZiZ
We don't see anything disk-related there, and our GUI shows all the RAID arrays as healthy.

If anything in there jumps out at you, I'd really appreciate your thoughts! We are almost certainly going to reboot the affected OSS later today to see how that goes.

We're a fairly small team (12 people or so), so I have a good feel for what everyone is doing, and they shouldn't be abusing it too badly... We did recently ask people to delete any small files they may have. Do you think deleting a lot of small files could trigger such issues? Thanks again!

-lewis


On 9/20/16 12:29 PM, Joe Landman wrote:
On 09/20/2016 12:21 PM, Lewis Hyatt wrote:

We do not know if it's related, but this same OSS is in a very bad
state, with very high load average (200), very high I/O wait time, and
taking many seconds to respond to each read request, making the array
more or less unusable. That's the problem we are trying to fix.

This sounds like a storage system failure.  Queuing up enough I/Os to drive the load
to 200 usually means something is broken at a lower level elsewhere in the stack.
Not always ... sometimes you have users who like to write several million (or billion)
small ( < 100 byte ) files.

What does dmesg report?  Try putting it in a pastebin/gist and pointing the list to it.

Things that come to mind are:

a) An offlined RAID (most likely): this would explain the load you're seeing, and all
sorts of strange messages about block devices and file systems in the logs.

b) A user DoS against the storage: usually someone writing many tiny files (a quick way to check for this is sketched after this list).

There are other possibilities, but these seem more likely.
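
If you want to rule out the tiny-file scenario, here is a rough sketch that counts files smaller than 100 bytes under a directory tree. It is only an illustration: the /mnt/lustre path is a placeholder for your client-side mount point, the 100-byte cutoff just mirrors the rough threshold above, and a full walk of a large file system will be slow (lfs find can do a similar scan more efficiently on Lustre).

    #!/usr/bin/env python
    # Rough sketch: count files smaller than 100 bytes under a tree.
    # Run from a Lustre client; /mnt/lustre below is only a placeholder.
    import os
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/lustre"
    tiny = 0
    total = 0

    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file may have disappeared mid-scan
            total += 1
            if size < 100:
                tiny += 1

    print("%d of %d files are smaller than 100 bytes" % (tiny, total))

If the count of tiny files comes back in the millions, option b) starts to look a lot more plausible than a hardware problem.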



