Thanks very much for the suggestions. dmesg output is here:
We don't see any disk-related errors there, and our GUI also shows all the RAID
arrays as healthy.
If anything in there jumps out at you, I'd really appreciate your thoughts! We
are almost certainly going to reboot the affected OSS later today to see how it behaves.
We're a fairly small team (12 people or so), so I have a good feel for what everyone
is doing, and they should not be abusing it too badly... We did recently ask
people to delete any small files they may have. Do you think deleting a lot of
small files could trigger issues like this? Thanks again!
On 9/20/16 12:29 PM, Joe Landman wrote:
On 09/20/2016 12:21 PM, Lewis Hyatt wrote:
We do not know if it's related, but this same OSS is in a very bad
state, with very high load average (200), very high I/O wait time, and
taking many seconds to respond to each read request, making the array
more or less unusable. That's the problem we are trying to fix.
This sounds like a storage system failure. IOs queuing up and driving the load
to 200 usually means something is broken elsewhere in the stack, at a lower
level. Not always ... sometimes you have users who like to write several
million/billion tiny (<100-byte) files.
What does dmesg report? Try to do a pastebin/gist of it, and point the list to it.
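A minimal sketch of commands for gathering those diagnostics on the OSS (this assumes Linux md software RAID; a hardware RAID controller would need its vendor CLI instead):

```shell
# Quick triage on the affected OSS (run as root for full dmesg access).
dmesg | tail -n 100                                  # recent kernel messages: I/O errors, SCSI timeouts/resets
cat /proc/mdstat 2>/dev/null || echo "no md arrays"  # degraded or rebuilding md arrays show up here
uptime                                               # confirm the load average
iostat -x 1 3 2>/dev/null || echo "iostat missing (install sysstat)"  # per-device %util and await
```

A degraded md array appears in /proc/mdstat as e.g. `[U_]` instead of `[UU]`; consistently high await/%util on one device in the iostat output points at a failing disk even when the RAID still reports itself as clean.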
Things that come to mind are:
a) An offlined RAID (most likely): this would explain the load, and you'd see all
sorts of strange messages about block devices and file systems in the logs.
b) A user DoS against the storage: usually someone writing many tiny files.
There are other possibilities, but these seem more likely.
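If the tiny-files theory seems worth checking, one quick gauge is to count sub-100-byte files under a suspect directory. A sketch (the default path is hypothetical; substitute a real directory on the affected storage):

```shell
# Count files smaller than 100 bytes under a directory.
# find's "-size -100c" matches files strictly smaller than 100 bytes (c = bytes).
dir=${1:-/mnt/lustre/scratch}   # hypothetical path; pass the real directory as $1
count=$(find "$dir" -xdev -type f -size -100c 2>/dev/null | wc -l)
echo "files under 100 bytes in $dir: $count"
```

Note that on Lustre, a full recursive find is itself a heavy metadata operation, so it is best run during a quiet period or against a subtree rather than the whole filesystem.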
lustre-discuss mailing list