Thanks very much for the suggestions. dmesg output is here:
http://pastebin.com/jCafCZiZ
We don't see anything disk-related there, and our GUI shows all the RAID arrays as healthy.

If anything in there jumps out at you, I'd really appreciate your thoughts! We are almost certainly going to reboot the affected OSS later today to see how that goes.

We're a fairly small team (12 people or so), so I have a good feel for what everyone is doing, and they shouldn't be abusing it too badly... We did recently ask people to delete any small files they may have. Do you think deleting a lot of small files could trigger such issues? Thanks again!

-lewis


On 9/20/16 12:29 PM, Joe Landman wrote:
On 09/20/2016 12:21 PM, Lewis Hyatt wrote:

We do not know if it's related, but this same OSS is in a very bad
state, with very high load average (200), very high I/O wait time, and
taking many seconds to respond to each read request, making the array
more or less unusable. That's the problem we are trying to fix.

This sounds like a storage system failure.  Queuing up enough I/Os to drive the load
to 200 usually means something is broken at a lower level elsewhere in the stack.
Not always ... sometimes you have users who like to write several million (or billion)
small ( < 100 byte ) files.

What does dmesg report?  Try putting it in a pastebin/gist and pointing the list to it.

Things that come to mind are:

a) An offlined RAID (most likely): this would explain the load you're seeing, and all
sorts of strange messages about block devices and file systems in the logs.

b) A user DoS against the storage: usually someone writing many tiny files (a quick way to check for this is sketched after this list).

There are other possibilities, but these seem more likely.
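
If you want to rule out the tiny-file scenario, here is a rough sketch that counts files smaller than 100 bytes under a directory tree. It is only an illustration: the /mnt/lustre path is a placeholder for your client-side mount point, the 100-byte cutoff just mirrors the rough threshold above, and a full walk of a large file system will be slow (lfs find can do a similar scan more efficiently on Lustre).

    #!/usr/bin/env python
    # Rough sketch: count files smaller than 100 bytes under a tree.
    # Run from a Lustre client; /mnt/lustre below is only a placeholder.
    import os
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/lustre"
    tiny = 0
    total = 0

    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file may have disappeared mid-scan
            total += 1
            if size < 100:
                tiny += 1

    print("%d of %d files are smaller than 100 bytes" % (tiny, total))

If the count of tiny files comes back in the millions, option b) starts to look a lot more plausible than a hardware problem.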



