On 09/20/2016 01:39 PM, Lewis Hyatt wrote:
> Thanks very much for the suggestions. dmesg output is here:
> http://pastebin.com/jCafCZiZ
> We don't see any disk-related stuff there, and also our GUI shows all
> the RAID arrays as being fine.


Hmmm ... I rarely trust GUIs for RAID status. Do you have underlying CLI tools you can use for a sanity check?
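
As one example of such a check: if the arrays happen to be Linux software RAID (md), the kernel's own view is in /proc/mdstat. That is an assumption, since the thread does not say what sits under the OSTs; hardware RAID would need the vendor's CLI instead. A minimal sketch that lists arrays with a missing member (an underscore in the [UU_] status field). The function name is made up for illustration.

```shell
# Assumption: Linux md software RAID. Hardware RAID controllers do not
# appear in /proc/mdstat and need the vendor's own CLI.
# Prints the name of every md array whose member status (e.g. [U_])
# shows a missing device. Pass a file to parse a saved copy instead.
degraded_md_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}
```

Calling degraded_md_arrays with no arguments reads the live /proc/mdstat; no output means no array reports a missing member.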

> If anything in there jumps out at you, I'd really appreciate your
> thoughts! We are almost certainly going to reboot the affected OSS later
> today to see how that goes.

I'm not seeing anything leap out, other than that two particular targets, twlstr-OST000b and twlstr-OST0006, appear to be "slow". That appears to be what is causing the client evictions, lock bits, etc.

The question is: why are these two OSTs slow? What is the underlying RAID, how many operations are queued up, etc.?
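
For the "how many operations are queued up" part, one rough way to look on Linux is the in-flight I/O counter in /proc/diskstats (the 12th whitespace-separated field on each line). A minimal sketch; the function name is made up for illustration, and mapping the device names back to the two OSTs is left to whoever knows the layout.

```shell
# Prints "device in_flight" for every block device that currently has
# I/O requests in progress. Field 3 is the device name and field 12 is
# the number of I/Os currently in progress.
# Reads /proc/diskstats by default; pass a file to parse a saved copy.
inflight_io() {
    awk '$12 > 0 { print $3, $12 }' "${1:-/proc/diskstats}"
}
```

Running inflight_io a few times in a row gives a feel for whether the devices behind the slow OSTs are sitting on a persistently deep queue.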

A tool we recommend for a (nearly instantaneous) holistic view of a system is glances, which you can install via pip:

        pip install glances

then run it as

        glances -t 1

to get a second-by-second view of your system. dstat is also good.

Dumb question ... what does

        swapon -s

report? I am assuming you aren't swapping (and don't have swap enabled on the system), but it never hurts to ask.
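
The same information is available in /proc/swaps, which has a header line followed by one line per swap area, with the Used column in KiB. A small sketch that totals it, so a result of 0 means nothing is swapped out. The function name is made up for illustration.

```shell
# Sums the "Used" column (4th field, in KiB) of /proc/swaps, skipping
# the header line. Prints 0 when no swap area is configured.
# Pass a file to parse a saved copy instead of the live system.
swap_used_kib() {
    awk 'NR > 1 { used += $4 } END { print used + 0 }' "${1:-/proc/swaps}"
}
```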

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
e: land...@scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
