Thank you all for your help. The machine had panicked several times
since then, but the issue went away once I separated the client from
the server. A round of "I told you so" was in order for my higher-ups
;-) because I recommended against this setup in the first place...but
I did what I was told.
-Aaron
Jean-Marc Saffroy wrote:
On Tue, 21 Nov 2006, Aaron Knister wrote:
I have a machine with 20 TB of local storage (internal SATA drives).
It has three 6.7 TB arrays, each of which is configured as an OST and
part of a larger LOV. I'll call this "machine B". Machine B has 16
GB of memory. Not only does it act as an OSS, but it also has this
single large LOV mounted, and I/O-intensive jobs run on this system
reading/writing heavily to this single LOV.
If machine B acts both as client and server for the same Lustre
filesystem, you may run into recovery problems in case of a crash. CFS
recommends against using such configurations. I'm not sure how
dangerous this is, however.
The LOV is also mounted on another system that we'll call "Machine
A". Machine A does not act as an OSS and is not serving out any disk
via Lustre. Last night I noticed that any of my jobs running on
"Machine B" that were reading/writing to this LOV were getting only
about 60% of the CPU, while the remaining 40% was being spent in
"system" time. I got those numbers from iostat. Note that the EXACT
same jobs run on "Machine A" were running at 100% CPU.
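
For reference, iostat's avg-cpu numbers are derived from /proc/stat,
so the user/system split can be sampled directly. A minimal sketch in
Python, assuming the usual Linux 2.6 field order (user, nice, system,
idle, ...); the 5-second interval is arbitrary:

import time

def read_cpu():
    # First line of /proc/stat: "cpu user nice system idle iowait ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

before = read_cpu()
time.sleep(5)                # measurement interval
after = read_cpu()
delta = [b - a for a, b in zip(before, after)]
total = float(sum(delta))
user = delta[0] + delta[1]   # user + nice time
system = delta[2]            # time spent in the kernel
print("user %.1f%%  system %.1f%%"
      % (100 * user / total, 100 * system / total))

Run on both machines while the job is going; a large system share on
machine B but not machine A points at kernel-side work (Lustre
service threads, fault handling) rather than the job itself.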
Can you describe the load generated by your job on Lustre? Metadata
intensive programs can generate a lot of activity on Lustre servers.
Also check that Lustre debugging is set to zero (I suspect many
innocent users get caught by this setting).
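
A quick way to check is to read the debug mask out of proc. A small
sketch; note that the path is version-dependent (/proc/sys/lnet/debug
on newer Lustre releases, /proc/sys/portals/debug on older ones), and
clearing it requires root:

import os

CANDIDATES = ["/proc/sys/lnet/debug", "/proc/sys/portals/debug"]

for path in CANDIDATES:
    if os.path.exists(path):
        mask = open(path).read().strip()
        print("%s = %s" % (path, mask))
        if mask != "0":
            print("debugging is enabled; write 0 here (as root) to disable it")
            # open(path, "w").write("0\n")  # uncomment to actually clear it
        break
else:
    print("no Lustre debug proc file found; is Lustre loaded?")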
I couldn't figure out which system calls were eating up 40 percent of
the total CPU. I stumbled across an article describing how, on large
memory systems, if the page size isn't set right, the system can end
up spending more time swapping out pages than actually doing user
work, which would account for this 40% system usage. (I don't
understand this.)
I guess this was not really about swapping, but rather about page
fault handling. :) You could probably get a rough idea of what's
going on with a kernel profiling tool such as oprofile.
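
One cheap check before reaching for oprofile is to watch the fault
counters in /proc/vmstat (2.6 kernels): pgfault counts all page
faults, pgmajfault only the major (disk-backed) ones. A rough sketch:

import time

def fault_counters():
    # Each /proc/vmstat line is "name value"
    counters = {}
    for line in open("/proc/vmstat"):
        key, value = line.split()
        counters[key] = int(value)
    return counters.get("pgfault", 0), counters.get("pgmajfault", 0)

interval = 5
all1, major1 = fault_counters()
time.sleep(interval)
all2, major2 = fault_counters()
print("faults/s: %.0f  major faults/s: %.0f"
      % ((all2 - all1) / float(interval),
         (major2 - major1) / float(interval)))

A high minor-fault rate with near-zero major faults would fit the
"page fault handling, not swapping" theory above.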
[snip]
I've attached the rest of it as a text file.
The panic message states it's an MCE (Machine Check Exception);
usually this indicates a hardware problem (memory, CPU, etc.).