Thank you all for your help. The machine panicked several times since then, but the issue went away when I moved the client off the server. A round of "I told you so" was in order for my higher-ups ;-) since I had recommended against this setup in the first place...but I did what I was told.

-Aaron

Jean-Marc Saffroy wrote:
On Tue, 21 Nov 2006, Aaron Knister wrote:

I have a machine with 20TB of local storage (internal SATA drives). It has three 6.7TB arrays, each of which is configured as an OST and part of a larger LOV. I'll call this "machine B". Machine B has 16GB of memory. Not only does it act as an OSS, but it also has this single large LOV mounted, and I/O-intensive jobs run on this system reading/writing heavily from this single LOV.

If machine B acts both as client and server for the same Lustre filesystem, you may run into recovery problems in case of a crash: the node's own client dies with it and so cannot take part in recovery, and under memory pressure a client that has to flush dirty pages to an OST on the same node can deadlock. CFS recommends against using such configurations. I'm not sure how dangerous this is in practice, however.

The LOV is also mounted on another system that we'll call "Machine A". Machine A does not act as an OSS and is not serving out any disk via Lustre. Last night I noticed that all of my jobs running on "Machine B" that were reading/writing from this LOV were running at about 60% user CPU, while the remaining 40% was being consumed by "system" time. I got those numbers from iostat. Note that the EXACT same jobs run on "Machine A" were running at 100% user CPU.

Can you describe the load generated by your job on Lustre? Metadata intensive programs can generate a lot of activity on Lustre servers.
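If you are not sure what the mix looks like, a quick (if crude) way to characterize it is a syscall summary; nothing Lustre-specific here, and ./myjob is of course a stand-in for your actual program:

  # -c prints a per-syscall count/time summary, -f follows child processes;
  # a high ratio of stat()/open()/close() to read()/write() suggests a
  # metadata-heavy workload
  strace -c -f ./myjob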

Also check that Lustre debugging is set to zero (I suspect many innocent users get caught by this setting).
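In case it helps, a sketch of how to check and clear the setting, assuming the 1.4/1.6-era /proc interface (older releases exposed it as /proc/sys/portals/debug instead of lnet):

  # show the current debug mask; non-zero means debug logging is enabled
  cat /proc/sys/lnet/debug

  # disable all Lustre debug logging
  echo 0 > /proc/sys/lnet/debug

The mask is per-node, so check the servers as well as the clients.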

I couldn't figure out what system calls were hogging 40 percent of the total CPU. I stumbled across an article describing how, on large-memory systems, if the page size isn't set right the system can end up spending more time swapping out pages than actually doing user work, which would account for this 40% system usage. (I don't really understand this.)

I guess this was not really about swapping, but rather about page fault handling. :) You could probably get a rough idea of what's going on with a kernel profiling tool such as oprofile.
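For what it's worth, a rough oprofile recipe; the vmlinux path is a guess, adjust it for your distro (an uncompressed kernel image is needed for symbol resolution):

  opcontrol --init                             # load the oprofile kernel module
  opcontrol --vmlinux=/boot/vmlinux-`uname -r` # point it at the uncompressed kernel
  opcontrol --start
  # ... run the job for a minute or two ...
  opcontrol --shutdown
  opreport -l                                  # per-symbol breakdown of where time went

Running vmstat 5 alongside the job is also a cheap first check: if the si/so columns stay at zero, the box is not actually swapping, and the system time is going elsewhere (fault handling, locking, Lustre service threads, ...).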

[snip]
I've attached the rest of it as a text file.

The panic message states it's an MCE (Machine Check Exception); usually this points to a hardware problem (memory, CPU, etc.).
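If this is an x86_64 box, mcelog can usually decode the raw event into something readable (assuming the tool is installed and /dev/mcelog exists):

  # decode pending machine check events; many distros also run this
  # from cron and log to /var/log/mcelog
  mcelog

A pass of memtest86+ is the other obvious check if memory is the suspect.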



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
