Thank you all for your help. The machine had panicked several times
since then, but the issue went away once I separated the client from
the server. A round of "I told you so" was in order for my higher-ups
;-) because I recommended against this setup in the first place...but
I did what I was told.
-Aaron
Jean-Marc Saffroy wrote:
On Tue, 21 Nov 2006, Aaron Knister wrote:
I have a machine with 20 TB of local storage (internal SATA drives).
It has three 6.7 TB arrays, each of which is configured as an OST and
part of a larger LOV. I'll call this "machine B". Machine B has 16
GB of memory. Not only does it act as an OSS, but it also has this
single large LOV mounted, and I/O-intensive jobs run on this system
reading/writing heavily to this single LOV.
If machine B acts both as client and server for the same Lustre
filesystem, you may run into recovery problems in case of a crash. CFS
recommends against using such configurations. I'm not sure how
dangerous this is, however.
The LOV is also mounted on another system that we'll call "Machine
A". Machine A does not act as an OSS and is not serving out any disk
via Lustre. Last night I noticed that any of my jobs running on
"Machine B" that were reading/writing to this LOV were getting only
about 60% of the CPU, while the remaining 40% was being spent in
"system" time. I got those numbers from iostat. Note that the EXACT
same jobs run on "Machine A" were running at 100% CPU.
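
For reference, iostat's avg-cpu numbers are derived from /proc/stat,
so the user/system split can be sampled directly. A minimal sketch in
Python, assuming the usual Linux 2.6 field order (user, nice, system,
idle, ...); the 5-second interval is arbitrary:

import time

def read_cpu():
    # First line of /proc/stat: "cpu user nice system idle iowait ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

before = read_cpu()
time.sleep(5)                # measurement interval
after = read_cpu()
delta = [b - a for a, b in zip(before, after)]
total = float(sum(delta))
user = delta[0] + delta[1]   # user + nice time
system = delta[2]            # time spent in the kernel
print("user %.1f%%  system %.1f%%"
      % (100 * user / total, 100 * system / total))

Run on both machines while the job is going; a large system share on
machine B but not machine A points at kernel-side work (Lustre
service threads, fault handling) rather than the job itself.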
Can you describe the load generated by your job on Lustre? Metadata
intensive programs can generate a lot of activity on Lustre servers.
Also check that Lustre debugging is set to zero (I suspect many
innocent users get caught by this setting).
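
A quick way to check is to read the debug mask out of proc. A small
sketch; note that the path is version-dependent (/proc/sys/lnet/debug
on newer Lustre releases, /proc/sys/portals/debug on older ones), and
clearing it requires root:

import os

CANDIDATES = ["/proc/sys/lnet/debug", "/proc/sys/portals/debug"]

for path in CANDIDATES:
    if os.path.exists(path):
        mask = open(path).read().strip()
        print("%s = %s" % (path, mask))
        if mask != "0":
            print("debugging is enabled; write 0 here (as root) to disable it")
            # open(path, "w").write("0\n")  # uncomment to actually clear it
        break
else:
    print("no Lustre debug proc file found; is Lustre loaded?")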
I couldn't figure out which system calls were eating up 40 percent of
the total CPU. I stumbled across an article describing how, on large
memory systems, if the page size isn't set right, the system can end
up spending more time swapping out pages than actually doing user
work, which would account for this 40% system usage. (I don't
understand this.)
I guess this was not really about swapping, but rather about page
fault handling. :) You could probably get a rough idea of what's
going on with a kernel profiling tool such as oprofile.
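
One cheap check before reaching for oprofile is to watch the fault
counters in /proc/vmstat (2.6 kernels): pgfault counts all page
faults, pgmajfault only the major (disk-backed) ones. A rough sketch:

import time

def fault_counters():
    # Each /proc/vmstat line is "name value"
    counters = {}
    for line in open("/proc/vmstat"):
        key, value = line.split()
        counters[key] = int(value)
    return counters.get("pgfault", 0), counters.get("pgmajfault", 0)

interval = 5
all1, major1 = fault_counters()
time.sleep(interval)
all2, major2 = fault_counters()
print("faults/s: %.0f  major faults/s: %.0f"
      % ((all2 - all1) / float(interval),
         (major2 - major1) / float(interval)))

A high minor-fault rate with near-zero major faults would fit the
"page fault handling, not swapping" theory above.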
[snip]
I've attached the rest of it as a text file.
The panic message states it's an MCE (Machine Check Exception);
usually this indicates a hardware problem (memory, CPU, etc.).