We have about 177 client nodes. I think the crashes happened only on the OSS nodes.

I do not have a screenshot yet. How can I get the crashdump log?
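Would netconsole be a sensible way to catch the panic trace, given that the node reboots itself once kernel.panic fires? Below is a rough, untested sketch of what I have in mind; the interface, UDP port, receiver IP and MAC address are only placeholders for our network. Or is the RHEL4-era netdump/diskdump service the better way to get a proper crashdump?

    # On the OSS that panics: send kernel console messages over UDP
    # (eth0, port 6666, 192.168.1.10 and the MAC below are placeholders)
    modprobe netconsole netconsole=@/eth0,[email protected]/00:11:22:33:44:55

    # On the receiving host: capture whatever arrives on that port
    nc -u -l -p 6666 | tee /var/log/oss-console.log

    # On a Lustre node that is still alive, dump the Lustre debug buffer
    lctl dk /tmp/lustre-debug.log

If that is the wrong approach, please point me at the right one.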
Wojciech Turek wrote:
> Hi,
>
> How many clients (compute nodes) do you have in your cluster? What is
> crashing randomly: the clients, the OSS, the MDS, or maybe all of them?
> Do you have a screenshot of the kernel panic or a crashdump log?
>
> cheers,
>
> Wojciech Turek
>
> On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
>
>> No, I use the stock bnx2 driver from the latest pre-built
>> kernel-lustre-1.6.3.
>>
>> I forgot to mention the oops. It's something about Lustre
>> (lustre_blah_blah_blah something).
>>
>> All the other nodes also use bnx2, and there's no problem at all there.
>>
>> Matt wrote:
>>> Somsak,
>>>
>>> Did you build your own bnx2 driver? I was getting kernel panics when
>>> hitting a certain load with Dell 1950s that also use the bnx2 driver.
>>> My solution was to grab the bnx2 source code and build it under the
>>> Lustre kernel. If you search the mailing list you'll find the mails
>>> dealing with this.
>>>
>>> If you see bnx2 mentioned in your kernel panic output, then it's
>>> probably the cause.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>> On 26/11/2007, Somsak Sriprayoonsakul <[EMAIL PROTECTED]> wrote:
>>>
>>> Hello,
>>>
>>> We have a 4-node Lustre cluster that provides a parallel file system
>>> for our 192-node compute cluster. The Lustre nodes run CentOS 4.5,
>>> x86_64 (Intel 4000 series), on HP DL360 G5 hardware. The cluster that
>>> uses it is ROCKS 4.2.1, on the same set of hardware. Our network is
>>> Gigabit Ethernet, using the bnx2 driver. The Lustre setup is:
>>>
>>> storage-0-0: mgs+mdt, ost0, ost1 (backup)
>>> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
>>> storage-0-2: ost2, ost3 (backup)
>>> storage-0-3: ost2 (backup), ost3
>>>
>>> We're using heartbeat 2.0.8, based on the pre-built RPM from CentOS.
>>> Each backup resource is configured so that it never runs
>>> simultaneously with its primary. Note that we enable flock and quota
>>> on Lustre.
>>>
>>> The problem we have right now is that some of the nodes panic at
>>> random. This happens about once every week or two. We tolerate it
>>> crudely by setting kernel.panic=60 and hoping that the backup node
>>> does not fail within that window; this actually works quite well
>>> (based on user feedback, users do not even notice that the file
>>> system failed). The backup node takes over the OST and runs recovery
>>> for about 250 seconds, then everything goes back to normal.
>>>
>>> Anyway, we're trying to nail down why the file system nodes panic. I
>>> realise the information above will not suffice to track down the
>>> cause. Could someone give me a way to debug or dump some useful
>>> information that I can send to the list for later analysis? Also, is
>>> the "RECOVERING" phase sufficient to make the file system stable, or
>>> do we need to shut down the whole system and run e2fsck+lfsck?
>>>
>>> Also, after every panic the quota that was enabled ends up disabled
>>> (lfs quota <user> /fs yields "No such process"). I have to run
>>> quotaoff and quotaon again. It seems that quota is not turned back on
>>> when the OST comes back up. Is there a way to always turn it on?
>>>
>>> Thank you very much in advance
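To partly answer my own quota question above: the manual workaround I run after each failover is roughly the following (/fs is our client mount point; whether you need -u, -g, or both depends on which quota types are enabled):

    # re-enable quota by hand after the OST has failed over and remounted
    # (/fs and the -ug flags reflect our local setup)
    lfs quotaoff -ug /fs
    lfs quotaon -ug /fs

    # my unverified guess at making quota persistent across OST restarts:
    # set the quota_type parameter on the MDT/OST devices, e.g.
    # tunefs.lustre --param mdt.quota_type=ug <mdt-device>
    # tunefs.lustre --param ost.quota_type=ug <ost-device>

If quota_type is not the right mechanism on 1.6.3, I would appreciate a pointer to the correct one.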
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: [EMAIL PROTECTED]
> tel. +441223763517

--
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
