No, I use the stock bnx2 driver from the latest pre-built lustre-1.6.3 kernel. I forgot to mention the oops: it says something about Lustre (lustre_blah_blah_blah or similar). All the other nodes also use bnx2, and there is no problem there at all.
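Next time it happens I will try to capture the full oops over the network with netconsole, roughly like this (a sketch only; the interface, log-host IP/MAC, and port below are placeholders for our setup):

    # On the Lustre server: forward console output to a log host over UDP
    modprobe netconsole netconsole=@/eth0,[email protected]/00:16:35:aa:bb:cc
    # On the log host: record the stream
    nc -u -l -p 6666 | tee /var/log/netconsole-storage.log

I will also dump the Lustre debug buffer on the surviving nodes with "lctl dk <file>" after the next failover and post the relevant parts here.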
Matt wrote:
> Somsak,
>
> Did you build your own bnx2 driver? I was getting kernel panics when
> hitting a certain load with Dell 1950s that also use the bnx2 driver.
> My solution was to grab the bnx2 source code and build it under the
> Lustre kernel. If you search the mailing list you'll find the mails
> dealing with this.
>
> If you see bnx2 mentioned in your kernel panic output, then it's
> probably the cause.
>
> Thanks,
>
> Matt
>
> On 26/11/2007, Somsak Sriprayoonsakul <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> We have a 4-node Lustre cluster that provides a parallel file system
> for our 192-node compute cluster. The Lustre nodes run CentOS 4.5,
> x86_64 (Intel 4000 series), on HP DL360 G5s. The compute cluster that
> uses it runs ROCKS 4.2.1 on the same set of hardware. Our network is
> Gigabit Ethernet, using the bnx2 driver. The Lustre layout is:
>
> storage-0-0: mgs+mdt, ost0, ost1 (backup)
> storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
> storage-0-2: ost2, ost3 (backup)
> storage-0-3: ost2 (backup), ost3
>
> We're using heartbeat 2.0.8 from the pre-built CentOS RPM. Every
> backup is configured so that it never runs simultaneously with its
> primary. Note that we have flock and quota enabled on Lustre.
>
> The problem we have right now is that some of the nodes randomly
> panic, about once every week or two. We tolerate this crudely by
> setting kernel.panic=60 and hoping that the backup node does not also
> fail within that window. This is working quite well; based on user
> feedback, users do not even notice that the file system has failed.
> The backup node takes over the OSTs, recovers for about 250 seconds,
> and then everything is back to normal.
>
> Anyway, we're trying to nail down why the file system panics. I
> believe the information above will not suffice to track down the
> cause. Could someone give me a way to debug this, or to dump useful
> information that I can send to the list for later analysis? Also, is
> the "RECOVERING" phase sufficient to leave the file system stable, or
> do we need to shut down the whole system and run e2fsck+lfsck?
>
> Also, after every panic, the quota that was enabled ends up disabled
> (lfs quota <user> /fs yields "No such process"). I have to run
> quotaoff and then quotaon again. It seems that quota is not turned
> back on when an OST boots up. Is there a way to always keep it on?
>
> Thank you very much in advance

--
-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul

Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]
-----------------------------------------------------------------------------------
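P.S. For reference, the manual workaround we currently run for the quota problem after each failover is simply to cycle quota on the client mount point (a sketch of what we do; /fs is our mount point, -ug assumes both user and group quotas, and "someuser" is a placeholder):

    # Quota comes back disabled after recovery; turn it off and on again
    lfs quotaoff -ug /fs
    lfs quotaon -ug /fs
    # Verify that quota answers again instead of "No such process"
    lfs quota -u someuser /fs

It would obviously be better if quota were re-enabled automatically when an OST restarts.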
