Hi,

How many clients (compute nodes) do you have in your cluster? What is crashing randomly: the clients, the OSSs, the MDS, or all of them?
Do you have a screenshot of the kernel panic or a crash dump log?
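If you don't have a dump yet, CentOS 4 ships the netdump service for capturing panic output over the network. A rough sketch of the client side, assuming a netdump server is already running at 10.1.1.1 (a hypothetical address; the package and config file names are the stock RHEL4 ones):

    # point the node at the netdump server (hypothetical IP)
    echo 'NETDUMPADDR=10.1.1.1' >> /etc/sysconfig/netdump
    chkconfig netdump on
    service netdump start
    # crash dumps then land under /var/crash on the server

Even a serial console log of the panic trace would be useful.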

cheers,

Wojciech Turek
On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:

No, I use the stock bnx2 driver from the latest pre-built kernel-lustre-1.6.3.

I forgot to mention the oops: it's something about Lustre
(lustre_blah_blah_blah something).

All the other nodes also use bnx2, and they have no problems at all.

Matt wrote:
Somsak,

Did you build your own bnx2 driver? I was getting kernel panics when
hitting a certain load with Dell 1950s that also use the bnx2 driver.
My solution was to grab the bnx2 source code and build it under the
Lustre kernel.  If you search the mailing list you'll find the mails
dealing with this.
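
(For what it's worth, the rebuild is just a standard out-of-tree module
build against the Lustre kernel headers. A rough sketch, assuming the
Broadcom bnx2 source is unpacked in the current directory and the
matching kernel-lustre-devel package is installed:

    # build and install the module against the running Lustre-patched kernel
    make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
    make -C /lib/modules/$(uname -r)/build M=$(pwd) modules_install
    depmod -a
    # reload from the console, not over the network
    rmmod bnx2 && modprobe bnx2

The exact Broadcom makefile may differ; check the README in their tarball.)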

If you see bnx2 mentioned in your kernel panic output, then it's
probably the cause.

Thanks,

Matt

On 26/11/2007, Somsak Sriprayoonsakul <[EMAIL PROTECTED]> wrote:

    Hello,

        We have a 4-node Lustre cluster that provides a parallel file
    system for our 192-node compute cluster. The Lustre nodes run CentOS 4.5,
    x86_64 (Intel 4000 series), on HP DL360 G5 machines. The compute cluster
    that uses it runs ROCKS 4.2.1 on the same set of hardware. Our network is
    Gigabit Ethernet, using the bnx2 driver. The Lustre setup is:

    storage-0-0: mgs+mdt, ost0, ost1 (backup)
    storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
    storage-0-2: ost2, ost3 (backup)
    storage-0-3: ost2 (backup), ost3

    We're using Heartbeat 2.0.8, based on the pre-built RPM from CentOS. Each
    backup is configured so that it will never run simultaneously with its
    primary. Note that we enable flock and quota on Lustre.
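
    In case it matters, the layout boils down to formatting each target with
    a --failnode pointing at its partner and mounting clients with flock. A
    rough sketch (device names and NIDs here are made up, not our real ones):

        # combined MGS+MDT on storage-0-0, failover to storage-0-1
        mkfs.lustre --fsname=fs --mgs --mdt --failnode=storage-0-1@tcp /dev/sda1
        # ost0 on storage-0-0, failover to storage-0-1
        mkfs.lustre --fsname=fs --ost --mgsnode=storage-0-0@tcp \
            --failnode=storage-0-1@tcp /dev/sdb1
        # clients mount with flock enabled
        mount -t lustre storage-0-0@tcp:/fs /fs -o flock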

    The problem we have right now is that some of the nodes panic at random.
    This happens about once every week or two. We tolerate this crudely by
    setting kernel.panic=60 (so a panicked node reboots itself) and hoping
    that the backup node does not also fail within that window. This is
    working quite well; based on user feedback, they do not even know that
    the file system failed. The backup node takes over the OSTs and does
    recovery for about 250 secs, then everything is back to normal.
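
    For reference, the workaround is just the standard sysctl:

        # reboot automatically 60 seconds after a panic
        sysctl -w kernel.panic=60
        # persist the setting across reboots
        echo 'kernel.panic = 60' >> /etc/sysctl.conf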

        Anyway, we're trying to nail down the reason why the nodes panic. I
    believe the information above will not suffice to track down the cause.
    Could someone suggest a way to debug or dump some useful information that
    I can send to the list for later analysis? Also, does the "RECOVERING"
    phase suffice to keep the file system stable, or do we need to shut down
    the whole system and do e2fsck+lfsck?
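
    If it helps, I can capture the Lustre debug buffer after the next
    incident. A sketch of what I would run, assuming the 1.6 lctl/proc
    interface (the output path is just an example):

        # widen the debug mask (very verbose)
        sysctl -w lnet.debug=-1
        # after an incident, dump the in-kernel debug buffer to a file
        lctl dk /tmp/lustre-debug.log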

        Also, after every panic, the quota that was enabled ends up disabled
    (lfs quota <user> /fs yields "No such process"). I have to do quotaoff
    and quotaon again. It seems that quota is not turned back on when the
    OST comes up. Is there a way to always turn it on?
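
    What I run by hand today, plus the persistent quota_type parameter that
    I suspect is the proper fix; treat the tunefs.lustre lines as my own
    assumption from the 1.6 manual, not something we have tested:

        # manual re-enable after each failover
        lfs quotaoff -ug /fs
        lfs quotaon -ug /fs
        # persistent quota type on the targets (run on unmounted devices;
        # device names are made up)
        tunefs.lustre --param mdt.quota_type=ug /dev/sda1   # MDT
        tunefs.lustre --param ost.quota_type=ug /dev/sdb1   # each OST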


        Thank you very much in advance


    --

    -----------------------------------------------------------------------
    Somsak Sriprayoonsakul

    Thai National Grid Center
    Software Industry Promotion Agency
    Ministry of ICT, Thailand
    [EMAIL PROTECTED]
    -----------------------------------------------------------------------





--

-----------------------------------------------------------------------
Somsak Sriprayoonsakul

Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]
-----------------------------------------------------------------------


Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
