Hi,
On 26 Nov 2007, at 13:36, Somsak Sriprayoonsakul wrote:
We have about 177 client nodes.
I think the crashes happened only on the OSS.
We have had a similar problem. We have 600 clients and crashes happened
every 2 days. There is a bug filed for it:
https://bugzilla.lustre.org/show_bug.cgi?id=14293
If your kernel panic looks similar, you can be almost certain that it is
the same issue.
I do not have a screenshot yet. How can I get the crashdump log?
You can try netdump
http://www.redhat.com/support/wpapers/redhat/netdump/
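Roughly, the setup looks like this on CentOS 4 (the server address below
is only an example; see the paper above for the exact steps):
    # on a separate box that will collect the dumps
    yum install netdump-server
    chkconfig netdump-server on
    service netdump-server start      # dumps end up under /var/crash/
    # on each Lustre server that panics
    yum install netdump
    echo 'NETDUMPADDR=192.168.1.100' >> /etc/sysconfig/netdump   # example IP of the netdump server
    service netdump propagate         # pushes the ssh key (asks for the server's root password)
    chkconfig netdump on
    service netdump start
After the next panic the console log and vmcore should show up on the
server, and you can attach them to the bug report.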
Wojciech Turek wrote:
Hi,
how many clients (compute nodes) do you have in your cluster? What is
crashing randomly: clients, OSS, MDS, or maybe all of them?
Do you have a screenshot of the kernel panic or a crashdump log?
cheers,
Wojciech Turek
On 26 Nov 2007, at 13:25, Somsak Sriprayoonsakul wrote:
No. I use the stock bnx2 driver from the pre-built latest
kernel-lustre-1.6.3.
I forgot to mention the oops. It's something about Lustre
(lustre_blah_blah_blah something).
All other nodes also use bnx2, and there's no problem with them at all.
Matt wrote:
Somsak,
Did you build your own bnx2 driver? I was getting kernel panics
when hitting a certain load with Dell 1950s that also use the
bnx2 driver. My solution was to grab the bnx2 source code and
build it under the Lustre kernel. If you search the mailing
list you'll find the mails dealing with this.
If you see bnx2 mentioned in your kernel panic output, then it's
probably the cause.
Thanks,
Matt
On 26/11/2007, Somsak Sriprayoonsakul <[EMAIL PROTECTED]> wrote:
Hello,
We have a 4-node Lustre cluster that provides a parallel file system
for our 192-node compute cluster. The Lustre cluster runs CentOS 4.5,
x86_64 (Intel 4000 series), on HP DL360 G5 servers. The compute cluster
that uses it is ROCKS 4.2.1, on the same set of hardware. Our network is
Gigabit Ethernet, using the bnx2 driver. The Lustre setup is as follows
(see the mkfs.lustre sketch after the list):
storage-0-0: mgs+mdt, ost0, ost1 (backup)
storage-0-1: mgs+mdt (backup), ost0 (backup), ost1
storage-0-2: ost2, ost3 (backup)
storage-0-3: ost2 (backup), ost3
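For concreteness, each OST is formatted with its failover partner
declared at mkfs time, more or less like this (the fsname, NIDs and
devices below are only illustrative, not our real values):
    # MGS+MDT on storage-0-0, storage-0-1 declared as failover partner
    mkfs.lustre --fsname=lfs1 --mgs --mdt --failnode=192.168.1.2@tcp /dev/sdb
    # ost0: primary on storage-0-0, failover to storage-0-1; both MGS NIDs listed
    mkfs.lustre --fsname=lfs1 --ost --mgsnode=192.168.1.1@tcp \
        --mgsnode=192.168.1.2@tcp --failnode=192.168.1.2@tcp /dev/sdc
    # on failover, the backup node simply mounts the same shared device
    mount -t lustre /dev/sdc /mnt/lfs1/ost0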
We're using heartbeat 2.0.8, based on the pre-built RPM from CentOS.
All backups are configured so that they never run simultaneously with
the primary. Note that we enable flock and quota on Lustre.
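In a v1-style haresources configuration, for example, that constraint
simply comes from assigning each filesystem resource to its preferred
node, roughly like this (devices and mount points are only illustrative;
heartbeat 2.x can also express the same thing in the CRM/CIB XML):
    # /etc/ha.d/haresources, identical on both nodes of a failover pair
    # syntax: node resource::device::mountpoint::fstype
    storage-0-0 Filesystem::/dev/sdc::/mnt/lfs1/ost0::lustre
    storage-0-1 Filesystem::/dev/sdd::/mnt/lfs1/ost1::lustre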
The problem we have right now is that some of the nodes randomly panic.
This happens about once every week or two. We tolerate this crudely by
setting kernel.panic=60 and hoping that the backup node will not fail
within that time frame. This is actually working quite well (based on
user feedback, users do not even notice that the file system has
failed). The backup node takes over the OST and does recovery for about
250 seconds, then everything goes back to normal.
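For reference, that workaround is just the standard panic/reboot sysctl:
    # reboot automatically 60 seconds after a kernel panic
    sysctl -w kernel.panic=60
    echo 'kernel.panic = 60' >> /etc/sysctl.conf   # make it persistent across reboots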
Anyway, we're trying to nail down the reason why the file system
panics. I believe the information above will not suffice to track down
the cause. Could someone give me a way to debug or dump some useful
information that I can send to the list for later analysis? Also, does
the "RECOVERING" phase suffice to make the file system stable, or do we
need to shut down the whole system and run e2fsck+lfsck?
Also, after every panic the quota that was enabled ends up disabled
(lfs quota <user> /fs yields "No such process"). I have to do quotaoff
and quotaon again. It seems that quota is not being turned on when the
OST boots up. Is there a way to always turn this on?
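What I do by hand each time is roughly the following (file system
mounted at /fs as above):
    lfs quotaoff -ug /fs
    lfs quotacheck -ug /fs     # rebuilds the quota files; takes a while on large OSTs
    lfs quotaon -ug /fs
    lfs quota <user> /fs       # verify that quota answers again
I assume there is a persistent setting along the lines of
lctl conf_param <fsname>.mdt.quota_type=ug, but I have not verified that
parameter name against the 1.6 manual, so treat it as a guess.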
Thank you very much in advance
--
----------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]
----------------------------------------------------------------------
--
----------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]
----------------------------------------------------------------------
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517
--
----------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
[EMAIL PROTECTED]
----------------------------------------------------------------------
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss