[Ocfs2-users] Kernel Panic / Fencing

2011-10-06 Thread Tony Rios
Hey all,

I'm running a current version of Ubuntu and we are using OCFS2 across
a cluster of 9 web servers.
Everything works perfectly, so long as none of the servers need to be
rebooted (or crash).

I've done several web searches and one of the items that I've found to
be suggested was to double the Heartbeat threshold.
I increased ours from 31 to 61 and it doesn't appear to have helped at all.
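[For reference, on Debian/Ubuntu packages the o2cb heartbeat threshold normally lives in /etc/default/o2cb (RHEL-style systems use /etc/sysconfig/o2cb); the value must match on every node and the cluster stack has to be restarted to pick it up. A sketch of the change, assuming the Debian-style path:]

```shell
# /etc/default/o2cb  (path varies by distro; /etc/sysconfig/o2cb on RHEL)
# A node fences itself after roughly (threshold - 1) * 2 seconds without
# a disk heartbeat, so 31 allows ~60s and 61 allows ~120s.
O2CB_HEARTBEAT_THRESHOLD=61

# Apply on every node, then restart the stack (unmount OCFS2 volumes first):
# sudo service o2cb restart
```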

I can't imagine that, when one server becomes unreachable, the design
intends for the entire cluster to go down.

I'm hoping that someone will have some feedback here because I'm at a loss.

Thanks so much,
Tony

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Kernel Panic / Fencing

2011-10-06 Thread Sunil Mushran
I am unclear. What happens when a server is rebooted (or crashes)?
The network crashes? Can you expand on this?

On 10/06/2011 05:52 PM, Tony Rios wrote:
 Hey all,

 I'm running a current version of Ubuntu and we are using OCFS2 across
 a cluster of 9 web servers.
 Everything works perfectly, so long as none of the servers need to be
 rebooted (or crash).

 I've done several web searches and one of the items that I've found to
 be suggested was to double the Heartbeat threshold.
 I increased ours from 31 to 61 and it doesn't appear to have helped at all.

 I can't imagine that if a server becomes unreachable that by design it
 is intended to crash the entire network.

 I'm hoping that someone will have some feedback here because I'm at a loss.

 Thanks so much,
 Tony





Re: [Ocfs2-users] one node kernel panic

2011-10-06 Thread Hideyasu Kojima
Thank you for responding.

I think UEK5 is based on the RHEL 5 kernel.
Does the same problem arise with UEK5?

(2011/10/05 1:45), Sunil Mushran wrote:
 int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
 {
         int error;

         spin_lock_irq(&current->sighand->siglock);  <<<< CRASH
         if (oldset)
                 *oldset = current->blocked;
         ...
 }

 current->sighand is NULL. So definitely a race. Generic kernel issue.
 Ping your kernel vendor.

 On 10/03/2011 07:49 PM, Hideyasu Kojima wrote:
 Hi,

 I run an ocfs2/drbd active-active 2-node cluster.

 ocfs2 version is 1.4.7-1
 ocfs2-tools version is 1.4.4
 Linux version is RHEL 5.4 (2.6.18-164.el5 x86_64)

 One node crashed once with a kernel panic.

 What is the cause?

 Below is the analysis of the vmcore.

 

 Unable to handle kernel NULL pointer dereference at 0808 RIP:
 [80064ae6] _spin_lock_irq+0x1/0xb
 PGD 187e15067 PUD 187e16067 PMD 0
 Oops: 0002 [1] SMP
 last sysfs file:
 /devices/pci:00/:00:09.0/:06:00.0/:07:00.0/irq
 CPU 1
 Modules linked in: mptctl mptbase softdog autofs4 ipmi_devintf ipmi_si
 ipmi_msghandler ocfs2(U) ocfs2_dlmfs(U) ocfs2_dlm(U)
 ocfs2_nodemanager(U) configfs drbd(U) bonding ipv6 xfrm_nalgo crypto_api
 bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U)
 dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core
 button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev
 sr_mod cdrom sg pcspkr serio_raw hpilo bnx2(U) dm_raid45 dm_message
 dm_region_hash dm_log dm_mod dm_mem_cache hpahcisr(PU) ata_piix libata
 shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
 Pid: 21924, comm: res Tainted: P 2.6.18-164.el5 #1
 RIP: 0010:[80064ae6] [80064ae6]
 _spin_lock_irq+0x1/0xb
 RSP: 0018:81008b1cfae0 EFLAGS: 00010002
 RAX: 810187af4040 RBX:  RCX: 8101342b7b80
 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808
 RBP: 81008b1cfb98 R08:  R09: 
 R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8
 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8
 FS: () GS:810105d51840()
 knlGS:
 CS: 0010 DS:  ES:  CR0: 8005003b
 CR2: 0808 CR3: 000187e14000 CR4: 06e0
 Process res (pid: 21924, threadinfo 81008b1ce000, task
 810187af4040)
 Stack: 8001db30 81007f070520 885961f3
 810105d39400
 88596323 06ff813231393234 810075463018 810075463018
 0297 81007f070520 810075463028 0246
 Call Trace:
 [8001db30] sigprocmask+0x28/0xdb
 [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
 [88596323] :ocfs2:ocfs2_delete_inode+0x130/0x1691
 [88581f16] :ocfs2:ocfs2_drop_lock+0x67a/0x77b
 [8858026a] :ocfs2:ocfs2_remove_lockres_tracking+0x10/0x45
 [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
 [8002f49e] generic_delete_inode+0xc6/0x143
 [88595c85] :ocfs2:ocfs2_drop_inode+0xf0/0x161
 [8000d46e] dput+0xf6/0x114
 [800e9c44] prune_one_dentry+0x66/0x76
 [8002e958] prune_dcache+0x10f/0x149
 [8004d66e] shrink_dcache_parent+0x1c/0xe1
 [80104f8b] proc_flush_task+0x17c/0x1f6
 [8008fa2c] sched_exit+0x27/0xb5
 [80018024] release_task+0x387/0x3cb
 [80015c50] do_exit+0x865/0x911
 [80049281] cpuset_exit+0x0/0x88
 [8002b080] get_signal_to_deliver+0x42c/0x45a
 [8005ae7b] do_notify_resume+0x9c/0x7af
 [8008b6a2] deactivate_task+0x28/0x5f
 [80021f3f] __up_read+0x19/0x7f
 [80066b58] do_page_fault+0x4fe/0x830
 [800b65b2] audit_syscall_exit+0x336/0x362
 [8005d32e] int_signal+0x12/0x17


 Code: f0 ff 0f 0f 88 f3 00 00 00 c3 53 48 89 fb e8 33 f5 02 00 f0
 RIP [80064ae6] _spin_lock_irq+0x1/0xb
 RSP81008b1cfae0
 crash> bt
 PID: 21924 TASK: 810187af4040 CPU: 1 COMMAND: res
 #0 [81008b1cf840] crash_kexec at 800ac5b9
 #1 [81008b1cf900] __die at 80065127
 #2 [81008b1cf940] do_page_fault at 80066da7
 #3 [81008b1cfa30] error_exit at 8005dde9
 [exception RIP: _spin_lock_irq+1]
 RIP: 80064ae6 RSP: 81008b1cfae0 RFLAGS: 00010002
 RAX: 810187af4040 RBX:  RCX: 8101342b7b80
 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808
 RBP: 81008b1cfb98 R8:  R9: 
 R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8
 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8
 ORIG_RAX:  CS: 0010 SS: 0018
 #4 [81008b1cfae0] sigprocmask at 8001db30
 #5 [81008b1cfb00] ocfs2_delete_inode at 88596323
 #6 [81008b1cfbf0] generic_delete_inode at 8002f49e
 #7 [81008b1cfc10] ocfs2_drop_inode at 88595c85
 #8