Hi all,

Just to keep you informed :) After 7 days of normal operation, we again had a server failure because of the OCFS2/DLM drop reference bug. I have added the log at the end of this message.
We're running CentOS 5 with the latest available Red Hat kernel, 2.6.18-238.5.1.el5, and OCFS2 1.4.7 installed from the packages provided by Oracle for this kernel.

The cluster has 3 nodes. Two nodes provide shared storage using DRBD and OCFS2. All 3 nodes access the shared storage over iSCSI (server 1 is the iSCSI target with DRBD as the backing device). The cluster is used to host a single web site. All 3 nodes run Apache web servers and access the web application files on the shared storage.

The bug happens when rsync is doing the daily backup. Interestingly, I noticed a similar error logged on another server, but that server did not hang with a kernel panic.

I plan to install the latest kernel provided by Oracle for RHEL5 (2.6.32-100.0.19.el5.x86_64) from public yum (http://public-yum.oracle.com/) and OCFS2 1.6. I hope that this kernel includes the bug fix. However, the answer I got in the original bug report, http://oss.oracle.com/bugzilla/show_bug.cgi?id=912, is not definite :(. I assume compiling the kernel or OCFS2 from the latest source code is possible, but it sounds like too much work.

Is anyone using OCFS2 1.4 on CentOS 5 in a similar setup without hitting this bug? It's strange to me that the fix for a bug marked as RESOLVED a year ago is not included in OCFS2 packages created a few weeks ago :( If the bug is kernel related and the RHEL kernel 2.6.18-238.5.1.el5 is not updated enough, then my only hope for now is installing Oracle's kernels.

Logged errors:

Mar 26 04:07:42 server3 kernel: (dlm_thread,5142,2):dlm_drop_lockres_ref:2216 ERROR: while dropping ref on BDB600C633D74D6B85C496D78F566879:N0000000000fc82d0 (master=1) got -22.
Mar 26 04:07:42 server3 kernel: lockres: N0000000000fc82d000fce096, owner=1, state=64
Mar 26 04:07:42 server3 kernel: last used: 4899685258, refcnt: 3, on purge list: yes
Mar 26 04:07:42 server3 kernel: on dirty list: no, on reco list: no, migrating pending: no
Mar 26 04:07:42 server3 kernel: inflight locks: 0, asts reserved: 0
Mar 26 04:07:42 server3 kernel: refmap nodes: [ ], inflight=0
Mar 26 04:07:42 server3 kernel: granted queue:
Mar 26 04:07:42 server3 kernel: converting queue:
Mar 26 04:07:42 server3 kernel: blocked queue:
Mar 26 04:07:44 server3 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Mar 26 04:07:44 server3 kernel: Kernel BUG at ...xiaowei/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmmaster.c:2218
Mar 26 04:07:44 server3 kernel: invalid opcode: 0000 [1] SMP
Mar 26 04:07:44 server3 kernel: last sysfs file: /devices/system/cpu/cpu7/cpufreq/scaling_cur_freq
Mar 26 04:07:44 server3 kernel: CPU 2
Mar 26 04:07:44 server3 kernel: Modules linked in: ocfs2(U) be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic cxgb3i libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ipt_recent acpi_cpufreq freq_table mperf ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs uio cxgb3 8021q iptable_nat ip_nat iptable_mangle ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg tpm_tis tpm i2c_i801 tpm_bios r8169 i2c_core shpchp mii serio_raw pcspkr i7core_edac edac_mc dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod raid10 raid456 xor raid0 sata_nv aacraid 3w_9xxx 3w_xxxx sata_sil sata_via ahci libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Mar 26 04:07:44 server3 kernel: Pid: 5142, comm: dlm_thread Tainted: G 2.6.18-238.5.1.el5 #1

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users