[Ocfs2-users] A bug in OCFS2 umount
It looks like I found a bug regarding unmounts. umount lets you specify either the directory or the device name to unmount a file system. On at least the OEL 6.3 server I am working on, unmounting by device name leads to a message saying that the device is not mounted. This only happens with OCFS2 file systems; it works correctly with ext3/4. Ulf.
___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
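One way to see how the kernel records a mount before choosing between device name and directory is to look the device up in the mounts table. This is only a sketch: it assumes a Linux-style /proc/mounts, and the helper name mount_point_of is hypothetical, not part of umount or ocfs2-tools.

```shell
#!/bin/sh
# mount_point_of DEVICE [TABLE]
# Hypothetical helper (not part of umount or ocfs2-tools): print the mount
# point(s) recorded for DEVICE in a mounts table, /proc/mounts by default.
# Exits non-zero when the device is not listed under that exact name.
mount_point_of() {
    awk -v dev="$1" '
        $1 == dev { print $2; found = 1 }
        END       { exit found ? 0 : 1 }
    ' "${2:-/proc/mounts}"
}
```

If the lookup succeeds, unmounting by the printed directory, for example umount "$(mount_point_of /dev/mapper/u07)", avoids the device-name path. If it fails, the table may list the device under a different name than the one typed; that is something to check, not a confirmed cause of the behaviour reported here.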
Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slightly OT)
So far we only have OEL Network, but no support, as we are still in the investigation phase of switching from RHEL to OEL. So there is no support via that route yet.

From: Mihail Daskalov [mailto:mdaska...@technologica.com]
Sent: Wednesday, July 10, 2013 07:55
To: Sunil Mushran; Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: RE: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slightly OT)

Hi Sunil,

Regarding the ocfs tools version 1.8.0, you should know best what it was meant to be (maybe not true for 1.8.0-10 in OEL6U3). Is it possible that the tag for 1.8.0 disappeared from the git repository? Or was there never a tag for 1.8.0?

Below is the link to the commit in the 1.8.2 tag that brings the version to 1.8.0:
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=commitdiff;h=2480a215a600050d2bf923044dffac91439d982a;hp=8b5f4ad727e019cb557c4b516ab401c15c5c317e
and later on another commit that brings the version to 1.8.2:
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=commitdiff;h=560a1e60936fe868b00cfc9cad5def726e10828e

I am sorry that I am not actually helping with Ulf's problem. Ulf, maybe you can follow the head version and try to find an explanation of the error message. Anyway, I think it would be best to open an SR with Oracle if you have a Linux support contract.

Does anyone know how to find the git repository for at least some packages in Oracle Linux? I know the source for each package is available as .src.rpm, but how could I see the changes, or the tag from which each version was built? I remember Wim talking about something like that a while ago (saying Oracle is not like Red Hat, mangling changelogs), but I can't find the article right now. If you find out what is behind ocfs2-tools 1.8.0-10, it would be easier to track down the problem.

Regards,
Mihail Daskalov

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, July 10, 2013 2:11 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

The error does not make sense. Also I don't know what 1.8.0 tools means. I cannot see that label in the src tree.
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=summary
One option is to build the tools from the head.

On Tue, Jul 9, 2013 at 2:25 PM, Ulf Zimmermann u...@openlane.com wrote:

Sunil, any suggestions on this?

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Saturday, June 22, 2013 15:20
To: Sunil Mushran
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

[root@co-db03 ulf]# debugfs.ocfs2 -R stats /dev/mapper/aucp_data_bk_2_x
Revision: 0.90 Mount Count: 0 Max Mount Count: 20 State: 0 Errors: 0 Check Interval: 0 Last Check: Sun Sep 25 05:32:29 2011 Creator OS: 0 Feature Compat: 0 Feature Incompat: 0 Tunefs Incomplete: 0 Feature RO compat: 0 Root Blknum: 513 System Dir Blknum: 514 First Cluster Group Blknum: 256 Block Size Bits: 12 Cluster Size Bits: 20 Max Node Slots: 10 Extended Attributes Inline Size: 0 Label: /export/backuprecovery.AUCP UUID: 5F9C2727159743529200CE9C5E155562 Hash: 0 (0x0) DX Seeds: 0 0 0 (0x 0x 0x) Cluster stack: classic o2cb Cluster flags: 0 Inode: 2 Mode: 00 Generation: 3147295185 (0xbb97e9d1) FS Generation: 3147295185 (0xbb97e9d1) CRC32: ECC: Type: Unknown Attr: 0x0 Flags: Valid System Superblock Dynamic Features: (0x0) User: 0 (root) Group: 0 (root) Size: 0 Links: 0 Clusters: 1572864 ctime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011 atime: 0x0 0x0 -- Wed Dec 31 16:00:00.0 1969 mtime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011 dtime: 0x0 -- Wed Dec 31 16:00:00 1969 Refcount Block: 0 Last Extblk: 0 Orphan Slot: 0 Sub Alloc Slot: Global Sub Alloc Bit: 65535

From: Sunil Mushran [mailto:sunil.mush...@gmail.com]
Sent: Friday, June 21, 2013 11:11
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

Can you dump the following using the 1.8 binary. debugfs.ocfs2 -R stats /dev/mapper/.

On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:

We have a production cluster of 6 nodes, which are currently running RHEL 5.8
Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slightly OT)
I will see what I can do. How large would an o2image be?

To reiterate, these are not new file systems. They were created with ocfs2-2.6.9-55.ELsmp-1.2.9-1.el4 and ocfs2-tools-1.2.7-1.el4 under RHEL 4. The primary user of these volumes is a 6-node cluster running RHEL 5.8 with ocfs2-2.6.18-308.11.1.el5-1.4.10-1 and ocfs2-tools-1.6.3-2.el5. Another machine, which still runs the same EL4 binaries, mounts these snap-cloned volumes daily, performs operations on the DB files, and then copies the data off.

From: Herbert van den Bergh [mailto:herbert.van.den.be...@oracle.com]
Sent: Wednesday, July 10, 2013 09:54
To: Mihail Daskalov
Cc: Sunil Mushran; Ulf Zimmermann; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slightly OT)

It's possible that the 1.8.0 tag was never created in the ocfs2-tools git repository. But it's not of any use anyway. If you check the changelog of the ocfs2-tools rpm, you'll see that there were many patches since 1.8.0, so the 1.8.0-10 version that Ulf is using would be very different from a 1.8.0 tag in git.

Ulf, I suggest you create an o2image of the bad filesystem and see if the problem can be reproduced with that image. If it can, then you may want to make that o2image available to the OCFS2 developers so they can debug ocfs2-tools to see what is causing the malloc/free error. You may also want to include the exact steps to take to reproduce this, starting from the mkfs up to the failure, indicating exactly which versions of kernel and tools were used along the way.

Thanks,
Herbert.
Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6
Sunil, any suggestions on this?
Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6
[root@co-db03 ulf]# debugfs.ocfs2 -R stats /dev/mapper/aucp_data_bk_2_x
Revision: 0.90 Mount Count: 0 Max Mount Count: 20 State: 0 Errors: 0 Check Interval: 0 Last Check: Sun Sep 25 05:32:29 2011 Creator OS: 0 Feature Compat: 0 Feature Incompat: 0 Tunefs Incomplete: 0 Feature RO compat: 0 Root Blknum: 513 System Dir Blknum: 514 First Cluster Group Blknum: 256 Block Size Bits: 12 Cluster Size Bits: 20 Max Node Slots: 10 Extended Attributes Inline Size: 0 Label: /export/backuprecovery.AUCP UUID: 5F9C2727159743529200CE9C5E155562 Hash: 0 (0x0) DX Seeds: 0 0 0 (0x 0x 0x) Cluster stack: classic o2cb Cluster flags: 0 Inode: 2 Mode: 00 Generation: 3147295185 (0xbb97e9d1) FS Generation: 3147295185 (0xbb97e9d1) CRC32: ECC: Type: Unknown Attr: 0x0 Flags: Valid System Superblock Dynamic Features: (0x0) User: 0 (root) Group: 0 (root) Size: 0 Links: 0 Clusters: 1572864 ctime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011 atime: 0x0 0x0 -- Wed Dec 31 16:00:00.0 1969 mtime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011 dtime: 0x0 -- Wed Dec 31 16:00:00 1969 Refcount Block: 0 Last Extblk: 0 Orphan Slot: 0 Sub Alloc Slot: Global Sub Alloc Bit: 65535

From: Sunil Mushran [mailto:sunil.mush...@gmail.com]
Sent: Friday, June 21, 2013 11:11
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

Can you dump the following using the 1.8 binary. debugfs.ocfs2 -R stats /dev/mapper/.

On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:

We have a production cluster of 6 nodes, currently running RHEL 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple destinations; one of them is a RHEL4 machine with OCFS2 1.2.9. Because of that, the volumes are set up so that we can read them there. We are now trying to bring up a new server. This one has OEL 6.3 on it and comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2 --cloned-volume to reset the UUID, but when I try to change the label I get:

[root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP /dev/mapper/aucp_data_bk_2_x
tunefs.ocfs2: Invalid name for a cluster while opening device /dev/mapper/aucp_data_bk_2_x

fsck.ocfs2 core dumps with the following. I also filed a bug on Bugzilla for that:

[root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x
fsck.ocfs2 1.8.0
*** glibc detected *** fsck.ocfs2: double free or corruption (fasttop): 0x0197f320 ***
=== Backtrace: =
/lib64/libc.so.6[0x3656475366]
fsck.ocfs2[0x434c31]
fsck.ocfs2[0x403bc2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
fsck.ocfs2[0x402879]
=== Memory map:
0040-0045 r-xp fc:00 12489 /sbin/fsck.ocfs2
0064f000-00651000 rw-p 0004f000 fc:00 12489 /sbin/fsck.ocfs2
00651000-00652000 rw-p 00:00 0
0085-00851000 rw-p 0005 fc:00 12489 /sbin/fsck.ocfs2
0197e000-0199f000 rw-p 00:00 0 [heap]
3655c0-3655c2 r-xp fc:00 8797 /lib64/ld-2.12.so
3655e1f000-3655e2 r--p 0001f000 fc:00 8797 /lib64/ld-2.12.so
3655e2-3655e21000 rw-p 0002 fc:00 8797 /lib64/ld-2.12.so
3655e21000-3655e22000 rw-p 00:00 0
365640-3656589000 r-xp fc:00 8798 /lib64/libc-2.12.so
3656589000-3656788000 ---p 00189000 fc:00 8798 /lib64/libc-2.12.so
3656788000-365678c000 r--p 00188000 fc:00 8798 /lib64/libc-2.12.so
365678c000-365678d000 rw-p 0018c000 fc:00 8798 /lib64/libc-2.12.so
365678d000-3656792000 rw-p 00:00 0
3659c0-3659c16000 r-xp fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659c16000-3659e15000 ---p 00016000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659e15000-3659e16000 rw-p 00015000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3d3e80-3d3e817000 r-xp fc:00 12028 /lib64/libpthread-2.12.so
3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
Re: [Ocfs2-users] Unable to Install Ocfs2 in Oracle Linux 5 Machine.
To be more exact than the other replies: you are trying to install 3 packages:

ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm
ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64.rpm
ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64.rpm

The el5debug package is only needed if you are running the debug kernel; most people will not run that one. The el5xen package is for the kernel with Xen support. Since there is no error message for the first package, you only need to install that particular package.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of maanas
Sent: Monday, March 18, 2013 22:50
To: ocfs2-users@oss.oracle.com
Cc: Ranga Babu; Raju Pasagadi
Subject: [Ocfs2-users] Unable to Install Ocfs2 in Oracle Linux 5 Machine.

Hi,

I was trying to install OCFS2 on my Oracle Linux 5 machine. I created a mount point for the cluster file system:

# mkdir /u02

My kernel version is:

# uname -r
2.6.18-308.4.1.0.1.el5xen

I am downloading the appropriate version of the kernel module from this location:
https://oss.oracle.com/projects/ocfs2/files/RedHat/RHEL5/x86_64/1.4.10-1/2.6.18-308.4.1.el5/

When I try to execute this command to install:

# rpm -Uvh ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64.rpm ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64.rpm

I get this error:

warning: ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm: Header V3 DSA signature: NOKEY, key ID 1e5e0159
error: Failed dependencies:
kernel-debug = 2.6.18-308.4.1.el5 is needed by ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64
kernel-xen = 2.6.18-308.4.1.el5 is needed by ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64

Can anyone point out what the error is in this process?

Thanks,
Maanas.
Re: [Ocfs2-users] Subject: problem configuring ocfs on rhel5.8 kernel 2.6.18-300.el5
What is your kernel version? Run "uname -r". My guess is that your kernel is neither 2.6.18-238.9.1.el5 nor 2.6.18-308.1.1.el5. You will need to install the matching ocfs2-2.6.18-300.el5-1.4.9-1.el5.x86_64 package.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Asanka Gunasekera
Sent: Wednesday, September 12, 2012 12:32 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Subject: problem configuring ocfs on rhel5.8 kernel 2.6.18-300.el5

Hi,

This is a resend from a subscribed address; sorry if I am causing any inconvenience. I hope someone can help me with this. I have been struggling to get this working for a few weeks now. My issue is as follows.

I am just trying to use ocfs2 as a shared file system in a 2-node HA cluster, for an application that runs on these nodes. I have downloaded the packages below:

ocfs2-2.6.18-238.9.1.el5-1.4.9-1.el5.x86_64.r
ocfs2-2.6.18-308.1.1.el5-1.4.10-1.el5.x86_64

Installation goes through without any complaints, but when it is time to configure I get the errors below:

[root@ccbsn01 ~]# /etc/init.d/o2cb configure
Configuring the O2CB driver.
This will configure the on-boot properties of the O2CB driver. The following questions will determine whether the driver is loaded on boot. The current values will be shown in brackets ('[]'). Hitting ENTER without typing an answer will keep that current value. Ctrl-C will abort.
Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter none to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [3]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
Loading filesystem ocfs2_dlmfs: Unable to load filesystem ocfs2_dlmfs
Failed

And in the /var/log/messages log I get the errors below:

Sep 12 11:21:41 node01 modprobe: FATAL: Module ocfs2_stackglue not found.
Sep 12 11:21:41 node01 modprobe: FATAL: Module ocfs2_dlmfs not found.
Sep 12 11:33:13 node01 modprobe: FATAL: Module ocfs2_stackglue not found.
Sep 12 11:33:13 node01 modprobe: FATAL: Module ocfs2_dlmfs not found.

How can I fix this and get it working?

Thanks and Best Regards
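The advice in this reply comes down to checking that an ocfs2 kernel-module package exists for the exact running kernel. A minimal sketch of that check, assuming the package naming seen in this thread (ocfs2-<kernel release>-<ocfs2 version>); the function name is hypothetical.

```shell
#!/bin/sh
# has_matching_ocfs2 KERNEL_RELEASE
# Hypothetical helper: reads package names on stdin (one per line) and
# succeeds if one of them is an ocfs2 kernel-module package for
# KERNEL_RELEASE, assuming names of the form
# ocfs2-<kernel release>-<ocfs2 version>, as in this thread.
has_matching_ocfs2() {
    grep -q -F -- "ocfs2-$1-"
}

# Possible use on a live system (not executed here):
#   rpm -qa | has_matching_ocfs2 "$(uname -r)" ||
#       echo "no ocfs2 module package for kernel $(uname -r)"
```

With only the 238.9.1 and 308.1.1 packages installed on a 2.6.18-300.el5 kernel, the check fails, which would be consistent with the "Module ... not found" messages from modprobe.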
Re: [Ocfs2-users] RPM packages for EL5 U8 kernel 2.6.18-308.11.1.el5?
The latest packages at oss.oracle.com are for 308.8.1.el5. Are there any plans to provide packages for 308.11.1.el5?

Never mind, there is OCFS2 1.4.10, which has the packages.
[Ocfs2-users] RPM packages for EL5 U8 kernel 2.6.18-308.11.1.el5?
The latest packages at oss.oracle.com are for 308.8.1.el5. Are there any plans to provide packages for 308.11.1.el5? Ulf.
Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, September 26, 2011 10:09 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

I'll look at the tunefs issue. But the other one does not make sense. strict_jbd is a compat flag. Mount should work. What is the mount error? As in, in dmesg.

I don't see anything in dmesg or /var/log/messages, but the error I saw was from tunefs:

demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 /dev/mapper/u07
tunefs.ocfs2 1.2.7
tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device /dev/mapper/u07
Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Tuesday, September 27, 2011 9:27 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

On 09/27/2011 09:12 AM, Ulf Zimmermann wrote:

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, September 26, 2011 10:09 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

I'll look at the tunefs issue. But the other one does not make sense. strict_jbd is a compat flag. Mount should work. What is the mount error? As in, in dmesg.

I don't see anything in dmesg or /var/log/messages, but the error I saw was from tunefs:

demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 /dev/mapper/u07
tunefs.ocfs2 1.2.7
tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device /dev/mapper/u07

So that is correct. In short, that flag was added to allow us to use the jbd(2) features. We use this to create volumes > 16TB. I guess if you want to use it with 1.2, format it with the 1.2 tools.

That is what I ended up doing. And I also made a point to certain people in the company about not stopping in the middle of upgrading database servers.
Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
As tunefs.ocfs2 wasn't working for us, I tried to mkfs.ocfs2 the volumes again with --fs-feature-level=max-compat. This still turns on strict-journal-super and there seems no way around this? This makes the volume not compatible with OCFS 1.2.9.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Sunday, September 25, 2011 1:43 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

We are running into a problem which looks like the same we had with fsck.ocfs2 a while back. This is with ocfs2-tools 1.4.4. I am trying to use tunefs.ocfs2 to turn off some features. The program starts up but then starts eating all available memory and more and the system starts to swap like crazy in and out. This is exactly the same behavior as the fsck.ocfs2 for which we were given a patched binary. I tried to compile the tunefs.ocfs2 from 1.6.x but the same problem with that binary. Ulf.
Re: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the OCFS2 file system
While you run it, what is the memory usage? Is the system going into swap? Our 700GB volume would take half a day or more to finish, due to the system constantly having to swap.

From: Sergey Prilutsky [mailto:sprilut...@hotmail.com]
Sent: Wednesday, April 20, 2011 5:22 AM
To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
Subject: RE: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the OCFS2 file system

Hi there,

No, we are using the earlier version, 1.4.1:

[root@box1]# rpm -qa |grep ocfs2-tools
ocfs2-tools-debuginfo-1.4.1-1.el5.x86_64
ocfs2-tools-1.4.1-1.el5.x86_64

Thanks
Sergey Prilutsky

From: u...@openlane.com
To: sprilut...@hotmail.com; ocfs2-users@oss.oracle.com
Date: Tue, 19 Apr 2011 14:09:31 -0700
Subject: RE: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the OCFS2 file system

If you are using ocfs2-tools-1.4.4, there is a bug with memory usage. Sunil provided me a patched version, which is AFAIK still not in the downloadable tools version.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sergey Prilutsky
Sent: Tuesday, April 19, 2011 11:30 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the OCFS2 file system

Hi there,

A month ago we ran into the fsck issue while rebooting one of the Oracle RAC nodes running on Linux RHEL 5.2. It was hanging for 1.5 hours. During the reboot, the OS portion went fine, then it activated the data volumes in all data VGs with [OK]. Then it displayed the message: Checking filesystems - and that took 1.5 hrs, then it finished the reboot.

Last weekend we rebooted the same box and faced the same issue. However, we sent the break, commented out all OCFS2 lines in /etc/fstab (setting the last field to 0 in /etc/fstab did not work either), and it booted fine. Then we ran mount -a on them and life was good. Later on we also added fastboot to grub.conf, then booted it - no problem, for obvious reasons.

If anyone has experienced the same issue, would you mind shedding some light on it and sharing your experience and perhaps the fix?

Thanks
Sergey Prilutsky
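To see which OCFS2 entries a boot-time fsck would normally pick up, the sixth fstab field (fs_passno) can be listed per entry; a value of 0 usually means the file system is skipped, although the original poster reports that this did not help in their case. A rough sketch with a hypothetical function name:

```shell
#!/bin/sh
# ocfs2_fsck_entries [FSTAB]
# Hypothetical helper: print device, mount point and fs_passno for every
# ocfs2 entry in an fstab-format file (default /etc/fstab) whose sixth
# field is non-zero. A missing sixth field counts as 0.
ocfs2_fsck_entries() {
    awk '$1 !~ /^#/ && $3 == "ocfs2" && ($6 + 0) != 0 { print $1, $2, $6 }' \
        "${1:-/etc/fstab}"
}
```

An empty result means no OCFS2 entry requests a boot-time check through fstab, so a long "Checking filesystems" phase would have to come from somewhere else.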
[Ocfs2-users] What could cause slow down between OCFS2 1.2.9 and 1.4.4
We upgraded our production database cluster (6 node) from EL4 Update 5 to EL5 Update 5, including upgrading OCFS2 from 1.2.9 to 1.4.4. We are now noticing slowdown of batch jobs in Oracle, while hotbackup runs faster. One thing we saw is that journal mode changed from write-back to ordered, as we don't specify journal mode during mount. Oracle sees this as slowdown based on higher IO latency, going from 6-8ms to 13-15ms for single block IO. Total IO throughput has dropped. Can this be caused by the journal mode being ordered? Ulf.
Re: [Ocfs2-users] What could cause slow down between OCFS2 1.2.9 and 1.4.4
Another change we found is that we used to run the deadline I/O scheduler. We are taking downtime tonight to change the scheduler and the journal mode.

Ulf Zimmermann | Senior System Architect
OPENLANE, Inc | 2200 Bridge Parkway, Suite 202 Redwood City, CA 94065 | (650) 412-4042
u...@openlane.com

On Mar 11, 2011, at 14:32, Sunil Mushran sunil.mush...@oracle.com wrote:

Verifying the journal mode is easy enough. Remount with data=writeback. It can be done one node at a time. But since you upgraded from 4.5 to 5.5, you may have to cast a wider net considering the entire kernel also changed.
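Both settings discussed here can be read back before and after the change. The sketch below assumes the usual Linux formats: the scheduler file marks the active scheduler in square brackets, and the journal mode, when listed, appears as data=... among the mount options. The function names, device and mount point are placeholders.

```shell
#!/bin/sh
# active_scheduler: read a scheduler line such as
# "noop anticipatory deadline [cfq]" on stdin and print the bracketed
# (active) entry.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# journal_mode OPTIONS: print the value of data= from a comma-separated
# mount option string; prints nothing if the option is not listed.
journal_mode() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p'
}

# Possible use on a live system (device and mount point are placeholders):
#   active_scheduler < /sys/block/sdb/queue/scheduler
#   journal_mode "$(awk '$2 == "/export/u07" { print $4 }' /proc/mounts)"
#   mount -o remount,data=writeback /export/u07   # as suggested in the thread
```

Recording both values on one node before and after the remount makes it easier to tell which of the two changes, if either, moves the single-block I/O latency.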
Re: [Ocfs2-users] Reservation conflicts
-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Joel Becker
Sent: Thursday, December 09, 2010 2:54 PM
To: brad hancock
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Reservation conflicts

On Thu, Dec 09, 2010 at 04:45:25PM -0600, brad hancock wrote:

Yeah, both guests have the same hard drive attached as /dev/sdb1, with the virtual SCSI controller configured as Physical to set a policy that allows the virtual disk to be used simultaneously by multiple virtual machines.

It sure seems like VMWare is caching some data somewhere. That's my best guess. These are on the same host, right?

Joel

I have configured: SCSI Controller 1 Virtual (Virtual disks can be shared between any virtual machines on the same server.) Disks are configured as Independent. That works for my test cluster using OCFS inside of VMware.
Re: [Ocfs2-users] heartbeat and slot issues.
After the clone, you probably want to run tunefs.ocfs2 -U to reset the UUID. This is one of the steps we do when cloning volumes for database refreshes.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of brad hancock
Sent: Wednesday, November 24, 2010 12:35 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] heartbeat and slot issues.

I set up a host with an ocfs partition on a SAN and then cloned that host to another and renamed it. Both machines mount their ocfs partitions but give the following errors.

Host that was cloned:
(1888,0):o2hb_do_disk_heartbeat:762 ERROR: Device sdb1: another node is heartbeating in our slot!
[345413.242260] sd 1:0:0:0: reservation conflict
[345413.242270] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[345413.242274] end_request: I/O error, dev sdb, sector 1735
[345413.242536] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[345413.242788] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[345413.243159] sd 1:0:0:0: reservation conflict
[345413.243163] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[345413.243166] end_request: I/O error, dev sdb, sector 1735
[345413.243401] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[345413.243639] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[448460.370132] sd 1:0:0:0: reservation conflict
[448460.370145] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[448460.370149] end_request: I/O error, dev sdb, sector 1735
[448460.370395] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[448460.370638] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5

Clone:
sd 1:0:0:0: reservation conflict
[17643.588011] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_OK,SUGGEST_OK
[17643.588011] end_request: I/O error, dev sdb, sector 1735
[17643.588011] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[17643.588011] (1859,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[17643.588011] sd 1:0:0:0: reservation conflict

This didn't seem to be a problem, but I'm noticing the hosts are no longer seeing the same data. I unmounted the drives and remounted, and they were the same again. Thanks for any guidance.

cat /etc/ocfs2/cluster.conf
node:
    ip_port =
    ip_address = 10.x.x.248
    number = 0
    name = smes01
    cluster = ocfs2
node:
    ip_port =
    ip_address = 10.x.x.249
    number = 1
    name = smes02
    cluster = ocfs2
cluster:
    node_count = 2
    name = ocfs2

cluster.conf is the same on both hosts.
Re: [Ocfs2-users] heartbeat and slot issues.
Then you haven't cloned the volume, but it is the same, would be my guess.

From: brad hancock [mailto:braddhanc...@gmail.com]
Sent: Wednesday, November 24, 2010 1:54 PM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] heartbeat and slot issues.

Thanks for the response. Is it normal that when I change it on one node, the other node reflects the same UUID?

node1:
tunefs.ocfs2 -q -Q "BS=%5B\nUUID=%U\n" /dev/sdb1
BS= 4096
UUID=ea0778bd-bdaa-44af-8fbf-cb4a5d85e79f

node2:
tunefs.ocfs2 -q -Q "BS=%5B\nUUID=%U\n" /dev/sdb1
BS= 4096
UUID=ea0778bd-bdaa-44af-8fbf-cb4a5d85e79f

On Wed, Nov 24, 2010 at 3:00 PM, Ulf Zimmermann u...@openlane.com wrote:

After the clone, you want to probably run tunefs.ocfs2 -U to reset the UUID. This is one of the steps we do when cloning volumes for database refreshes.
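The comparison done by hand in this thread (same UUID reported on both nodes) can be scripted. The sketch below is only an illustration: `uuid_of` and `same_volume` are hypothetical helper names, and the input is assumed to be captured text containing a `UUID=<value>` line like the tunefs.ocfs2 query output quoted in the thread. Matching UUIDs alone do not say whether two hosts share one LUN or hold an un-reset clone; the hint in the thread was that a UUID change made on one node showed up on the other.

```shell
# Hypothetical helper: pull the UUID value out of text containing a
# "UUID=<value>" line (for example captured tunefs.ocfs2 query output).
uuid_of() {
    sed -n 's/^.*UUID=\([0-9A-Fa-f-]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeeds (exit status 0) when both captured
# outputs carry the same, non-empty UUID.
same_volume() {
    a=$(printf '%s\n' "$1" | uuid_of)
    b=$(printf '%s\n' "$2" | uuid_of)
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A possible use is to save the query output from each node into a variable or file and call `same_volume "$node1_out" "$node2_out" && echo "same UUID"`.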
Re: [Ocfs2-users] some beginner questions
OCFS2 requires shared storage. Is /dev/sdb a shared device?

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Alexander Nagel
Sent: Wednesday, July 14, 2010 12:03 PM
To: Ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] some beginner questions

Hi,

I'm new to the ocfs2 filesystem and I have some questions about it. I installed three servers according to the user guide from http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.4/ocfs2-1_4-usersguide.pdf

On every single server I have a working ocfs2 partition:
/dev/sdb1 on /mnt/oc1 type ocfs2 (rw,_netdev,heartbeat=local)

As I understand ocfs2, I can now use these partitions as a single data storage. But it doesn't work. When I create a file or directory on the ocfs2 partition, it doesn't appear on the other servers.

SERVER01:
server01:~# echo sdhfksjdhfskhskjgh > /mnt/oc1/testfile
server01:~# cat /mnt/oc1/testfile
sdhfksjdhfskhskjgh
server01:~# ls -lh /mnt/oc1/
insgesamt 0
drwxr-xr-x 2 root root 3,9K 13. Jul 16:53 lost+found
-rw-r--r-- 1 root root 19 14. Jul 20:48 testfile

SERVER02:
server02:~# ls -lh /mnt/oc1/
insgesamt 0
drwxr-xr-x 2 root root 3,9K 13. Jul 16:34 lost+found

server03: same result

The config file is the same on all three servers. I created it and copied it with the GUI program to all three servers, so all servers have exactly the same file. What did I miss? Can somebody give me a hint? Or did I misunderstand ocfs2?

thanks
Alexander

--
Alexander Nagel
E-mail: alexan...@acwn.de
Homepage: http://www.acwn.de/ http://www.standspur-kadaver.de/
Re: [Ocfs2-users] some beginner questions
As Sunil already wrote: iSCSI, FC attached SAN, Firewire (not really recommended for production).

-Original Message-
From: Alexander Nagel [mailto:alexan...@acwn.de]
Sent: Wednesday, July 14, 2010 12:44 PM
To: Ulf Zimmermann
Cc: Ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] some beginner questions

Hi,

On 14.07.2010 21:08, Ulf Zimmermann wrote:
OCFS2 requires shared storage. Is /dev/sdb a shared device?

Thanks for your quick response, and that clarifies the situation. /dev/sdb is a single hard disk in each server; it is not shared storage. I thought that ocfs2 would then work like a single disk. What type of storage must this shared device be?

thanks
Alexander
Re: [Ocfs2-users] df showing wrong size
Make sure you don't have deleted files which are still open. You can use lsof to find those.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Garcia, Raymundo
Sent: Sunday, June 27, 2010 11:18 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] df showing wrong size

Hello... it was brought to my attention that a partition on one of our production systems was displaying the wrong size with the df command: 123 GB, when in fact the size of all the files is a mere 15 GB. What is going on? Shall we use fsck.ocfs2 to fix that? It is strange... Thanks for any comment.

Raymundo Garcia
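The lsof check suggested above can be narrowed to one mount point with a small filter. This is a sketch under stated assumptions: `deleted_under` is a hypothetical helper name, the input is lsof-style text in which Linux marks unlinked files with a trailing `(deleted)`, and paths containing spaces are not handled. On a cluster file system the file may be held open on a different node, so the check would need to run on every node that mounts the volume.

```shell
# Hypothetical helper: read lsof-style output on stdin and print the
# entries located under mount point $1 whose NAME is flagged "(deleted)".
deleted_under() {
    awk -v mp="$1" '
        /\(deleted\)$/ {
            name = $(NF - 1)    # the path is the field just before "(deleted)"
            if (index(name, mp "/") == 1) print
        }'
}
```

One way to feed it is `lsof +L1 | deleted_under /u02`, where `+L1` asks lsof for open files with a link count below one and `/u02` stands in for the affected mount point.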
Re: [Ocfs2-users] Info on Version Upgrade
We have been running the -55 kernel for a long time on our Oracle servers. There is one known issue for us, a memory leak in the kernel in conjunction with the HP management agents, but otherwise -55 has been OK for us. There have been plenty of fixes in newer kernels, but also traps. When we did try to upgrade, we ran into kernel panics triggered by network drivers and we rolled back.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, June 16, 2010 8:15 AM
To: Martin Eddy
Cc: Ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Info on Version Upgrade

As far as ocfs2 is concerned, the current version of ocfs2 1.2 is ocfs2 1.2.9. You will find the packages for your kernel on oss.oracle.com. The news section has the list of changes/bugs fixed. asmlib also has some updates. You can review the fixes to see whether an upgrade is warranted. For all other questions, ping Oracle support and/or Red Hat support.

Sunil

On 06/16/2010 08:02 AM, Martin Eddy wrote:

I hope this question can be posted here; if not, I do apologize. I have inherited a production system running at a client site. It is a 2-node RAC cluster presently running on an HP EVA 8000 with:

* Oracle 10.2.0.3
* Redhat 4.5, kernel 2.6.9-55 (updated to 4.8, kernel remains at 2.6.9-55)
* OCFS2
  * OCFS2-tools-1.2.7-1.el4.i386
  * OCFS2console-1.2.7-el4.i386
  * OCFS2-2.6.9-55.ELsmp-1.2.8-2.el4.i386
* ASM
  * oracleasm-support-2.0.3-1.i386
  * oracleasm-2.6.9-55.ELsmp-2.0.3-1.i386
  * oracleasmlib-2.0.2-1.i386

There was a kernel crash on the weekend and I suspect there will be a request to update the kernel. There has been talk of upgrading Oracle to 10.2.0.4 for some time, but the client has not been willing to move on it. I am looking for some professional insight on what can be upgraded versus what should be upgraded: just OCFS2 and ASM, or both of these as well as the RDBMS. The HBA drivers would also need to be upgraded.

Any info or opinions would be greatly appreciated. If any additional info is needed, please do not hesitate to ask. Thanks
Re: [Ocfs2-users] Why OCFS2 with RAC
-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of David Johle
Sent: Wednesday, June 16, 2010 10:37 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Why OCFS2 with RAC

I have been a user of OCFS2 for quite some time now (2 years or so) and a user of Oracle RAC for several years as well. My usage of these two is completely independent though; the cluster filesystem is for application level usage (web servers, etc.) only, because RAC can manage its own shared storage directly. And that brings me to my observation/question...

We have been running Oracle RAC on top of OCFS1 and OCFS2 for close to, or over, 7 years now, I think. Throughout that time we had major issues only twice, and in both cases they were quickly solved with the help of Sunil and other OCFS developers.

I keep seeing a lot of messages on the list relating to the use of ocfs2 on systems running RAC, often with problems in the OCFS2 camp, but that's just the nature of this list. What I'm trying to understand is why one would even consider adding OCFS2 to the systems running RAC. To me that is just adding complexity to critical systems, and complexity typically results in decreased reliability/availability in the long term.

For us it was the only choice, because with the number of databases we have, our DBAs were not willing to go to raw storage (I believe ASM wasn't available yet originally). OCFS2 has worked pretty well for us.

I have iSCSI targets, available on multiple paths, presented as block devices with device-mapper-multipath. My data and flash recovery areas are then managed by ASM, which directly uses those block devices. For my OCR and voting disks, one could do the same with block devices, although I know there were some issues with using block devices on certain versions, but raw devices are another option there. In my case, I simply use libraw to present the iSCSI based block devices as raw devices, and that's what they use.

Having RAC directly deal with the shared storage eliminates a lot of filesystem level overhead and removes a potential outside force that could unexpectedly bring DB nodes down (i.e. the ocfs2 cluster stack fencing itself). Not to mention the reduction in system administration resources for managing the filesystem. One place I could see a use for OCFS2 is a shared ORACLE_HOME among nodes, but that has its own pros and cons, which can be debated on some other mailing list :)

So I'm curious, what benefits are there to having OCFS2 available on the RAC system, more so related to using it for CRS and DB storage purposes?
[Ocfs2-users] fsck.ocfs2 using huge amount of memory?
We are setting up 2 new EL5 U4 machines to replace our current database servers running our demo environment. We use 3Par SANs and their snap clone options. The current production system we snap clone from is EL4 U5 with ocfs2 1.2.9, the new servers have ocfs2 1.4.3 installed. Part of the refresh process is to run fsck.ocfs2 on the volume to recover, but right now as I am trying to run it on our 700GB volume it shows a virtual memory size of 21.9GB, resident of 10GB and it is killing the machine with swapping (24GB physical memory). Can anyone enlighten what is going on? Ulf.
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
Correction, kernel modules are 1.4.4, the tools and console is 1.4.3.
Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?
And upgrading to kernel modules 1.4.7, tools 1.4.4 didn't change the memory part:

  PID USER   PR  NI   VIRT   RES  SHR  S  %CPU  %MEM    TIME+  COMMAND
29532 root   18   0  21.9g   10g    4  D  21.1  45.0  0:15.24  fsck.ocfs2
Re: [Ocfs2-users] OCFS2 works like standalone
OCFS needs shared storage; your /dev/sda sounds like local storage, not shared.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of v...@ghs.l.google.com
Sent: Thursday, March 18, 2010 11:16 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] OCFS2 works like standalone

I have installed OCFS2 on two SuSE 10 nodes. Everything seems to work nicely at first sight. But /dev/sda (ocfs2) on rac1 is not shared over the network (port ) with rac0. On both nodes I have 500 MB /dev/sda disks that are mounted (and are ocfs2), but they do not share their content (files and folders) with each other. When I create a file on one node I expect to see it on the other node, but it does not appear. So how do I make OCFS share the same disk between both nodes? (mounted.ocfs2 -f shows only one node handling this disk.)

1. The two nodes are connected with 1 Gb interconnect cards.
2. netstat on both nodes says that they are listening on
3. I can telnet from one node to the other on port. (The connection is established and then closed with the ^ character, so that works.)
4. Both nodes are configured well, see below (this is rac1; rac0 has an analogous result).
5. ocfs2console - configure nodes shows these two nodes, Propagate was performed, and the device is mounted on the mountpoint.

rac1:/var/log # modinfo ocfs2
filename:    /lib/modules/2.6.16.21-0.8-default/kernel/fs/ocfs2/ocfs2.ko
author:      Oracle
license:     GPL
description: OCFS2 1.2.1-SLES Tue Apr 25 14:46:36 PDT 2006 (build sles)
version:     1.2.1-SLES
vermagic:    2.6.16.21-0.8-default 586 REGPARM gcc-4.1
supported:   yes
depends:     ocfs2_nodemanager,ocfs2_dlm,jbd,configfs
srcversion:  B45E2E0A0B86D1E2295CD6B

rac1:/var/log # vi /etc/ocfs/cluster.conf
node:
    ip_port =
    ip_address = 192.168.56.121
    number = 0
    name = rac1
    cluster = ocfs2
node:
    ip_port =
    ip_address = 192.168.56.101
    number = 1
    name = rac0
    cluster = ocfs2
cluster:
    node_count = 2
    name = ocfs2

rac1:~ # netstat -anlp | grep
tcp 0 0 0.0.0.0: 0.0.0.0:* LISTEN -

rac1:~ # /etc/rc.d/o2cb status
Module configfs: Loaded
Filesystem configfs: Mounted
Module ocfs2_nodemanager: Loaded
Module ocfs2_dlm: Loaded
Module ocfs2_dlmfs: Loaded
Filesystem ocfs2_dlmfs: Mounted
Checking cluster ocfs2: Online
Checking heartbeat: Active

rac1:~ # /etc/rc.d/ocfs2 status
Active OCFS2 mountpoints: /mnt/u01

rac1:~ # mounted.ocfs2 -f
Device    FS     Nodes
/dev/sda  ocfs2  rac1

dmesg says:
ocfs2_dlm: Nodes in domain (6BC17BABF90444138BFD125263D82586): 0
kjournald starting. Commit interval 5 seconds
ocfs2: Mounting device (8,0) on (node 0, slot 0)

SuSE Linux 10
# uname -r
2.6.16.21-0.8-defaults

Thanks in advance
Re: [Ocfs2-users] OCFS2 Multipath Configuration
On our 3Par SAN we use O2CB_HEARTBEAT_THRESHOLD of 76 (180 seconds). This is the time needed for the controller to fully recover in case of a crash or software upgrade. Multipath is configured with a polling_interval of 10 and no_path_retry of 60. With these settings we are able to survive a SAN switch crash (has happened), a SAN controller crash (has happened) and SAN controller upgrades.

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Elliott Perrin
Sent: Thursday, March 18, 2010 6:58 PM
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Multipath Configuration

I have seen node fences occur on all of our OCFS2 systems when O2CB_HEARTBEAT_THRESHOLD is set to less than 31, on both FC and iSCSI connected systems. In some testing I have done, I found that all nodes would panic with O2CB_HEARTBEAT_THRESHOLD set to any value less than 15, and we would see single node fencing with values between 15 and 30. With O2CB_HEARTBEAT_THRESHOLD set to 31 or greater we do not see a fence occur when fabric disruption occurs, again on both iSCSI and FC. This is with a multipath configuration that includes

polling_interval 5
no_path_retry 5
features "1 queue_if_no_path"
failback immediate
selector "round-robin 0"

as some of the configuration variables used in the setup. As long as the output of mount (not mounted.ocfs2) shows that your file system mount is from a /dev/mapper created multipath device, then OCFS2 is using the multipath provided device.

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Thursday, March 18, 2010 2:11 PM
To: David Johle
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Multipath Configuration

Yeah.. mounted is a bit dumb. In the next release, it will recognize /dev/mapper devices. We still need to teach it to handle multipathing fully.

David Johle wrote:

I'm not sure why mounted.ocfs2 is showing both the dm and the sd devices for the same volume. But this could all be very similar to a problem I've experienced with OCFS2 mount finding the right device with multipathing. See the following thread for some more insight: http://oss.oracle.com/pipermail/ocfs2-users/2009-March/003391.html

At 04:12 PM 3/15/2010, you wrote:

Date: Wed, 16 Dec 2009 14:20:45 -0500
From: Nyburg, Daryl daryl.nyb...@utoledo.edu
Subject: [Ocfs2-users] OCFS2 Multipath Configuration
To: ocfs2-users@oss.oracle.com

Hello All,

I am having some problems configuring OCFS2 to use only the multipath device name. We have been doing failover testing with our iSCSI SAN, and as soon as we unplug one NIC the following messages appear in /var/log/messages and the system reboots.

Dec 16 12:39:03 mcprac01 kernel: (56,6):o2hb_write_timeout:172 ERROR: Heartbeat write timeout to device dm-34 after 12 milliseconds
Dec 16 12:39:03 mcprac01 kernel: (56,6):o2hb_stop_all_regions:1967 ERROR: stopping heartbeat on all active regions.
Dec 16 12:39:03 mcprac01 kernel: ocfs2 is very sorry to be fencing this system by restarting

RHEL 5.3
Kernel 2.6.18-128.el5
OCFS2 Versions:
ocfs2console-1.4.3-1.el5
ocfs2-2.6.18-128.el5-1.4.2-1.el5
ocfs2-tools-1.4.3-1.el5

Why does this show all device paths? Is there any way to tell OCFS2 to ignore the /dev/sd* devices?

$ mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/sdf1   ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
/dev/dm-25  ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
/dev/sdaj1  ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
/dev/sdd1   ocfs2  199f76a6-280a-46c6-812e-50712170a823
/dev/dm-32  ocfs2  199f76a6-280a-46c6-812e-50712170a823
/dev/sdai1  ocfs2  199f76a6-280a-46c6-812e-50712170a823
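The check described in this thread (look at `mount`, not `mounted.ocfs2`, and confirm the source is a /dev/mapper device) can be automated with a short filter. This is a minimal sketch: `ocfs2_not_on_mapper` is a made-up name, and it assumes the classic `mount` output layout of `<device> on <mountpoint> type <fstype> (<options>)`.

```shell
# Hypothetical helper: read `mount` output on stdin and print
# "<device> <mountpoint>" for every ocfs2 mount whose source device is
# NOT a /dev/mapper path. No output means all ocfs2 mounts look multipathed.
ocfs2_not_on_mapper() {
    awk '$2 == "on" && $4 == "type" && $5 == "ocfs2" && $1 !~ /^\/dev\/mapper\// {
        print $1, $3
    }'
}
```

It would typically be run as `mount | ocfs2_not_on_mapper` on each node after boot, since a volume mounted through a single /dev/sd* path would lose access as soon as that one path fails.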
Re: [Ocfs2-users] Fencing options
-Original Message-

Questions:
Can we set up redundant heartbeat IP connections?
Can we also add a disk heartbeat?
If it truly is network connectivity, can we set the timeout to be more lenient?
And can we change the fencing to something other than machine reset? E.g. unmount the volume, change it to read only, etc.?

There is a network and a disk heartbeat, AFAIK. The timeouts are controlled via /etc/sysconfig/o2cb (on RedHat at least, not sure if SuSE follows the same way). In there you have:

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=dbtest
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=76
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=3
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000

The above values are what we use on our clusters; we have three 2-node, one 4-node and one 6-node cluster. These are all running RedHat EL4 on HP hardware (DL360 g4, g5 or DL380 g5).

Thanks...
Angelo
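Since O2CB_HEARTBEAT_THRESHOLD is expressed in heartbeat iterations rather than seconds, a conversion helper can make the value easier to reason about. The sketch below assumes the relationship given in the OCFS2 FAQ as I understand it, threshold = (timeout in seconds / 2) + 1, with one iteration every 2 seconds; the function names are made up for illustration. Under that assumption a threshold of 76 works out to 150 seconds, so any other figure quoted for it is worth double-checking against the documentation for your release.

```shell
# Hypothetical helper: disk heartbeat timeout in seconds for a given
# O2CB_HEARTBEAT_THRESHOLD, assuming one iteration every 2 seconds.
o2hb_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

# Hypothetical helper: the inverse, i.e. the threshold to configure
# for a desired timeout in seconds (same assumption as above).
o2hb_threshold_for() {
    echo $(( $1 / 2 + 1 ))
}
```

For example, `o2hb_threshold_for 180` gives the threshold that would correspond to a 180 second timeout, and `o2hb_timeout_secs 31` gives the timeout for a threshold of 31.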
Re: [Ocfs2-users] Persistent lun problem
Look at device-mapper-multipath

Regards, Ulf.

OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Pedro Figueira
Sent: 10/01/2008 11:40
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Persistent lun problem

Dear all,

I'm investigating the possible solutions for persistent device names to mount OCFS2 disks, without using LVM2, which is not supported. In RH4, if you reboot the server or add or remove LUNs, the device files (traditionally /dev/sdX) can change, so you can mount the filesystem on the wrong directory. From what I can see there are some solutions, like multipath or udev, to create persistent device files. However, the solution can be simpler if, in the /etc/fstab file, one indicates the label of the OCFS2 volume and not the device file. I've tried this with OCFS2 1.2 with good results, but my question is whether it's supported (i.e. whether mounting an OCFS2 volume by label is supported) and whether there is any problem with this approach. Will it work on future OCFS2 releases?

Best regards and thanks
Pedro Figueira
Re: [Ocfs2-users] migration methods (ocfs - ocfs2)
No, the fscat tools can only read certain unmounted file systems. They cannot write. You can use them to copy data from ocfs to ocfs2 on a box running the 2.6 kernel (sles9/10, el4/5).

And when we tried to use fscat about a year ago to read our OCFS volumes on a RHEL4 machine, it wasn't working. The source code itself had to be updated, and as far as I know that has not happened. I had email contact with Joel Becker and he said he would look into it, but it never happened.

Mehmet Can ÖNAL wrote:

Hi everyone,

We have a production system with 6 nodes of RAC on the ocfs file system. We will be upgrading our production system, and we would prefer to (and also should) migrate our data to ocfs2. Both migration methods, fscat and backup/restore, could be useful for us, but there is one thing we could not find out:

1) If we use fscp as the migration method, can we use it to pass data from ocfs to ocfs2? Is fscp bidirectional, i.e. can it copy from ocfs to ocfs2 and also from ocfs2 to ocfs?

2) For backup/restore the doubt is the same. We can restore an ocfs backup to ocfs2; could we also restore an ocfs2 backup to an ocfs volume?

Thanks a lot

Ulf Zimmermann | Senior System Architect
OPENLANE
4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025
O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929
Email: u...@openlane.com | Web: www.openlane.com
Re: [Ocfs2-users] OCFS2 mount points not automatically mounting onserver reboot
Isn't your problem that you are setting the filesystem type to ocfs2_oracw and ocfs2_oragrid? Unless this has changed after 1.2.9, it should be ocfs2.

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of McKinley, Reid
Sent: Thursday, August 06, 2009 12:07 PM
To: McKinley, Reid; Srinivas Eeda
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting onserver reboot

I should mention that the OCFS2 mount points mount fine when the manual mount is done after the reboot.

Cmds for manual mounting:

mount -o datavolume,nointr,_netdev,noatime -t ocfs2 /dev/mapper/mpath0 /u02
mount -o datavolume,nointr,_netdev,noatime -t ocfs2 /dev/mapper/mpath1 /u03

They just do not auto mount when rebooting.

/etc/fstab entries look like this:

/dev/mapper/mpath0 /u02 ocfs2_oracw datavolume,nointr,_netdev 0 0
/dev/mapper/mpath1 /u03 ocfs2_oragrid datavolume,nointr,_netdev 0 0

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of McKinley, Reid
Sent: Thursday, August 06, 2009 2:47 PM
To: Srinivas Eeda
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting onserver reboot

Thanks, Srini. We cannot locate any other applicable msgs in the dmesg for this. Here is a portion.
SCSI device sds: drive cache: write back sds: unknown partition table SCSI device sdt: 57688320 512-byte hdwr sectors (29536 MB) sdt: Write Protect is off sdt: Mode Sense: 6b 00 00 08 SCSI device sdt: drive cache: write back sdt: unknown partition table SCSI device sdu: 57688320 512-byte hdwr sectors (29536 MB) sdu: Write Protect is off sdu: Mode Sense: 6b 00 00 08 SCSI device sdu: drive cache: write back sdu: unknown partition table SCSI device sdv: 57688320 512-byte hdwr sectors (29536 MB) sdv: Write Protect is off sdv: Mode Sense: 6b 00 00 08 SCSI device sdv: drive cache: write back sdv: unknown partition table SCSI device sdw: 57688320 512-byte hdwr sectors (29536 MB) sdw: Write Protect is off sdw: Mode Sense: 6b 00 00 08 SCSI device sdw: drive cache: write back sdw: unknown partition table SCSI device sdx: 57688320 512-byte hdwr sectors (29536 MB) sdx: Write Protect is off sdx: Mode Sense: 6b 00 00 08 SCSI device sdx: drive cache: write back sdx: unknown partition table Hangcheck: starting hangcheck timer 0.9.0 (tick is 1 seconds, margin is 10 seconds). Hangcheck: Using monotonic_clock(). Here is the additional info. 
[r...@servername01 rc6.d]# ls K02avahi-daemonK20oracleasm K74ntpd K89netplugd K02avahi-dnsconfd K25sshdK75netfsK89rdisc K02haldaemon K30sendmailK80kdumpK90network K03rhnsd K35smb K85mdmonitorK92ip6tables K03yum-updatesdK35winbind K85mdmpdK92iptables K05atd K50netconsole K85messagebus K95kudzu K05saslauthd K50snmpd K85rpcgssd K96init.crs K10cupsK50snmptrapd K85rpcidmapdK97sysstat K10psacct K50xinetd K86nfslock K99lvm2-monitor K10Tivoli_lcf1 K60crond K87irqbalance K99microcode_ctl K10xfs K69rpcsvcgssd K87mcstrans S00killall K15httpd K72autofs K87multipathd S01reboot K19ocfs2 K73ypbind K87portmap K20nfs K74lm_sensors K87restorecond K20o2cbK74nscdK88syslog [r...@chastoemgc01 rc6.d]# From: Srinivas Eeda [mailto:srinivas.e...@oracle.com] Sent: Thursday, August 06, 2009 1:52 PM To: McKinley, Reid Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting on server reboot Reid, that error about stackglue module is harmless and can be ignored. Are you seeing any other errors associated with mounts failing in dmesg? Is the storage up by then? Can you list ls /etc/rcrunlevel.d/ thanks, --Srini McKinley, Reid wrote: Our ocfs2 mount points will not mount on server reboot. This is a critical issue because we are storing the Oracle Clusterware OCR and Voting files on a OCFS2 mount point. We receive this error in the system log: modprobe: FATAL: Module ocfs2_stackglue not found. 
/etc/fstab entries look like this:

/dev/mapper/mpath0 /u02 ocfs2_oracw datavolume,nointr,_netdev 0 0
/dev/mapper/mpath1 /u03 ocfs2_oragrid datavolume,nointr,_netdev 0 0

We are using the following config:

[r...@servername02 ~]# uname -r
2.6.18-92.el5
[r...@servername02 ~]# rpm -qa|grep -i ocfs2
ocfs2-tools-devel-1.4.2-1.el5
ocfs2-2.6.18-92.el5-1.4.2-1.el5
ocfs2-tools-1.4.2-1.el5
ocfs2console-1.4.2-1.el5

We have the ocfs2 and o2cb services configured to restart:

[r...@servername02 ~]# chkconfig --list |grep ocfs
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off
[r...@servername02 ~]# chkconfig
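The reply at the top of this thread points at the filesystem-type column of the fstab entries. As a minimal sketch of that check, assuming the standard fstab column order (device, mount point, type, options), the hypothetical helper below lists entries whose type starts with ocfs2 but is not exactly ocfs2. The function name is made up for illustration and is not part of ocfs2-tools.

```shell
# Hypothetical helper (not part of ocfs2-tools): print mount points in an
# fstab-style file whose type field begins with "ocfs2" but is not exactly
# "ocfs2", which is the mistake suggested in the reply above.
find_bad_ocfs2_types() {
    awk '$1 !~ /^#/ && NF >= 3 && $3 ~ /^ocfs2/ && $3 != "ocfs2" { print $2 ": " $3 }' "$1"
}

# Usage (prints nothing when every OCFS2 entry uses the plain type):
#   find_bad_ocfs2_types /etc/fstab
#
# Entries corrected along the lines suggested would read:
#   /dev/mapper/mpath0 /u02 ocfs2 datavolume,nointr,_netdev 0 0
#   /dev/mapper/mpath1 /u03 ocfs2 datavolume,nointr,_netdev 0 0
```

Whether the type field is the only cause of the failed automatic mount is not confirmed in the thread, so this only checks the one thing the reply raises.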
Re: [Ocfs2-users] OCFS2 FS with BACKUP Tools/Vendors
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Daniel Keisling
Sent: Thursday, April 02, 2009 1:22 PM
To: Bumpass, Brian; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 FS with BACKUP Tools/Vendors

I use HP Data Protector. OCFS2 is supported in v6.0.

How do you do that? I don't see that support in 6.0 or 6.1.
Re: [Ocfs2-users] mounting mpath devices
-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Masaryk Kevin D
Sent: 02/24/2009 13:21
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] mounting mpath devices

I'm seeing some strange behavior from OCFS2 while trying to mount mpath/dm-multipath devices under RHEL5. Sometimes I can mount the EMC-connected, dual-pathed volumes just fine, and sometimes I get device busy errors. I've tried mounting by label and also by explicit /dev/mapper/mpathXpY name, with the same unpredictable behavior.

I've also noticed that sometimes, when a device is successfully mounted on all nodes, each node may return different output from the mount command regarding the device mounted; e.g. /dev/mapper/mpath0p1 vs. /dev/dm-17. One consistent aspect I have noticed whenever I receive the device busy error is that the /dev/dm-X names don't match up on each node. I also see that ocfs2console refers to each device by the /dev/dm-X name instead of the /dev/mapper/XX name.

I guess my question is simply: Are dm-multipath devices supported under OCFS2? Are multipathed devices not recommended with OCFS2? Any documentation available on this?

We are using OCFS2 with device-mapper-multipath (4 paths) and we use the /dev/mapper/XX names. The names match across machines, although we did not pay particular attention to making them the same.

Regards, Ulf.

OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
Re: [Ocfs2-users] strange node reboot in RAC environment
-Original Message- From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users- boun...@oss.oracle.com] On Behalf Of Pedro Figueira Sent: 02/03/2009 09:07 To: ocfs2-users@oss.oracle.com Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida Subject: [Ocfs2-users] strange node reboot in RAC environment Hi all We have a 4 Oracle RAC with the following versions of software versions: Oracle and clusterware version 10.2.0.4 Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9- 55.ELlargesmp ocfs2-tools-1.2.4-1 ocfs2-2.6.9-55.ELlargesmp-1.2.5-2 ocfs2console-1.2.4-1 timeout parameters: Heartbeat dead threshold: 31 Network idle timeout: 1 Network keepalive delay: 5000 Network reconnect delay: 2000 Until later last year the cluster was rock solid (hundreds). From January forward all the servers started to reboot synchronized but the strange thing is that there are no log messages in /var/log/messages, so we don't know if this a ocfs2 related problem. This reboots seems be related with the backup process (maybe extra load?). Other reboots only affect 2 out of 4 nodes. As ocfs2 will print out messages to the console and they might not get capture by anything, I recommend to setup the virtual serial of iLO and use something like conserver to attach a console to that virtual serial. I do this for all our OCFS hosts and have a log of anything going on, including BIOS screen. If ocfs2 is fencing because of I/O issues, it will show there. Last night we updated the firmware and drivers from HP of the DL580G4 server and today we had another reboot (now with the following messages in /var/log/messages): NODE 1: -- Feb 3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4 (num 3) at 10.0.2.52: has been idle for 10.0 seconds, shutting it down. 
Feb 3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1233670362.97595 now 1233670372.96280 dr 1233670362.97580 adv 1233670362.97604:1233670362.97604 func (c77ed98a:504) 1233670067.138220:1233670067.138233) Feb 3 14:12:52 grid2db1 kernel: o2net: no longer connected to node grid2db4 (num 3) at 10.0.2.52: Feb 3 14:16:26 grid2db1 syslogd 1.4.1: restart. Feb 3 14:16:26 grid2db1 syslog: syslogd startup succeeded NODE 4: -- Feb 3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device sdl after 6 milliseconds Feb 3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24 blocking operations (cur = 18): Feb 3 14:16:27 grid2db4 syslogd 1.4.1: restart. Feb 3 14:16:27 grid2db4 syslog: syslogd startup succeeded Other reboots simple don't log any error message. So my question is if it's possible this reboots are triggers by OCFS2 and how to debug this problem? Should I change the timeout parameters? We are also planning to upgrade to OCFS2 1.2.9-1 and OCFS2 Tools 1.2.7- 1 and latest distro kernel, any catch? Best regards and thanks for any answer. Pedro Figueira Serviço de Estrangeiros e Fronteiras Direcção Central de Informática Departamento de Produção Telefone: + 351 217 115 153 -Original Message- From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users- boun...@oss.oracle.com] On Behalf Of Sunil Mushran Sent: sábado, 31 de Janeiro de 2009 15:59 To: Carl Benson Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] one node rejects connection from new node Nodes can be added to an online cluster. The instructions are listed in the user's guide. On Jan 31, 2009, at 7:53 AM, Carl Benson cben...@fhcrc.org wrote: Sunil, Thank you for responding. I will try o2cb_ctl on Monday, when I have physical access to hit Reset in case one or more nodes lock up. 
If there really is a requirement to restart the cluster on wilson1 every time I add a new node (and I have five or six more nodes to add), that is too bad. Wilson1 is a 24x7 production system. --Carl Benson Sunil Mushran wrote: Could be that the cluster was already online on wilson1 when you propagated the cluster.conf to all nodes. If so, restart the cluster on that node. To add a node to an online cluster, you need to use the o2cb_ctl command. Details are in the 1.4 user's guide. Carl J. Benson wrote: Hello. I have three systems that share an ocfs2 filesystem, and I'm trying to add a fourth system. These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default. All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9 cluster.conf looks like this: node: ip_port = ip_address = 140.107.170.116 number = 0 name = merlot1 cluster = ocfs2 node: ip_port = ip_address = 140.107.158.54 number = 1 name =
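On the timeout question raised earlier in this thread: as far as I recall from the OCFS2 FAQ, the o2cb heartbeat dead threshold counts two-second heartbeat iterations, so the disk heartbeat timeout is (threshold - 1) * 2 seconds. The sketch below only does that arithmetic; the function names are hypothetical, and the relation should be checked against the documentation for the installed release before changing anything.

```shell
# Hypothetical helpers: convert between the o2cb heartbeat dead threshold
# and the resulting timeout in seconds, assuming the relation
#   timeout = (threshold - 1) * 2
# described in the OCFS2 FAQ (an assumption to verify for your version).
hb_threshold_to_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

hb_seconds_to_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Usage:
#   hb_threshold_to_seconds 31    # threshold quoted in this thread
#   hb_seconds_to_threshold 120   # threshold needed for a 120 s timeout
```

Under that assumption, the threshold of 31 quoted in the thread corresponds to a 60 second heartbeat timeout.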
Re: [Ocfs2-users] ocfs2 hangs during webserver usage
-Original Message- From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users- boun...@oss.oracle.com] On Behalf Of David Johle Sent: 01/28/2009 10:12 To: jmose...@corp.xanadoo.com Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] ocfs2 hangs during webserver usage At 06:32 PM 1/27/2009, jmose...@corp.xanadoo.com wrote: As others have indicated, I don't think that's going to work very well. You've got two different nodes trying to write to the same file constantly. I would keep each server's log on a locally mounted file system, or simply keep the logs on the OCFS2 filesystem, but have each node write to different log files. Yeah, that makes parsing access_logs slightly more of a problem for producing hit reports, etc, but I think you'll notice performance improve. Yes, parsing logs is just one good reason for having unified log files -- one of the motivations for using OCFS2 even. If our statistics program can handle multiple files, then at least having them in a shared directory would be useful. Another major area this would affect is web site issue troubleshooting which outputs to log files (not the access logs but others). I can only imagine the complexity of having to deal with locating specific logging information for a site user who is having trouble by going to 5 different nodes to dig through locally stored log files. Or worse yet, trying to correlate actions of multiple users who are each hitting different nodes! On that note, these other logs are written to by our aplications running under Tomcat. I really am not seeing any similar lags for those processes, only from apache. The only big difference I can see between them is the I/O pattern -- apache is usually 1 line per request as they are serviced, java web apps are more bursts of numerous lines, but not every request. There is still a non-trivial amount of logging happening for these java apps though, so I am surprised. 
In fact, Tomcat itself is configured to log each request with the processing time (used to produce user response time statistics), but those shared logs don't seem to be a point of contention like the apache access logs. For informational purposes, here are some line counts for logs on our main web site yesterday: 1577860 access log 1361 error log 4887437 web app log 340164 processing time log 6806822 total So only about 20% of the requests are handled by Tomcat. The web app log actually writes 3x as many lines, but overall it's less data (373M vs. 428M) and fewer actual write operations. This could explain why it is not/less prone to these write delays. 1.5 million hits for access log is not that much and you should be able to use separate files and then combine it into 1 before processing. The tools are out there for that. Another option is to send Apache logs to syslog, which means you have now 1 process receiving and writing the logfiles. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Another node is heartbeating in ourslot! errorswith LUN removal/addition
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:ocfs2-users-[EMAIL PROTECTED] On Behalf Of Brian Kroth
Sent: 12/05/2008 06:11
To: Daniel Keisling
Cc: ocfs2-users@oss.oracle.com; Joel Becker
Subject: Re: [Ocfs2-users] Another node is heartbeating in ourslot! errorswith LUN removal/addition

Just for clarity, can you post the proper sequence you're now using to take SAN based snapshots? I'd like to try this on a new cluster I'm setting up.

Thanks, Brian

Here is how we do backup and refresh of development databases from our production database. The SANs involved in this are 3Par E200 and S400, using Rcopy and SnapClone.

- Production database gets put into backup mode
- Execute either Rcopy refresh or SnapClone refresh on the S400 (Rcopy for Dev, SnapClone for Backup)
- Take production database out of backup mode

For backups we continue with:

- Run fsck to replay the journal
- Mount and unmount the volume on the backup server to clear the dirty flag; just running fsck will not do that
- Reset the UUID and label of the volume at this point, so that we do not run into issues if we want to mount 2 different versions of the SnapClone
- Run fsck one more time to ensure there are no errors
- Mount volume
- Recover database via log files
- Clean shutdown of database
- Back up database in cold state
- Unmount volume

For development database refresh:

- The Rcopy above refreshed a master volume on the E200 SAN
- We shut down development database X (we have several copies) and unmount the volume on all nodes in the RAC cluster
- Run SnapClone refresh command on SAN
- Run fsck from one node to replay journal
- Mount and unmount volume on one node to clear dirty flag
- Reset UUID and label
- Run fsck one more time
- Mount volume on all RAC nodes again
- Recover database from logs and modify name to the development name
- After this, other scripts run to modify contents in the database (email addresses, phone numbers, etc.)

And voila, the database is ready for use by developers.

Ulf Zimmermann | Senior System Architect
OPENLANE
4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025
O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929
Email: [EMAIL PROTECTED] | Web: www.openlane.com
Re: [Ocfs2-users] Node reboot during network outage
We are running a bonded interface too, with two separate switches behind it. We are using two Cisco 2960s for this interconnect; they run separate VLANs for different Oracle clusters, but we do not run spanning tree on them. A failover takes about 100 ms for us, and we simulated a switch failure by turning one off. We also had an Ethernet port in a machine itself fail, and it switched to the second port almost immediately.

Ulf Zimmermann | Senior System Architect
OPENLANE
4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025
O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929
Email: [EMAIL PROTECTED] | Web: www.openlane.com

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:ocfs2-users-[EMAIL PROTECTED] On Behalf Of Sunil Mushran
Sent: 04/22/2008 10:20
To: Mick Waters
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Node reboot during network outage

The issue is not the time the switch takes to reboot. The issue is the amount of time the secondary switch takes to find a unique path.

http://en.wikipedia.org/wiki/Spanning_tree_protocol

Mick Waters wrote:

Thanks Sunil,

The network switch is brand new but has a fairly complex configuration due to us running a number of VLANs. However, we have found that it has always taken quite a while to reboot. I'll try increasing the idle timeout as suggested and let you know what happens.

However, surely this is only treating the symptoms of what is, after all, a contrived scenario. Rebooting the switch is supposed to test what would happen if we had a real network outage. What if the switch were to stay down?

My issue is that we have an alternative route via the other NIC in the bond and the other switch. The affected nodes in the cluster shouldn't fence, because they should still be able to see all of the other nodes in the cluster via this other route. Does this make sense?

Regards, Mick.
-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: 22 April 2008 17:40
To: Mick Waters
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Node reboot during network outage

The interface died at 14:25:44 and recovered at 14:27:43. That's two minutes. One solution is to increase o2cb_idle_timeout to 2 mins. A better solution would be to look into your router setup to determine why it is taking 2 minutes for the router to reconfigure.

Mick Waters wrote:

Hi, my company is in the process of moving our web and database servers to new hardware. We have an HP EVA 4100 SAN which is being used by two database servers running in an Oracle 10g cluster, and that works fine.

We have gone to extreme lengths to ensure high availability. The SAN has twin disk arrays and twin controllers, and all servers have dual fibre interfaces. Networking is (should be) similarly redundant, with bonded NICs connected in a two-switch configuration, two firewalls and so on.

We also want to share regular Linux filesystems between our servers, HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5), and we chose OCFS2 (1.2.8) to manage the cluster. As stated, each server in the 4 node cluster has a bonded interface set up as bond0 in a two-switch configuration (each NIC in the bond is connected to a different switch). Because this is a two-switch configuration, we are running the bond in active-standby mode, and this works just fine.

Our problem occurred when we were doing failover testing, where we simulated the loss of one of the network switches by powering it off. The result was that the servers rebooted, and this makes a mockery of our attempts at an HA solution.
Here is a short section from /var/log/messages following a reboot of one of the switches to simulate an outage:

--
Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50: has been idle for 30.0 seconds, shutting it down.
Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:
Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
Apr 22 14:28:35 mtkws01p1
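The "two minutes" in this thread comes from the two bonding lines in the log (interface down at 14:25:44, up again at 14:27:43), compared with the 30.0 second idle limit that o2net reports. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the sketch assumes both timestamps fall on the same day.

```shell
# Hypothetical helpers: measure the gap between two HH:MM:SS timestamps
# taken from the same day of a syslog file.
hms_to_seconds() {
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

outage_seconds() {
    echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

# Usage, with the timestamps from the log above:
#   outage_seconds 14:25:44 14:27:43
# An idle timeout would have to exceed this value (the thread suggests
# 2 minutes, i.e. 120000 ms) for the o2net connection to survive the outage.
```

In the 1.2-series tools the idle timeout is, as far as I recall, set in milliseconds through O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb; treat that name as an assumption and confirm it for the installed version, since the thread itself only calls it o2cb_idle_timeout.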
[Ocfs2-users] OCFS2 and Cloning
I am currently working on cloning our production OCFS2 volumes to our test environment on a regular basis. For the database (Oracle 10G R2 RAC) we put it into backup mode, then execute a SnapClone on our 3Par SAN. Then we use RemoteCopy and SnapClone to our development 3Par SAN. To recover the OCFS2 volume I go through the following steps:

Stop database
umount /export/<volume name>
Log into SAN to refresh SnapClone
fsck.ocfs2 -y /dev/mapper/<volume name>
mount /export/<volume name>
umount /export/<volume name>
tunefs.ocfs2 -U /dev/mapper/<volume name>
tunefs.ocfs2 -L /export/<volume name> /dev/mapper/<volume name>
mount /export/<volume name>
Go through steps to recover and rename database
Start database

This seems to work, although I am curious why I have to mount/umount the volume in between fsck and tunefs. The fsck obviously will go through and recover the journal, but unless I mount/umount the volume once, tunefs will come back with a dirty file system error. Are there any other steps I should be doing, or does this sequence look ok?

Regards, Ulf.

ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
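The volume-level part of the sequence above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, its three arguments and the RUN variable are hypothetical, the tool names are assumed to be fsck.ocfs2 and tunefs.ocfs2 from ocfs2-tools, and the -y, -U and -L options are used exactly as quoted in the message. By default it only prints the commands, so nothing touches a device unless RUN is set to an empty string.

```shell
# Hypothetical sketch of the post-refresh steps for one cloned volume.
# Arguments: device, mount point (must have an fstab entry), new label.
# With RUN unset the commands are echoed (dry run); RUN= executes them.
refresh_clone_volume() {
    dev=$1
    mnt=$2
    label=$3
    run=${RUN-echo}

    $run fsck.ocfs2 -y "$dev"            # replay the journal
    $run mount "$mnt"                    # mount once ...
    $run umount "$mnt"                   # ... and unmount to clear the dirty flag
    $run tunefs.ocfs2 -U "$dev"          # reset the UUID, as in the message
    $run tunefs.ocfs2 -L "$label" "$dev" # set the new label
    $run mount "$mnt"                    # final mount for database recovery
}

# Usage (dry run, prints six commands):
#   refresh_clone_volume /dev/mapper/vol1 /export/vol1 /export/vol1
```

Stopping the database, refreshing the SnapClone on the SAN and the database recovery steps are deliberately left out, since they depend on site-specific tooling not shown in the message.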
RE: [Ocfs2-users] Anyone have an idea how to find file i/othroughput?
I will look at it. In the meantime, I did find at least one of the standby processes reading in bursts every 60-70 seconds, something like 400MB in 14.xx seconds from control01.ctl, even though that file is only 94MB in size.

-----Original Message-----
From: Andrew Phillips [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 19, 2008 01:43
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: RE: [Ocfs2-users] Anyone have an idea how to find file i/othroughput?

Ulf,

Have you considered using systemtap? There is a recipe here that could be used to find out what's going on:

http://sourceware.org/systemtap/wiki/WSDeviceMonitor?highlight=%28%28WarStories%29%29

I'm not sure how well that would work with ocfs2. Unlike dtrace, systemtap can be more uneven in coverage. It's also something that requires a bit of fiddling (installing debuginfo packages). The recipe above traps vfs_read and vfs_write, so it should work as a first stab at identifying the process id that's causing the I/O.

I'd also advise some thought if it's to be used in a production environment. Having said that, I've used it on a production Oracle RAC database server and found it very valuable. I don't recall you mentioning the distribution, but RH, CentOS, and Oracle's version of CentOS should all work. As always, read the instructions on the label, etc...

Andy

On Mon, 2008-02-18 at 23:14 -0800, Ulf Zimmermann wrote:

Forgot to mention, this remote server is just Oracle. It has one standby database and one local database; the local one is supposed to be idle, i.e. nothing connects to it besides an occasional availability check. While the primary database of the standby was down, I saw less disk read access, but every 5 minutes, for about 60 seconds, I would see 50-60MB/sec. After the primary came back up, read access is as high as 160MB/sec. We are only seeing it on this single node of the remote standby. The local standby (on EXT3) is not doing the same thing.
-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Monday, February 18, 2008 19:28
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Anyone have an idea how to find file i/o throughput?

If a userspace process is behind the io surge, then strace should help. But determining the process may require a bit of trial and error.

Ulf Zimmermann wrote:

We have a remote Oracle 10g R2 standby running on OCFS2. Initially, when we started the standby, read I/O was 5MB/sec on average. Since then it has grown to over 40MB/sec (as a longer-term average; it peaks much higher). Here is a graph showing this:

http://www.alameda.net/~ulf/dbphx01.png

We also have a local standby running (on EXT3) which is not showing the same symptom. I am trying to find where all these reads are happening. Does anyone have an idea how to figure that out on Linux?

Ulf.
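The strace suggestion in this thread can be sketched as follows. The idea, under stated assumptions, is to attach strace to a suspect process with its read calls written to a file, then total the bytes returned per file descriptor. The helper name and the trace file path are hypothetical, and the syscall names to trace (read, pread64) are an assumption that may differ by strace version and architecture.

```shell
# Hypothetical helper: sum the bytes returned by read-style syscalls, per
# file descriptor, from an strace log captured with something like
#   strace -f -p <pid> -e trace=read,pread64 -o /tmp/io.trace
sum_reads_per_fd() {
    awk '
        # Match lines such as: 1234 read(7, "..."..., 8192) = 8192
        # Failed calls end in an error string, not a number, and are skipped.
        match($0, /(read|pread64)\([0-9]+,/) && $NF ~ /^[0-9]+$/ {
            fd = substr($0, RSTART, RLENGTH)
            sub(/^[a-z0-9]+\(/, "", fd)   # drop the syscall name and "("
            sub(/,$/, "", fd)             # drop the trailing comma
            bytes[fd] += $NF
        }
        END { for (fd in bytes) printf "fd %s: %d bytes\n", fd, bytes[fd] }
    ' "$1"
}

# Usage:
#   sum_reads_per_fd /tmp/io.trace
# The descriptor numbers can then be mapped to files with
#   ls -l /proc/<pid>/fd
```

As the thread notes, finding the right process may still take some trial and error, and attaching strace slows the traced process, so this is better tried briefly than left running.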
[Ocfs2-users] OCFS2 DLM problems
Hello everyone, once again. We are running into a problem which has now shown up 2 times, possibly 3 (once the systems looked different).

The environment is 6 HP DL360/380 G5 servers, with eth0 being the public interface, eth1 and bond0 (eth2 and eth3) used for clusterware, and bond0 also used for OCFS2. The bond0 interface is in active/passive mode. There are no network error counters showing, and even during the problem we can communicate via the bond0 interface.

This setup has been running for more than 2 months, but last Wednesday morning and again today we had 2 nodes causing locking problems. The problem starts with messages like this:

Jan 23 03:20:44 dbprd01 kernel: o2net: no longer connected to node dbprd02 (num 1) at 192.168.202.2:
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107
Jan 23 03:20:44 dbprd02 kernel: (5096,0):o2net_sendpage:868 ERROR: sendpage of size 24 to node dbprd01 (num 0) at 192.168.202.1: failed with -11
Jan 23 03:20:44 dbprd02 kernel: o2net: no longer connected to node dbprd01 (num 0) at 192.168.202.1:

After these there are plenty more messages, such as dlm_wait_for_node_death and dlm_send_remote_convert_request on dbprd02, and dlm_send_proxy_ast_msg and dlm_flush_asts on dbprd01.

We are currently running OCFS2 1.2.5; the kernel is EL4 Update 5 x86_64 (2.6.9-55.ELsmp). I see there is one bug fixed in 1.2.6/1.2.7 related to DLM, and I was wondering if the above problem could be related to it or if this is something different.

Ulf Zimmermann | Senior System Architect
ATC-Onlane, Inc.
4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025
O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929
Email: [EMAIL PROTECTED] | Web: www.atc-onlane.com
RE: [Ocfs2-users] OCFS2 DLM problems
Currently running 1.2.5-1, so we should upgrade. Is there any explanation of how this bug gets triggered? We are trying to understand why we are suddenly hitting it, as this setup has been running for several months without triggering it.

-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 23, 2008 9:58 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 DLM problems

1.2.5-what? If you are not on 1.2.5-6, upgrade to that. It could be you are hitting the following issue addressed in that release:

r3033 tcp - Retry sendpage() if it returns EAGAIN (bugzilla#896)

No, don't upgrade to 1.2.7. We just discovered an issue in it and will be releasing 1.2.8 shortly.
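The status values in the kernel messages quoted in this thread are negated Linux errno numbers: -11 is EAGAIN, which matches the sendpage() retry fix mentioned above, and -107 is ENOTCONN, which fits the "no longer connected" lines. A minimal lookup sketch follows; the function name is hypothetical and only the two codes seen in this thread are covered.

```shell
# Hypothetical helper: name the negated errno values that appear in the
# o2net/dlm log lines of this thread (Linux errno numbering).
o2_status_name() {
    case "$1" in
        -11)  echo "EAGAIN (try again)" ;;
        -107) echo "ENOTCONN (transport endpoint is not connected)" ;;
        *)    echo "unknown" ;;
    esac
}

# Usage:
#   o2_status_name -11
#   o2_status_name -107
```

Read this way, the dbprd02 line shows a send that should have been retried, and the -107 errors on dbprd01 follow from the connection having been dropped.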
RE: [Ocfs2-users] OCFS2 DLM problems
It looks like around 3:20am we had about 800 to 1,200 packets per second coming in per node. But the packets were not large; it looks like less than 1 Mbit/sec. 4 of the nodes are connected to our front-end application servers and they would be pretty much idle at 3am. Our first customers usually do not log in until just about then (East Coast people starting to get to the dealerships), and only in small numbers. We did not have much batch processing happening on the 5th and 6th nodes. We are planning on upgrading to 1.2.5-6 tonight, but people here want to know more about why it is suddenly happening now.

-Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 1:07 PM To: Ulf Zimmermann Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] OCFS2 DLM problems

Depends on the net traffic I guess. The error returned asks the user to retry and the older code wasn't. AFAIR, we have never encountered this in our main test cluster.

Ulf Zimmermann wrote: Currently running 1.2.5-1 so we should upgrade. Is there any explanation how this bug gets triggered? We are trying to understand why we are suddenly hitting this bug, as this has been running for several months without being triggered.

-Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 23, 2008 9:58 AM To: Ulf Zimmermann Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] OCFS2 DLM problems

1.2.5-what? If you are not on 1.2.5-6, upgrade to that. It could be you are hitting the following issue addressed in that release. r3033 tcp - Retry sendpage() if it returns EAGAIN (bugzilla#896) No, don't upgrade to 1.2.7. We just discovered an issue in it and will be releasing 1.2.8 shortly.

Ulf Zimmermann wrote: Hello everyone, once again. We are running into a problem, which has shown now 2 times, possible 3 (once the systems looked different.)
The environment is 6 HP DL360/380 g5 servers with eth0 being the public interface, eth1 and bond0 (eth2 and eth3) used for clusterware and bond0 also used for OCFS2. The bond0 interface is in active/passive mode. There are no network errors counters showing and even during the problem we can communicate via the bond0 interface. This setup has been running for more then 2 months but last Wednesday morning and today again, we had 2 nodes causing locking problems. The problem starts with messages like this: Jan 23 03:20:44 dbprd01 kernel: o2net: no longer connected to node dbprd02 (num 1) at 192.168.202.2: Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107 Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107 Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107 Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107 Jan 23 03:20:44 dbprd02 kernel: (5096,0):o2net_sendpage:868 ERROR: sendpage of size 24 to node dbprd01 (num 0) at 192.168.202.1: failed with -11 Jan 23 03:20:44 dbprd02 kernel: o2net: no longer connected to node dbprd01 (num 0) at 192.168.202.1: After these there are plenty of more messages, such as dlm_wait_for_node_death, dlm_send_remote_convert_request on dbprd02 and dlm_send_proxy_ast_msg, dlm_flush_asts on dbprd01. We are currently running OCFS2 1.2.5, the kernel is EL4 Update 5 x86_64 (2.6.9-55.ELsmp). I see there is one bug fixed in 1.2.6/1.2.7 related to DLM and I was wondering if the above problem could be related to it or if this is something different. Ulf Zimmermann | Senior System Architect ATC-Onlane, Inc. 4600 Bohannon Drive, Suite 100 Menlo Park, CA 94025 O: 650-532-6382 M: (510) 396-1764 F: (510) 580-0929 Email: [EMAIL PROTECTED] | Web: www.atc-onlane.com DISCLAIMER: This e-mail and any attachments are confidential and also may be privileged. 
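Sunil's explanation in this thread is that EAGAIN asks the caller to retry and that the older code did not. The sketch below shows that general retry pattern in user-space Python. It is not the o2net kernel code from r3033; `send_with_retry` and `FlakySender` are hypothetical names used only to illustrate the idea.

```python
import errno
import time


def send_with_retry(send, payload, attempts=5, delay=0.01):
    """Call send(payload), retrying while it fails with EAGAIN.

    Any other error, or EAGAIN on the final attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return send(payload)
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            time.sleep(delay)  # back off briefly before trying again


class FlakySender:
    """Hypothetical stand-in for a socket that is busy for a few calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, payload):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
        return len(payload)


if __name__ == "__main__":
    sender = FlakySender(failures=2)
    sent = send_with_retry(sender, b"x" * 24)
    print(f"sent {sent} bytes after {sender.calls} calls")
```

Without the retry, the first EAGAIN would surface as a hard failure, which matches the pattern in the logs where a single failed sendpage is followed by the connection being torn down.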
RE: [Ocfs2-users] Missing something basic...
You need shared storage to use OCFS, not local storage on each server. -Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Benjamin Smith Sent: Wednesday, October 17, 2007 18:00 To: ocfs2-users@oss.oracle.com Subject: [Ocfs2-users] Missing something basic... I'm stumped. I'm doing some research on clustered file systems to be deployed over winter break, and am testing on spare machines first. I have two identically configured computers, each with a 10 GB partition, /dev/hda2. I intend to combine these two LAN/RAID1 style to represent 10 GB of redundant cluster storage, so that if either machine fails, computing can resume with reasonable efficiency. These machines are called cluster1 and cluster2, and are currently on a local Gb LAN. They are running CentOS 4.4 (recompile of RHEL 4.4) I've set up SSH RSA keys so that I can ssh directly from either to the other without passwords, though I use a non-standard port, defined in ssh_config and sshd_config. I've installed the RPMs without incident. I've set up a cluster called ocfs2 with nodes cluster1 and cluster2, with the corresponding LAN IP addresses. I've confirmed that configuration changes populate to cluster2 when I push the appropriate button in the X11 ocfs2console on cluster1. I've checked the firewall(s) to allow inbound TCP to port connections on both machines, and verified this with nmap. I've also tried turning off iptables completely. On cluster1, I've formatted and mounted partition oracle to /meda/cluster using the ocfs2console and I can r/w to this partition with other applications. There's about a 5-second delay when mounting/unmounting, and the FAQ reflects that this is normal. SELinux is completely off. Questions: 1) How do I get this oracle partition to show/mount on host cluster2, and subsequent systems added to the cluster? Should I be expecting a /dev/* block device to mount, or is there some other program I should be using, similar to smbmount? 
2) How do I get this /dev/hda2 (aka oracle) on cluster1 to combine (RAID1 style) with /dev/hda2 on cluster2, so that if either host goes down I still have a complete FS to work from? Am I misunderstanding the abilities and intentions of OCFS2? Do I need to do something with NBD, GNBD, ENBD, or similar? If so, what's the recommended approach? Thanks, -Ben
RE: [Ocfs2-users] Cluster setup
You have Oracle people telling us not to use bonding. -Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Thursday, October 11, 2007 15:28 To: Ulf Zimmermann Cc: Randy Ramsdell; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup How is this a fs problem? Ulf Zimmermann wrote: We don't and when we were investigating why we had on the ProCurve 4108gl reassembly problems, we were specific asked if we are doing bonding or VLAN tagging (neither we were doing). Just looks like the ProCurve are loosing packets without telling so. We switched in Cisco 2960G-48 with Jumbo Frames now and haven't had any reassembly timeouts since then. Global Cache timeout has gone down significant. Each Interconnect for Oracle 10G has its own Cisco 2960G-48 now. -Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Thursday, October 11, 2007 15:13 To: Ulf Zimmermann Cc: Randy Ramsdell; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup Use network bonding. Ulf Zimmermann wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Alexei_Roudnev Sent: Thursday, October 11, 2007 11:10 To: Sunil Mushran; Randy Ramsdell Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup I explained you: 1 - single heartbeat interface IS A BUG for me. I haven't really followed the whole discussion but that point above did just come to my mind a few days ago when we replaced our HP ProCurve 4108gl used for 3 separate Interconnects on 10g, where only 1 also carries the OCFS2 heartbeat. So if that switch dies, OCFS2 will go down while Oracle 10g could survive (if OCFS2 wouldn't die). I have to agree that is a bad design at this point. Heartbeat should also be on at least 2 links for OCFS2. Ulf. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
RE: [Ocfs2-users] Cluster setup
-Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Alexei_Roudnev Sent: Thursday, October 11, 2007 11:10 To: Sunil Mushran; Randy Ramsdell Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup

I explained you: 1 - single heartbeat interface IS A BUG for me.

I haven't really followed the whole discussion, but the point above came to mind a few days ago when we replaced our HP ProCurve 4108gl, which was used for 3 separate interconnects on 10g, only 1 of which also carries the OCFS2 heartbeat. If that switch dies, OCFS2 goes down, while Oracle 10g could survive if OCFS2 did not die. I have to agree that this is a bad design at this point. The OCFS2 heartbeat should also run over at least 2 links. Ulf.
RE: [Ocfs2-users] Cluster setup
We don't, and when we were investigating why we had reassembly problems on the ProCurve 4108gl, we were specifically asked whether we were doing bonding or VLAN tagging (we were doing neither). It just looks like the ProCurves are losing packets without reporting it. We have now swapped in Cisco 2960G-48 switches with jumbo frames and haven't had any reassembly timeouts since then. Global Cache timeouts have gone down significantly. Each interconnect for Oracle 10g now has its own Cisco 2960G-48.

-Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Thursday, October 11, 2007 15:13 To: Ulf Zimmermann Cc: Randy Ramsdell; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup

Use network bonding.

Ulf Zimmermann wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Alexei_Roudnev Sent: Thursday, October 11, 2007 11:10 To: Sunil Mushran; Randy Ramsdell Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Cluster setup

I explained you: 1 - single heartbeat interface IS A BUG for me.

I haven't really followed the whole discussion but that point above did just come to my mind a few days ago when we replaced our HP ProCurve 4108gl used for 3 separate Interconnects on 10g, where only 1 also carries the OCFS2 heartbeat. So if that switch dies, OCFS2 will go down while Oracle 10g could survive (if OCFS2 wouldn't die). I have to agree that is a bad design at this point. Heartbeat should also be on at least 2 links for OCFS2. Ulf.
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
-Original Message- From: Mark Fasheh [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 15, 2007 16:49 To: Ulf Zimmermann Cc: Sunil Mushran; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote: Index 22: took 10003 ms to do waiting for write completion *** ocfs2 is very sorry to be fencing this system by restarting *** There were no SCSI errors on the console or logs around the time of this reboot. It looks like the write took too long - as a first step, you might want to up the disk heartbeat timeouts on those systems. Run: $ /etc/init.d/o2cb configure on each node to do that. That won't hide any hardware problems, but if the problem is just a latency to get the write to disk, it'd help tune it away. --Mark Ok, we had now 4 reboots, plus 2 more by my own action, which were by OCFS2 fencing. As said in previous emails we were seeing some SCSI errors and although device-mapper-multipath seems to take care of it, sometimes the 10 second configured in multipath.conf and the default timings of o2cb are colliding. On the two clusters we have run into this, I have now replaced several fibre cables and it seems we also have 1 bad port on one of the fibre channel switches. Swapped first cable, still problems. Swapped SPF, still problem, moved node to another port from where the SPF was swapped from, 0 errors. Now I am still concerned about the timing of device-mapper-multipath and o2cb. O2cb is currently set to the default of: Specify heartbeat dead threshold (=7) [7]: Specify network idle timeout in ms (=5000) [1]: Specify network keepalive delay in ms (=1000) [5000]: Specify network reconnect delay in ms (=2000) [2000]: So the timeout I seem to hit is the 10,000 of network idle timeout? Even this timeout occurs on the disk? What values would you recommend I should set this to? Another question in case someone can answer this. 
If I get syslog entries like:

Aug 16 00:44:33 dbprd01 kernel: SCSI error : 1 0 0 1 return code = 0x2
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452448
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452456
Aug 16 00:44:33 dbprd01 kernel: SCSI error : 1 0 1 1 return code = 0x2
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242384
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242392
Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2

does this actually error out all the way, or does the request still go to one of the remaining paths? If the request doesn't error out because it could still be fulfilled via the 2 remaining paths, then it is really just a matter of timing between device-mapper-multipath recovering the request through the remaining paths and our o2cb settings. If not, we might still have another problem. We have seen many such errors but have only had about 8 reboots, all of which I now think are attributable to fencing. Regards, Ulf.
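On the question of which timeout is being hit: the fencing message elsewhere in this thread reports a heartbeat write timeout of 12000 milliseconds with the dead threshold at its default of 7. That is consistent with the relation given in the OCFS2 FAQ for this release series, timeout = (threshold - 1) * 2 seconds. The sketch below encodes that relation. The 2-second heartbeat interval is an assumption taken from that formula, and the function names are illustrative.

```python
# Assumption: the disk heartbeat runs every 2 seconds, so a node is declared
# dead after (threshold - 1) missed intervals.
HEARTBEAT_INTERVAL_MS = 2000


def heartbeat_timeout_ms(dead_threshold):
    """Disk heartbeat timeout implied by a given dead threshold."""
    if dead_threshold < 2:
        raise ValueError("dead threshold must be at least 2")
    return (dead_threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for_timeout_ms(timeout_ms):
    """Smallest dead threshold whose timeout is at least timeout_ms."""
    intervals = -(-timeout_ms // HEARTBEAT_INTERVAL_MS)  # ceiling division
    return intervals + 1


if __name__ == "__main__":
    print("threshold 7 ->", heartbeat_timeout_ms(7), "ms")
    print("60 s timeout -> threshold", threshold_for_timeout_ms(60000))
```

Under this reading, a write that stalls for about 10 seconds on top of a 2-second sleep reaches the 12-second limit, so the disk heartbeat threshold rather than the network idle timeout is the setting to raise when multipath failover can take around 10 seconds.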
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
-Original Message- From: Mark Fasheh [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 15, 2007 16:49 To: Ulf Zimmermann Cc: Sunil Mushran; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots

On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote: Index 22: took 10003 ms to do waiting for write completion *** ocfs2 is very sorry to be fencing this system by restarting *** There were no SCSI errors on the console or logs around the time of this reboot.

It looks like the write took too long - as a first step, you might want to up the disk heartbeat timeouts on those systems. Run: $ /etc/init.d/o2cb configure on each node to do that. That won't hide any hardware problems, but if the problem is just a latency to get the write to disk, it'd help tune it away. --Mark

The SAN is a 3Par E200, which writes into cache on its two controllers, acknowledges the write, and only then actually writes it to disk. I have not found any reason for this delay yet, so I am stumped so far as to why the write took so long. Ulf.
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
-Original Message- From: Mark Fasheh [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 15, 2007 17:50 To: Ulf Zimmermann Cc: Sunil Mushran; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots

On Wed, Aug 15, 2007 at 05:43:14PM -0700, Ulf Zimmermann wrote: The SAN is a 3Par E200, which does write into cache on its two controllers, then acknowledges a write and then writes it actually to disk. I have not found any reason for this delay yet, so sofar I am stumped why it had such a long delay writing.

Are you saying that the controllers are doing write-back caching? If they're in that sort of mode, you need to change it to write-through for a clustered environment. --Mark

The controller receiving the request mirrors it to the second controller (in this case there is only 1 other; there can be up to 7). Then it acknowledges the request and writes it to disk. Each controller has dual batteries so that it can finish any pending writes. If a controller fails, the remaining one will only acknowledge a write after it is physically on disk. This is part of the 3Par operation. I have submitted a request to 3Par to check the extensive logs they generate to see if there is anything that can explain this write delay. The previous reboots, for which we have no console logs, may have been OCFS2 fencing or something else. All of them happened while the cluster was pretty much idle, while this time there was activity (an import). Monday's reboot was the first since the initial 4 reboots. I wish OCFS2 would log to more than just the console, so that we had evidence for the other reboots. Ulf.
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
-Original Message- From: Mark Fasheh [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 15, 2007 18:04 To: Alexei_Roudnev Cc: Ulf Zimmermann; Sunil Mushran; ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots

On Wed, Aug 15, 2007 at 05:52:49PM -0700, Alexei_Roudnev wrote: ANY SCSI controller can quietly delay IO for 10 - 20 seconds, without errors and explanations. 10 seconds threshold in OCFSv2 will never work properly.

That has nothing to do with what I'm asking him. Ulf described his controller thusly: does write into cache on its two controllers, then acknowledges a write and then writes it actually to disk. I'm keying in on the part where it acknowledges a write (presumably to the host os) and _then_ pushes that write out to the disk. In general, that's the wrong order ;) Anyway, getting back to the task of trying to fix someone's problem, I admit that I don't really know whether it's possible for a controller to do writeback caching, I'm just trying to clarify what's going on, that's all. --Mark

I primarily posted the messages as a follow-up for now. I am waiting for 3Par to tell me whether they have anything in the logs before I decide how to proceed, i.e. whether or not to raise the write timeout. The first 4 reboots we had, which may or may not have been OCFS2, happened on our 3Par S400, which has 16GB of cache per controller. The last reboot, for which I do have the console messages (thanks HP for iLO and virtual serial plus Conserver :-) ), happened on our E200, which has 8GB of cache per controller. We also have some SCSI errors on some nodes, and I am currently awaiting a maintenance window to replace two FC cables to see if that clears up the errors. As you can see, all kinds of things are unfortunately going on. And I am officially on vacation right now too. Sigh. Ulf.
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
One node of our 4-node cluster rebooted last night:

(11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-1 after 12000 milliseconds
Heartbeat thread (11) printing last 24 blocking operations (cur = 22):
Heartbeat thread stuck at waiting for write completion, stuffing current time in to that blocker (index 22)
Index 23: took 0 ms to do checking slots
Index 0: took 1 ms to do waiting for write completion
Index 1: took 1997 ms to do msleep
Index 2: took 0 ms to do allocating bios for read
Index 3: took 0 ms to do bio alloc read
Index 4: took 0 ms to do bio add page read
Index 5: took 0 ms to do submit_bio for read
Index 6: took 8 ms to do waiting for read completion
Index 7: took 0 ms to do bio alloc write
Index 8: took 0 ms to do bio add page write
Index 9: took 0 ms to do submit_bio for write
Index 10: took 0 ms to do checking slots
Index 11: took 0 ms to do waiting for write completion
Index 12: took 1992 ms to do msleep
Index 13: took 0 ms to do allocating bios for read
Index 14: took 0 ms to do bio alloc read
Index 15: took 0 ms to do bio add page read
Index 16: took 0 ms to do submit_bio for read
Index 17: took 7 ms to do waiting for read completion
Index 18: took 0 ms to do bio alloc write
Index 19: took 0 ms to do bio add page write
Index 20: took 0 ms to do submit_bio for write
Index 21: took 0 ms to do checking slots
Index 22: took 10003 ms to do waiting for write completion
*** ocfs2 is very sorry to be fencing this system by restarting ***

There were no SCSI errors on the console or logs around the time of this reboot.

-Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Ulf Zimmermann Sent: Monday, July 30, 2007 11:11 To: Sunil Mushran Cc: ocfs2-users@oss.oracle.com Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots

Too early to call. Management made the call: "This hardware seems to have been stable, let's use it."
-Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Monday, July 30, 2007 11:07 To: Ulf Zimmermann Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots So are you suggesting the reason was bad hardware? Or, is it too early to call? Ulf Zimmermann wrote: I have serial console setup with logging via conserver but so far no further crash. We also swapped hardware a bit around (another 4 node cluster with DL360g5 was working without crash for several weeks, we swapped those 4 nodes in for the first 4 in the 6 node cluster). -Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Monday, July 30, 2007 10:21 To: Ulf Zimmermann Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots Do you have a netconsole setup? If not, set it up. That will capture the real reason for the reset. Well, it typically does. Ulf Zimmermann wrote: We just installed a new cluster with 6 HP DL380g5, dual single port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a 3Par S400. We are using the 3Par recommended config for the Qlogic driver and device-mapper-multipath giving us 4 paths to the SAN. We do see some SCSI errors where DM-MP is failing a path after get a 0x2000 error from the SAN controller, but the path gets puts back in service in less then 10 seconds. This needs to be fixed but I don't think it is what is causing our reboots. 2 of the nodes rebooted once while being idle (ocfs2 and clusterware were running, no db) and one node rebooted while idle (another node was copying using fscat our 9i db from ocfs1 to the ocfs2 data volume) and once while some load was put on it via the upgraded 10g database. In all cases it is as if someone a hardware reset button. No kernel panic (at least not one leading to a stop with visable message), we can get a dirty write cache for the internal cciss controller. 
The only messages we get on the nodes appear once the crashed node is already in reset and has missed its ocfs2 heartbeat (set to the default of 7), followed later by CRS moving the VIP. Any hints on troubleshooting this would be appreciated. Regards, Ulf.
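The "last 24 blocking operations" dump at the top of this message is easier to read once it is split into entries. As a rough sketch, the following parses the "Index N: took X ms to do Y" records, whether they arrive one per line or run together as they do in the archive, and reports the slowest step that is not a deliberate sleep. The function names and the shortened sample are illustrative only.

```python
import re

# One record looks like "Index 22: took 10003 ms to do waiting for write completion".
# The lookahead stops the description at the next record, at the "***" banner,
# or at the end of the text.
OP_RE = re.compile(
    r"Index (\d+): took (\d+) ms to do (.+?)(?=\s+Index \d+:|\s+\*\*\*|\s*$)"
)


def parse_blocking_ops(text):
    """Return a list of (index, milliseconds, description) tuples."""
    return [(int(i), int(ms), what) for i, ms, what in OP_RE.findall(text)]


def slowest_operation(ops, ignore=("msleep",)):
    """Return the slowest entry, skipping intentional sleeps."""
    candidates = [op for op in ops if op[2] not in ignore]
    return max(candidates, key=lambda op: op[1])


# A few entries copied from the dump in this message, run together on one line.
SAMPLE = (
    "Index 1: took 1997 ms to do msleep "
    "Index 6: took 8 ms to do waiting for read completion "
    "Index 21: took 0 ms to do checking slots "
    "Index 22: took 10003 ms to do waiting for write completion "
    "*** ocfs2 is very sorry to be fencing this system by restarting ***"
)

if __name__ == "__main__":
    index, ms, what = slowest_operation(parse_blocking_ops(SAMPLE))
    print(f"slowest: index {index}, {ms} ms, {what}")
```

Applied to the full dump, every step other than the two sleeps finishes in under 10 ms except the final write, which is the one Mark singles out in his reply.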
RE: [Ocfs2-users] 6 node cluster with unexplained reboots
I have serial console setup with logging via conserver but so far no further crash. We also swapped hardware a bit around (another 4 node cluster with DL360g5 was working without crash for several weeks, we swapped those 4 nodes in for the first 4 in the 6 node cluster). -Original Message- From: Sunil Mushran [mailto:[EMAIL PROTECTED] Sent: Monday, July 30, 2007 10:21 To: Ulf Zimmermann Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots Do you have a netconsole setup? If not, set it up. That will capture the real reason for the reset. Well, it typically does. Ulf Zimmermann wrote: We just installed a new cluster with 6 HP DL380g5, dual single port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a 3Par S400. We are using the 3Par recommended config for the Qlogic driver and device-mapper-multipath giving us 4 paths to the SAN. We do see some SCSI errors where DM-MP is failing a path after get a 0x2000 error from the SAN controller, but the path gets puts back in service in less then 10 seconds. This needs to be fixed but I don't think it is what is causing our reboots. 2 of the nodes rebooted once while being idle (ocfs2 and clusterware were running, no db) and one node rebooted while idle (another node was copying using fscat our 9i db from ocfs1 to the ocfs2 data volume) and once while some load was put on it via the upgraded 10g database. In all cases it is as if someone a hardware reset button. No kernel panic (at least not one leading to a stop with visable message), we can get a dirty write cache for the internal cciss controller. The only messages we get on the nodes are when the crashed node is already in reset and it missed its ocfs2 heartbeat (set to the default of 7), followed later by crs moving the vip. Any hints on trouble shooting this would be appreciated. Regards, Ulf. 
RE: [Ocfs2-users] Adding new nodes to OCFS2?
I actually did the below command (for node 2 and 3), it added it to the /etc/ocfs2/cluster.conf but as far I could tell, it didn't allow me to actually mount then on node 2 or 3. But as I had to do some other work (resize another volume) I ended up rebooting and got them added that way. -Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Nuno Fernandes Sent: Monday, July 09, 2007 02:58 To: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Adding new nodes to OCFS2? Do: o2cb_ctl -C -i -n dbcl1n5 -t node -a number=4 -a ip_address=192.168.201.5 -a ip_port= -a cluster=dbcl1 on all nodes. It automaticaly updates cluster.conf e add on-the-fly the nodes. Check faq and make sure you understand all command line options. Rgds ./npf On Sunday 08 July 2007 04:18:51 Ulf Zimmermann wrote: I looked around, found older post which seems not applicable anymore. I have a cluster of 2 nodes right now, which has 3 OCFS2 file systems. All the file systems were formatted with 4 node slots. I added the two news nodes (by hand, by ocfs2console and o2cb_ctl), so my /etc/ofcfs/cluster.conf looks right: node: ip_port = ip_address = 192.168.201.1 number = 0 name = dbcl1n1 cluster = dbcl1 node: ip_port = ip_address = 192.168.201.2 number = 1 name = dbcl1n2 cluster = dbcl1 node: ip_port = ip_address = 192.168.201.3 number = 2 name = dbcl1n3 cluster = dbcl1 node: ip_port = ip_address = 192.168.201.4 number = 3 name = dbcl1n4 cluster = dbcl1 cluster: node_count = 4 name = dbcl1 But is there a way to get node 0 and 1 to dynamically accept the addition of node 2 and 3? Everything I find seems to indicate I have to unmount, run /etc/init.d/ocfs2 stop, /etc/init.d/o2cb restart and then /etc/init.d/ocfs2 start. Is there no way of telling o2cb there are two new nodes? Like a /etc/init.d/o2cb reconfigure? 
ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441 4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
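Nuno's advice is to run the same o2cb_ctl command on every node. The sketch below only assembles that command line from node details, using the flags exactly as they appear in his message (-C -i -n ... -t node -a key=value). The port value is missing from the archived command, so it is left as a caller-supplied parameter and shown here as the placeholder "<port>". The function name `add_node_command` is hypothetical.

```python
import shlex


def add_node_command(name, number, ip_address, ip_port, cluster):
    """Build the o2cb_ctl argument list for adding one node.

    Flags mirror the command quoted in this thread; ip_port must be supplied
    by the caller because the archive does not preserve its value.
    """
    return [
        "o2cb_ctl", "-C", "-i",
        "-n", name,
        "-t", "node",
        "-a", f"number={number}",
        "-a", f"ip_address={ip_address}",
        "-a", f"ip_port={ip_port}",
        "-a", f"cluster={cluster}",
    ]


if __name__ == "__main__":
    # "<port>" is a placeholder, not a real value.
    cmd = add_node_command("dbcl1n5", 4, "192.168.201.5", "<port>", "dbcl1")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the argument list once and running it on each node keeps the node number, address and cluster name identical everywhere, which is the point of the "on all nodes" instruction.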
[Ocfs2-users] Adding new nodes to OCFS2?
I looked around and found an older post which no longer seems applicable. I have a cluster of 2 nodes right now, which has 3 OCFS2 file systems. All the file systems were formatted with 4 node slots. I added the two new nodes (by hand, by ocfs2console and o2cb_ctl), so my /etc/ocfs2/cluster.conf looks right:

node:
    ip_port =
    ip_address = 192.168.201.1
    number = 0
    name = dbcl1n1
    cluster = dbcl1

node:
    ip_port =
    ip_address = 192.168.201.2
    number = 1
    name = dbcl1n2
    cluster = dbcl1

node:
    ip_port =
    ip_address = 192.168.201.3
    number = 2
    name = dbcl1n3
    cluster = dbcl1

node:
    ip_port =
    ip_address = 192.168.201.4
    number = 3
    name = dbcl1n4
    cluster = dbcl1

cluster:
    node_count = 4
    name = dbcl1

But is there a way to get nodes 0 and 1 to dynamically accept the addition of nodes 2 and 3? Everything I find seems to indicate I have to unmount, run /etc/init.d/ocfs2 stop, /etc/init.d/o2cb restart and then /etc/init.d/ocfs2 start. Is there no way of telling o2cb there are two new nodes? Like a /etc/init.d/o2cb reconfigure?

ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441 4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
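When cluster.conf is edited by hand, as in this message, one easy thing to get wrong is leaving node_count out of step with the number of node stanzas. A minimal sketch of a consistency check follows, assuming the layout shown in the post (a "node:" or "cluster:" header followed by "key = value" lines). The parser and function names are illustrative and not part of ocfs2-tools, and the empty ip_port values are kept as they appear in the archive.

```python
def parse_cluster_conf(text):
    """Parse cluster.conf-style text into a list of stanza dictionaries."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "node:" or "cluster:" starts a new stanza.
            current = {"type": line[:-1], "attrs": {}}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current["attrs"][key.strip()] = value.strip()
    return stanzas


def check_node_count(stanzas):
    """Return (matches, declared, actual) for node_count versus node stanzas."""
    nodes = [s for s in stanzas if s["type"] == "node"]
    clusters = [s for s in stanzas if s["type"] == "cluster"]
    if not clusters:
        raise ValueError("no cluster stanza found")
    declared = int(clusters[0]["attrs"]["node_count"])
    return declared == len(nodes), declared, len(nodes)


def sample_conf():
    """Rebuild the four-node configuration quoted in this message."""
    parts = []
    for number in range(4):
        parts.append(
            "node:\n"
            "\tip_port =\n"
            f"\tip_address = 192.168.201.{number + 1}\n"
            f"\tnumber = {number}\n"
            f"\tname = dbcl1n{number + 1}\n"
            "\tcluster = dbcl1\n"
        )
    parts.append("cluster:\n\tnode_count = 4\n\tname = dbcl1\n")
    return "\n".join(parts)


if __name__ == "__main__":
    ok, declared, actual = check_node_count(parse_cluster_conf(sample_conf()))
    print(f"node_count={declared}, node stanzas={actual}, consistent={ok}")
```

A check like this only validates the file. It says nothing about whether the running cluster stack has picked up the new nodes, which is the actual question being asked.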
RE: [Ocfs2-users] Hi
-Original Message- From: [EMAIL PROTECTED] [mailto:ocfs2-users- [EMAIL PROTECTED] On Behalf Of Sunil Mushran Sent: 05/07/2007 10:47 To: Alexei_Roudnev Cc: Ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Hi None of what you have written allows you to use our resources to spread your opinions as official recommendation. Alexei_Roudnev wrote: Oracle itself have not a SINGLE opinion (to be curious, I hear a strong recommendation against OCFSv2 from oracle support, which I can not agree with), so we can't treat your recommendations as official as well - you are interested in OCFSv2 while users are not (users are interested in making our data centers run smoothly). The only _official_ thing is _certification matrix_. - Original Message - Alexei, While you are free to use this forum to share your opinions, do not couch these opinions as official recommendations. When push comes to shove, we are helping users not you. We develop, build, distribute the software, not you. So it may serve to community better if you let us offer the official recommendations and not you. Sunil Just to add some comments from a user of Oracle 9i with OCFSv1 on RedHat AS2.1 who tried to upgrade to EL4 and OCFSv2 and failed miserable: Oracle support pretty much told us the problems we were running into are problems of OCFSv2 and they weren't really willing to help us. The feeling we were getting was that two Oracle departments (the one writing the Database RAC engine and the one writing OCFSv2) are fighting with each other. In general I have a very low opinion of Oracle and their quality of code and tools. Like patch revision numbering? Does not exist. Patch tools suppose to patch all machines in clusters? You wish. Decent error messages? They never heard about that. We ended up with staying on AS2.1 and OCFSv1 for now and just migrating our data to a new SAN. Regards, Ulf. 
RE: [Ocfs2-users] Some questions about ocfs2
-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
Sent: 04/25/2007 12:16
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Some questions about ocfs2

What's the blocksize?

Block Size Bits: 12 Cluster Size Bits: 14

Ulf Zimmermann wrote:

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
Sent: 04/25/2007 10:31
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Some questions about ocfs2

# debugfs.ocfs2 -R "stats -h" /dev/sdy2 | grep "Cluster Size"
Block Size Bits: 12 Cluster Size Bits: 17

12 = 4K
17 = 128K

Have you tried stracing the process?

# strace -tt -T -o /tmp/strace.out ...

Yes, strace shows that most of the time is spent in lstat64 ( 99%), where the average execution time on ext3 is 60 usecs/call, while on the ocfs2 volume it is 500 usecs/call.

Ulf Zimmermann wrote:

Is there a way to see how a file system was formatted, i.e. the block size and cluster size?

I currently have a 2TB file system, of which about 840GB are in use by around 9 million image files. The average size of these images is 60-100KB. Currently our production servers still have separate file systems on ext3, and we are doing a nightly rsync from there to this ocfs2 volume. This currently takes ~6 hours, which seems a tad slow. The system spends most of its time writing files which have changed on the production servers, with high I/O wait. The SAN this ocfs2 volume is on is pretty much idle; I only see up to about 20MB/sec of traffic, and the two nodes which have this volume mounted have a private GigE interconnect set up for cluster.conf.

Any tips on how to debug where this slowness comes from? Or even a suggestion to use another cluster file system for a scenario like this?

Regards, Ulf.
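For readers following the numbers in this thread, here is a small Python sketch of the arithmetic. The "bits" values are exponents of two, and the lstat64 estimate multiplies the per-call times from the strace run by the file count. It assumes exactly one lstat64 call per file per rsync run, which the thread does not confirm, and the helper names are invented for this illustration.

```python
def size_from_bits(bits):
    """Convert a 'Size Bits' value from the stats output into bytes."""
    return 1 << bits


def total_lstat_seconds(file_count, usecs_per_call):
    """Total time spent in lstat64, assuming ONE call per file (an assumption)."""
    return file_count * usecs_per_call / 1_000_000


FILES = 9_000_000  # "around 9 million image files" from the original post

print(size_from_bits(12))  # 4096   -> 4K block size
print(size_from_bits(14))  # 16384  -> 16K cluster size on the volume in question
print(size_from_bits(17))  # 131072 -> 128K cluster size in Sunil's example

print(total_lstat_seconds(FILES, 500) / 60)  # 75.0 minutes at 500 usecs/call (ocfs2)
print(total_lstat_seconds(FILES, 60) / 60)   # 9.0 minutes at 60 usecs/call (ext3)
```

Under that one-call-per-file assumption, lstat64 on the ocfs2 side accounts for roughly 75 minutes of the ~6 hour run, so either rsync issues more than one call per file or other waits contribute as well.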