[Ocfs2-users] A bug in OCFS2 umount

2013-07-26 Thread Ulf Zimmermann
It looks like I found a bug related to unmounting. umount lets you specify either 
the directory or the device name to unmount a file system. On at least the OEL 6.3 
server I am working on, unmounting by device name produces a message 
saying that the device is not mounted. This only happens with OCFS2 file systems; 
it works correctly with ext3/4 ones.
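
One thing that may be worth checking when reporting this is how the device is recorded in the mounts table compared with the name passed to umount. The thread does not confirm the cause, so the following is only a sketch under the assumption that an alias mismatch (for example a /dev/mapper name versus its canonical device node) is involved. `mounts_for_device` is a hypothetical helper, not part of util-linux or ocfs2-tools.

```shell
#!/bin/sh
# Hypothetical helper (not part of util-linux or ocfs2-tools): print the
# mount point and file system type recorded for a device in a mounts table,
# comparing canonical paths so that symlinked aliases of one device match.
# Note: paths containing spaces are escaped in /proc/mounts and are not
# handled by this sketch.
mounts_for_device() {
    want=$(readlink -f "$1") || return 1
    table=${2:-/proc/mounts}
    while read -r src mnt fstype _rest; do
        case $src in
            /*)
                # Only sources that are paths can be canonicalised.
                if [ "$(readlink -f "$src")" = "$want" ]; then
                    printf '%s %s\n' "$mnt" "$fstype"
                fi
                ;;
        esac
    done < "$table"
    return 0
}

# Example (the device name is the one used elsewhere in this archive):
# mounts_for_device /dev/mapper/aucp_data_bk_2_x
```

If the helper prints a mount point while umount by device name still claims the device is not mounted, that output would be useful to attach to a bug report.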

Ulf.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slighly OT)

2013-07-10 Thread Ulf Zimmermann
So far we only have OEL Network, but no support, as we are still in the 
investigation phase of switching to OEL from RHEL.

So no support via that route yet.


From: Mihail Daskalov [mailto:mdaska...@technologica.com]
Sent: Wednesday, July 10, 2013 07:55
To: Sunil Mushran; Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: RE: [Ocfs2-users] Problems with volumes coming from RHEL5 going to 
OEL6 (slighly OT)

Hi Sunil,
Regarding the ocfs tools version 1.8.0 you should know best what it was meant 
to be (maybe not true for 1.8.0-10 in OEL6U3).

Is it possible that the tag for 1.8.0 disappeared from the git repository? Or 
there was never a tag for 1.8.0 ?

Below is the link to the commit in the 1.8.2 tag that brings the version to 1.8.0:

https://oss.oracle.com/git/?p=ocfs2-tools.git;a=commitdiff;h=2480a215a600050d2bf923044dffac91439d982a;hp=8b5f4ad727e019cb557c4b516ab401c15c5c317e

and later on another commit that brings the version to 1.8.2:
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=commitdiff;h=560a1e60936fe868b00cfc9cad5def726e10828e

I am sorry I am not actually helping with Ulf's problem.
Ulf, maybe you can follow the head version and try to find an explanation 
of the error message.
Anyway, I think it would be best to open an SR with Oracle if you have a Linux 
support contract.

Does anyone know how to find the git repository for at least some packages 
in Oracle Linux? I know the source for each package is available as .src.rpm, 
but how could I see the changes, or the tag from which each version was built?

I remember Wim talking about something like that a while ago (saying Oracle is 
not like Red Hat, mangling changelogs), but I can't find the article right now.

If you find out what is behind ocfs2-tools 1.8.0-10 it would be easier to track 
the problem.

Regards,
Mihail Daskalov


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, July 10, 2013 2:11 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

The error does not make sense. Also I don't know what 1.8.0 tools means. I 
cannot see that label in the src tree.
https://oss.oracle.com/git/?p=ocfs2-tools.git;a=summary
One option is to build the tools from the head.

On Tue, Jul 9, 2013 at 2:25 PM, Ulf Zimmermann u...@openlane.com wrote:
Sunil, any suggestions on this?


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Saturday, June 22, 2013 15:20
To: Sunil Mushran
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

[root@co-db03 ulf]# debugfs.ocfs2 -R stats /dev/mapper/aucp_data_bk_2_x
Revision: 0.90
Mount Count: 0   Max Mount Count: 20
State: 0   Errors: 0
Check Interval: 0   Last Check: Sun Sep 25 05:32:29 2011
Creator OS: 0
Feature Compat: 0
Feature Incompat: 0
Tunefs Incomplete: 0
Feature RO compat: 0
Root Blknum: 513   System Dir Blknum: 514
First Cluster Group Blknum: 256
Block Size Bits: 12   Cluster Size Bits: 20
Max Node Slots: 10
Extended Attributes Inline Size: 0
Label: /export/backuprecovery.AUCP
UUID: 5F9C2727159743529200CE9C5E155562
Hash: 0 (0x0)
DX Seeds: 0 0 0 (0x 0x 0x)
Cluster stack: classic o2cb
Cluster flags: 0
Inode: 2   Mode: 00   Generation: 3147295185 (0xbb97e9d1)
FS Generation: 3147295185 (0xbb97e9d1)
CRC32:    ECC: 
Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
Dynamic Features: (0x0)
User: 0 (root)   Group: 0 (root)   Size: 0
Links: 0   Clusters: 1572864
ctime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
atime: 0x0 0x0 -- Wed Dec 31 16:00:00.0 1969
mtime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
dtime: 0x0 -- Wed Dec 31 16:00:00 1969
Refcount Block: 0
Last Extblk: 0   Orphan Slot: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 65535


From: Sunil Mushran [mailto:sunil.mush...@gmail.com]
Sent: Friday, June 21, 2013 11:11
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

Can you dump the following using the 1.8 binary.
debugfs.ocfs2 -R stats /dev/mapper/.

On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:
We have a production cluster of 6 nodes, which are currently running RHEL 5.8

Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6 (slighly OT)

2013-07-10 Thread Ulf Zimmermann
I will see what I can do. How large would an o2image be?

To reiterate, these are not new file systems. They were created with 
ocfs2-2.6.9-55.ELsmp-1.2.9-1.el4 and ocfs2-tools-1.2.7-1.el4 under RHEL 4. The 
primary user of these volumes is a cluster of 6 nodes running RHEL 5.8 with 
ocfs2-2.6.18-308.11.1.el5-1.4.10-1 and ocfs2-tools-1.6.3-2.el5. Another 
machine, which still runs the same EL4 binaries, mounts these snap-cloned 
volumes daily, performs operations on the DB files and then copies the data off.



From: Herbert van den Bergh [mailto:herbert.van.den.be...@oracle.com]
Sent: Wednesday, July 10, 2013 09:54
To: Mihail Daskalov
Cc: Sunil Mushran; Ulf Zimmermann; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to 
OEL6 (slighly OT)

It's possible that the 1.8.0 tag was never created in the ocfs2-tools git 
repository.  But it would not be of any use anyway.  If you check the changelog of the 
ocfs2-tools rpm, you'll see that there were many patches since 1.8.0, so the 
1.8.0-10 version that Ulf is using would be very different from a 1.8.0 tag in 
git.

Ulf, I suggest you create an o2image of the bad filesystem, and see if the 
problem can be reproduced with that image.  If it can, then you may want to 
make that o2image available to the OCFS2 developers so they can debug 
ocfs2-tools to see what is causing the malloc/free error.  You may also want to 
include the exact steps to take to reproduce this, starting from the mkfs up to 
the failure, indicating exactly what versions of kernel and tools were used 
along the way.

Thanks,
Herbert.


Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

2013-07-09 Thread Ulf Zimmermann
Sunil, any suggestions on this?



Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

2013-06-22 Thread Ulf Zimmermann
[root@co-db03 ulf]# debugfs.ocfs2 -R stats /dev/mapper/aucp_data_bk_2_x
Revision: 0.90
Mount Count: 0   Max Mount Count: 20
State: 0   Errors: 0
Check Interval: 0   Last Check: Sun Sep 25 05:32:29 2011
Creator OS: 0
Feature Compat: 0
Feature Incompat: 0
Tunefs Incomplete: 0
Feature RO compat: 0
Root Blknum: 513   System Dir Blknum: 514
First Cluster Group Blknum: 256
Block Size Bits: 12   Cluster Size Bits: 20
Max Node Slots: 10
Extended Attributes Inline Size: 0
Label: /export/backuprecovery.AUCP
UUID: 5F9C2727159743529200CE9C5E155562
Hash: 0 (0x0)
DX Seeds: 0 0 0 (0x 0x 0x)
Cluster stack: classic o2cb
Cluster flags: 0
Inode: 2   Mode: 00   Generation: 3147295185 (0xbb97e9d1)
FS Generation: 3147295185 (0xbb97e9d1)
CRC32:    ECC: 
Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
Dynamic Features: (0x0)
User: 0 (root)   Group: 0 (root)   Size: 0
Links: 0   Clusters: 1572864
ctime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
atime: 0x0 0x0 -- Wed Dec 31 16:00:00.0 1969
mtime: 0x4e7f1f5d 0x0 -- Sun Sep 25 05:32:29.0 2011
dtime: 0x0 -- Wed Dec 31 16:00:00 1969
Refcount Block: 0
Last Extblk: 0   Orphan Slot: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 65535
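
When comparing a volume across tool versions, the feature fields are usually the first thing to look at in this output. As a small sketch, the filter below keeps only those lines. `ocfs2_feature_fields` is a hypothetical name, and the field labels are taken from the output shown above, so other tool versions may label them differently.

```shell
#!/bin/sh
# Hypothetical filter: read "debugfs.ocfs2 -R stats" output on stdin and
# keep only the feature-related lines, using the labels seen in this thread.
ocfs2_feature_fields() {
    grep -E '^(Feature (Compat|Incompat|RO compat)|Tunefs Incomplete):'
}

# Example, using the command from this thread:
# debugfs.ocfs2 -R stats /dev/mapper/aucp_data_bk_2_x | ocfs2_feature_fields
```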


From: Sunil Mushran [mailto:sunil.mush...@gmail.com]
Sent: Friday, June 21, 2013 11:11
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

Can you dump the following using the 1.8 binary.
debugfs.ocfs2 -R stats /dev/mapper/.

On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:
We have a production cluster of 6 nodes, which are currently running RHEL 5.8 
with OCFS2 1.4.10. We snapclone these volumes to multiple destinations, one of 
them is a RHEL4 machine with OCFS2 1.2.9. Because of that the volumes are set 
so that we can read them there.

We are now trying to bring up a new server; this one has OEL 6.3 on it and 
comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2 
--cloned-volume to reset the UUID, but when I try to change the label I get:

[root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP 
/dev/mapper/aucp_data_bk_2_x
tunefs.ocfs2: Invalid name for a cluster while opening device 
/dev/mapper/aucp_data_bk_2_x

fsck.ocfs2 core dumps with the following, I also filed a bug on Bugzilla for 
that:

[root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x
fsck.ocfs2 1.8.0
*** glibc detected *** fsck.ocfs2: double free or corruption (fasttop): 
0x0197f320 ***
=== Backtrace: =
/lib64/libc.so.6[0x3656475366]
fsck.ocfs2[0x434c31]
fsck.ocfs2[0x403bc2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
fsck.ocfs2[0x402879]
=== Memory map: 
0040-0045 r-xp  fc:00 12489  
/sbin/fsck.ocfs2
0064f000-00651000 rw-p 0004f000 fc:00 12489  
/sbin/fsck.ocfs2
00651000-00652000 rw-p  00:00 0
0085-00851000 rw-p 0005 fc:00 12489  
/sbin/fsck.ocfs2
0197e000-0199f000 rw-p  00:00 0  [heap]
3655c0-3655c2 r-xp  fc:00 8797   
/lib64/ld-2.12.so
3655e1f000-3655e2 r--p 0001f000 fc:00 8797   
/lib64/ld-2.12.so
3655e2-3655e21000 rw-p 0002 fc:00 8797   
/lib64/ld-2.12.so
3655e21000-3655e22000 rw-p  00:00 0
365640-3656589000 r-xp  fc:00 8798   
/lib64/libc-2.12.so
3656589000-3656788000 ---p 00189000 fc:00 8798   
/lib64/libc-2.12.so
3656788000-365678c000 r--p 00188000 fc:00 8798   
/lib64/libc-2.12.so
365678c000-365678d000 rw-p 0018c000 fc:00 8798   
/lib64/libc-2.12.so
365678d000-3656792000 rw-p  00:00 0
3659c0-3659c16000 r-xp  fc:00 8802   
/lib64/libgcc_s-4.4.6-20120305.so.1
3659c16000-3659e15000 ---p 00016000 fc:00 8802   
/lib64/libgcc_s-4.4.6-20120305.so.1
3659e15000-3659e16000 rw-p 00015000 fc:00 8802   
/lib64/libgcc_s-4.4.6-20120305.so.1
3d3e80-3d3e817000 r-xp  fc:00 12028  
/lib64/libpthread-2.12.so
3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028  
/lib64/libpthread-2.12.so
3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028  
/lib64/libpthread-2.12.so

Re: [Ocfs2-users] Unable to Install Ocfs2 in Oracle Linux 5 Machine.

2013-03-20 Thread Ulf Zimmermann
To be more exact than the other replies:

You are trying to install 3 packages:

ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm
ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64.rpm
ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64.rpm

The el5debug package is only needed if you are running the Debug kernel, most 
people will not run that one.

The el5xen package is for the kernel with XEN support. 

Since there is no error message for the first package, you only need to 
install that particular package.

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com
 [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of maanas
 Sent: Monday, March 18, 2013 22:50
 To: ocfs2-users@oss.oracle.com
 Cc: Ranga Babu; Raju Pasagadi
 Subject: [Ocfs2-users] Unable to Install Ocfs2 in Oracle Linux 5 Machine.
 
 Hi,
 
 I was trying to install OCFS 2 in my Oracle Linux 5 Machine.
 
 I created a mount point for cluster file system:
 
 # mkdir /u02
 
 My Kernel version is :
 # uname -r
 2.6.18-308.4.1.0.1.el5xen
 
 I am downloading the appropriate version of the kernel module from this
 location:
 
 https://oss.oracle.com/projects/ocfs2/files/RedHat/RHEL5/x86_64/1.4.10-1/2.6.18-308.4.1.el5/
 
 
 When I am trying to execute this cmd for installing :
 
 # rpm -Uvh ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm
 ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64.rpm
 ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64.rpm
 
 I am getting this error:
 
 warning: ocfs2-2.6.18-308.4.1.el5-1.4.10-1.el5.x86_64.rpm: Header V3 DSA
 signature: NOKEY, key ID 1e5e0159
 error: Failed dependencies:
  kernel-debug = 2.6.18-308.4.1.el5 is needed by
 ocfs2-2.6.18-308.4.1.el5debug-1.4.10-1.el5.x86_64
  kernel-xen = 2.6.18-308.4.1.el5 is needed by
 ocfs2-2.6.18-308.4.1.el5xen-1.4.10-1.el5.x86_64
 
 Can anyone point out what error is there in this process?
 
 Thanks,
 Maanas.
 


Re: [Ocfs2-users] Subject: problem configuring ocfs on rhel5.8 kernel 2.6.18-300.el5

2012-09-12 Thread Ulf Zimmermann
What is your kernel version? Run "uname -r".

My guess is that your kernel is not 2.6.18-238.9.1.el5 nor 2.6.18-308.1.1.el5

You will need to install the matching ocfs2-2.6.18-300.el5-1.4.9-1.el5.x86_64 
package.
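
The matching rule can be sketched as follows, going by the package names quoted in these threads, where the running kernel release appears right after the `ocfs2-` prefix. Both function names are hypothetical, and the check for an `ocfs2.ko` file under /lib/modules is an assumption about where the module package installs its files.

```shell
#!/bin/sh
# Hypothetical helper: name prefix of the ocfs2 kernel-module package that
# would match a kernel release (defaults to the running kernel).
ocfs2_pkg_prefix() {
    printf 'ocfs2-%s\n' "${1:-$(uname -r)}"
}

# Hypothetical helper: succeed if an ocfs2 module file exists for the given
# kernel release under the modules root (default /lib/modules).
has_ocfs2_module() {
    release=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    [ -n "$(find "$root/$release" -name 'ocfs2.ko*' 2>/dev/null)" ]
}

# Example:
# ocfs2_pkg_prefix                 # prefix to look for on oss.oracle.com
# rpm -qa | grep '^ocfs2-'         # packages actually installed
# has_ocfs2_module || echo "no ocfs2 module for $(uname -r)"
```

If the installed package names do not start with the printed prefix, modprobe will not find the modules for the running kernel, which would be consistent with the "Module ocfs2_stackglue not found" messages quoted below.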


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Asanka Gunasekera
Sent: Wednesday, September 12, 2012 12:32 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Subject: problem configuring ocfs on rhel5.8 kernel 
2.6.18-300.el5

Hi, this is a resend from a subscribed address; sorry if I am causing any 
inconvenience.

I hope someone can help me with this; I have been struggling to get this working 
for a few weeks now. My issue is as below.

I am just trying to use ocfs2 as a shared file system between the nodes of a 
2-node HA cluster for an application that runs on these nodes.

I have downloaded below packages

ocfs2-2.6.18-238.9.1.el5-1.4.9-1.el5.x86_64.r
and
ocfs2-2.6.18-308.1.1.el5-1.4.10-1.el5.x86_64

Installation goes without any complaints, but when it's time to configure I get 
the errors below

[root@ccbsn01 ~]# /etc/init.d/o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
ENTER without typing an answer will keep that current value. Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter none to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [3]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
Loading filesystem ocfs2_dlmfs: Unable to load filesystem ocfs2_dlmfs
Failed

And in the /var/log/messages log I get the errors below

Sep 12 11:21:41 node01 modprobe: FATAL: Module ocfs2_stackglue not found.
Sep 12 11:21:41 node01 modprobe: FATAL: Module ocfs2_dlmfs not found.
Sep 12 11:33:13 node01 modprobe: FATAL: Module ocfs2_stackglue not found.
Sep 12 11:33:13 node01 modprobe: FATAL: Module ocfs2_dlmfs not found.

How can I fix this and get this working

Thanks and Best Regards

Re: [Ocfs2-users] RPM packages for EL5 U8 kernel 2.6.18-308.11.1.el5?

2012-07-30 Thread Ulf Zimmermann
 Latest packages at oss.oracle.com is 308.8.1.el5, any plans to provide 
 packages for 308.11.1.el5?

Never mind, there is OCFS2 1.4.10, which has the packages.


[Ocfs2-users] RPM packages for EL5 U8 kernel 2.6.18-308.11.1.el5?

2012-07-27 Thread Ulf Zimmermann
The latest packages at oss.oracle.com are for 308.8.1.el5. Are there any plans to 
provide packages for 308.11.1.el5?

Ulf.

Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

2011-09-27 Thread Ulf Zimmermann
 -Original Message-
 From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
 Sent: Monday, September 26, 2011 10:09 AM
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on
 EL5
 
 I'll look at the tunefs issue. But the other one does not make sense.
 strict_jbd is a compat flag. Mount should work. What is the mount
 error? As in, in dmesg.

I don't see anything in dmesg or /var/log/messages, but the error I saw was from tunefs:

demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 
/dev/mapper/u07 
tunefs.ocfs2 1.2.7
tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device 
/dev/mapper/u07


 


Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

2011-09-27 Thread Ulf Zimmermann
 -Original Message-
 From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
 Sent: Tuesday, September 27, 2011 9:27 AM
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on
 EL5
 
 On 09/27/2011 09:12 AM, Ulf Zimmermann wrote:
  -Original Message-
  From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
  Sent: Monday, September 26, 2011 10:09 AM
  To: Ulf Zimmermann
  Cc: ocfs2-users@oss.oracle.com
  Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
 
  I'll look at the tunefs issue. But the other one does not make sense.
  strict_jbd is a compat flag. Mount should work. What is the mount
  error? As in, in dmesg.
 
  I don't see any dmesg or /var/log/messages, but the error I saw was from tunefs:
 
  demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 /dev/mapper/u07
  tunefs.ocfs2 1.2.7
  tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device /dev/mapper/u07
 
 
 
 So that is correct. In short that flag was added to allow us to use the
 jbd(2) features. We use this to create volumes > 16TB.
 
 I guess if you want to use with 1.2, format it with 1.2 tools.

That is what I ended up with. And I also made a point to certain people in the 
company about not stopping in the middle of upgrading database servers.


Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

2011-09-25 Thread Ulf Zimmermann
As tunefs.ocfs2 wasn't working for us, I tried to mkfs.ocfs2 the volumes again 
with --fs-feature-level=max-compat. This still turns on strict-journal-super, 
and there seems to be no way around this? This makes the volume incompatible with 
OCFS2 1.2.9.

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
 Sent: Sunday, September 25, 2011 1:43 AM
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
 
 We are running into a problem which looks like the same we had with
 fsck.ocfs2 a while back. This is with ocfs2-tools 1.4.4. I am trying to use
 tunefs.ocfs2 to turn off some features. The program starts up but then starts
 eating all available memory and more and the system starts to swap like crazy
 in and out. This is exactly the same behavior as the fsck.ocfs2 for which we
 were given a patched binary.
 
 I tried to compile the tunefs.ocfs2 from 1.6.x but the same problem with that
 binary.
 
 Ulf.
 
 


Re: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the OCFS2 file system

2011-04-20 Thread Ulf Zimmermann
While it runs, what is the memory usage? Is the system going into swap?

Our 700GB volume would take half a day or more to finish, because the 
system was constantly swapping.
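
A quick way to answer that question while fsck.ocfs2 is running is to watch swap usage. The sketch below computes the swap currently in use from the `SwapTotal` and `SwapFree` fields of /proc/meminfo. `swap_used_kb` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the amount of swap in use, in kB, from a
# meminfo-style file (default /proc/meminfo).
swap_used_kb() {
    awk '$1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}

# Example: sample every 10 seconds while the check runs in another terminal.
# while sleep 10; do swap_used_kb; done
```

A steadily growing figure here, together with non-zero si/so columns in `vmstat 5`, would suggest the check is swap-bound rather than hung.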

From: Sergey Prilutsky [mailto:sprilut...@hotmail.com]
Sent: Wednesday, April 20, 2011 5:22 AM
To: Ulf Zimmermann; ocfs2-users@oss.oracle.com
Subject: RE: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the 
OCFS2 file system

Hi there,

No, we are using the earlier version - 1.4.1:
[root@box1]# rpm -qa |grep ocfs2-tools
ocfs2-tools-debuginfo-1.4.1-1.el5.x86_64
ocfs2-tools-1.4.1-1.el5.x86_64


Thanks
  Sergey Prilutsky





From: u...@openlane.com
To: sprilut...@hotmail.com; ocfs2-users@oss.oracle.com
Date: Tue, 19 Apr 2011 14:09:31 -0700
Subject: RE: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the 
OCFS2 file system
If you are using ocfs2-tools-1.4.4, there is a bug with memory usage. Sunil 
provided me a patched version, which is AFAIK still not in the downloadable 
tools version.


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sergey Prilutsky
Sent: Tuesday, April 19, 2011 11:30 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Linux RHEL 5.2 hangs for 1.5 hrs while fsck'ing the 
OCFS2 file system

Hi there,
A month ago we ran into an fsck issue while rebooting one of the Oracle RAC 
nodes running Linux RHEL 5.2. It hung for 1.5 hours.
During the reboot, the OS portion went fine, then it activated the data volumes in 
all data VGs with [OK].
Then it displayed the message "Checking filesystems", which took 1.5 hrs, and then it 
finished the reboot.

Last weekend we rebooted the same box and faced the same issue. This time we 
sent a break, commented out all OCFS2 lines in /etc/fstab (setting the last field 
to 0 in /etc/fstab did not work either), and it booted fine. Then we mounted them 
with mount -a and life was good.

Later on we also added fastboot to grub.conf and then booted it: no 
problem, for obvious reasons.

If anyone has experienced the same issue, would you mind shedding some light 
and sharing your experience and perhaps the fix?
Thanks
  Sergey Prilutsky

[Ocfs2-users] What could cause slow down betwen OCFS2 1.2.9 and 1.4.4

2011-03-11 Thread Ulf Zimmermann
We upgraded our production database cluster (6 nodes) from EL4 Update 5 to EL5 
Update 5, including upgrading OCFS2 from 1.2.9 to 1.4.4.

We are now noticing a slowdown of batch jobs in Oracle, while hot backup runs 
faster. One thing we saw is that the journal mode changed from writeback to 
ordered, as we don't specify a journal mode at mount time. Oracle sees this 
slowdown as higher I/O latency, going from 6-8ms to 13-15ms for single-block 
I/O. Total I/O throughput has dropped.

Can this be caused by the journal mode being ordered?

Ulf.




Re: [Ocfs2-users] What could cause slow down between OCFS2 1.2.9 and 1.4.4

2011-03-11 Thread Ulf Zimmermann
Another change we found is that we previously used the deadline I/O scheduler. 
We are taking downtime tonight to change the scheduler and the journal mode.
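
Both settings mentioned here can be inspected before and after the change. The
following is a minimal Python sketch, assuming a Linux host where /proc/mounts
lists a data= option for the volume and where /sys/block/<dev>/queue/scheduler
marks the active scheduler with square brackets. The function names are
hypothetical and are not part of any OCFS2 tool.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets] in a queue/scheduler file."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def journal_mode(mount_options):
    """Return the data= value from a comma-separated mount option string."""
    for option in mount_options.split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None  # not listed, so the filesystem default applies


def ocfs2_journal_modes(mounts_text):
    """Map each ocfs2 mount point in /proc/mounts-style text to its data= mode."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "ocfs2":
            modes[fields[1]] = journal_mode(fields[3])
    return modes
```

Feeding it the contents of /proc/mounts on each node, before and after the
remount, would show whether a volume is reporting ordered or writeback.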


Ulf Zimmermann | Senior System Architect
OPENLANE, Inc | 2200 Bridge Parkway, Suite 202
Redwood City, CA 94065 | (650) 412-4042
u...@openlane.com


On Mar 11, 2011, at 14:32, Sunil Mushran sunil.mush...@oracle.com wrote:

 Verifying the journal mode is easy enough. Remount with data=writeback.
 It can be done one node at a time.
 
 But since you upgraded from 4.5 to 5.5, you may have to cast a wider net
 considering the entire kernel also changed.
 


Re: [Ocfs2-users] Reservation conflicts

2010-12-09 Thread Ulf Zimmermann

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Joel Becker
 Sent: Thursday, December 09, 2010 2:54 PM
 To: brad hancock
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Reservation conflicts
 
 On Thu, Dec 09, 2010 at 04:45:25PM -0600, brad hancock wrote:
  Yeah, both guests have the same hard drive attached, with the virtual
  SCSI controller configured as Physical, which sets a policy that allows
  a virtual disk to be used simultaneously by multiple virtual machines.
 
  as /dev/sdb1
 
   It sure seems like VMWare is caching some data somewhere.
 That's my best guess.  These are on the same host, right?
 
 Joel

I have configured:

SCSI Controller 1 Virtual (Virtual disks can be shared between any virtual 
machines on the same server.)
Disks are configured as Independent.

That works for my test cluster using OCFS inside of Vmware.




Re: [Ocfs2-users] heartbeat and slot issues.

2010-11-24 Thread Ulf Zimmermann
After the clone, you probably want to run tunefs.ocfs2 -U to reset the UUID. 
This is one of the steps we do when cloning volumes for database refreshes.


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of brad hancock
Sent: Wednesday, November 24, 2010 12:35 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] heartbeat and slot issues.

I set up a host with an ocfs partition on a SAN, then cloned that host to 
another and renamed it. Both machines mount their ocfs partitions but give the 
following errors.


Host that was cloned:
(1888,0):o2hb_do_disk_heartbeat:762 ERROR: Device sdb1: another node is 
heartbeating in our slot!
[345413.242260] sd 1:0:0:0: reservation conflict
[345413.242270] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK,SUGGEST_OK
[345413.242274] end_request: I/O error, dev sdb, sector 1735
[345413.242536] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[345413.242788] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[345413.243159] sd 1:0:0:0: reservation conflict
[345413.243163] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK,SUGGEST_OK
[345413.243166] end_request: I/O error, dev sdb, sector 1735
[345413.243401] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[345413.243639] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[448460.370132] sd 1:0:0:0: reservation conflict
[448460.370145] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK,SUGGEST_OK
[448460.370149] end_request: I/O error, dev sdb, sector 1735
[448460.370395] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[448460.370638] (1888,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5


Clone:

 sd 1:0:0:0: reservation conflict
[17643.588011] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK,SUGGEST_OK
[17643.588011] end_request: I/O error, dev sdb, sector 1735
[17643.588011] (0,0):o2hb_bio_end_io:225 ERROR: IO Error -5
[17643.588011] (1859,0):o2hb_do_disk_heartbeat:753 ERROR: status = -5
[17643.588011] sd 1:0:0:0: reservation conflict

This didn't seem to be a problem, but I'm noticing the hosts are no longer 
seeing the same data. I unmounted the drives and remounted them, and they were 
the same again.


Thanks for any guidance,


cat /etc/ocfs2/cluster.conf
node:
ip_port = 
ip_address = 10.x.x.248
number = 0
name = smes01
cluster = ocfs2

node:
ip_port = 
ip_address = 10.x.x.249
number = 1
name = smes02
cluster = ocfs2

cluster:
node_count = 2
name = ocfs2

cluster.conf same on both hosts.


Re: [Ocfs2-users] heartbeat and slot issues.

2010-11-24 Thread Ulf Zimmermann
Then my guess is that you haven't cloned the volume; both hosts are seeing the 
same one.
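
The reasoning can be sketched in Python: if the UUID that was just set on one
node is also what the other node reports, the two hosts are most likely
reading the same volume rather than separate clones. This is only an
illustration, assuming the KEY=VALUE output format of the tunefs.ocfs2 query
quoted below; the function names are hypothetical.

```python
def parse_query_output(text):
    """Parse KEY=VALUE lines (as in the quoted -Q query output) into a dict."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            result[key.strip()] = value.strip()
    return result


def other_node_sees_new_uuid(new_uuid, other_node_output):
    """True if the other node reports the UUID that was just set elsewhere."""
    reported = parse_query_output(other_node_output).get("UUID")
    return reported is not None and reported.lower() == new_uuid.lower()
```

With the output quoted below, node2 reports the UUID that was set on node1, so
the check would return True.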


From: brad hancock [mailto:braddhanc...@gmail.com]
Sent: Wednesday, November 24, 2010 1:54 PM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] heartbeat and slot issues.

Thanks for the response.
Is it normal that when I change it on one node, the other node reflects the 
same UUID?


node1:
tunefs.ocfs2 -q -Q "BS=%5B\nUUID=%U\n" /dev/sdb1
BS= 4096
UUID=ea0778bd-bdaa-44af-8fbf-cb4a5d85e79f


node2:
tunefs.ocfs2 -q -Q "BS=%5B\nUUID=%U\n" /dev/sdb1
BS= 4096
UUID=ea0778bd-bdaa-44af-8fbf-cb4a5d85e79f




Re: [Ocfs2-users] some beginner questions

2010-07-14 Thread Ulf Zimmermann
OCFS2 requires shared storage. Is /dev/sdb a shared device?


 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Alexander Nagel
 Sent: Wednesday, July 14, 2010 12:03 PM
 To: Ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] some beginner questions
 
 Hi,
 
 I'm new to the ocfs2 filesystem and I have some questions about it.
 
 I installed three servers according to the user guide from
 http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.4/ocfs2-1_4-usersguide.pdf
 
 On every single server I have a working ocfs2 partition:
 /dev/sdb1 on /mnt/oc1 type ocfs2 (rw,_netdev,heartbeat=local)
 
 As I understand the ocfs2 system, I can now use these partitions as a
 single data store. But it doesn't work. When I create a file or
 directory on the ocfs2 partition, it doesn't appear on the other
 servers.
 
 SERVER01:
 server01:~# echo sdhfksjdhfskhskjgh > /mnt/oc1/testfile
 server01:~# cat /mnt/oc1/testfile
 sdhfksjdhfskhskjgh
 server01:~# ls -lh /mnt/oc1/
 total 0
 drwxr-xr-x 2 root root 3,9K 13. Jul 16:53 lost+found
 -rw-r--r-- 1 root root   19 14. Jul 20:48 testfile
 
 
 SERVER02
 server02:~# ls -lh /mnt/oc1/
 total 0
 drwxr-xr-x 2 root root 3,9K 13. Jul 16:34 lost+found
 
 
 server03 same result
 
 The config file is the same on all three servers. I created it and copied
 it to all three servers with the GUI program, so all servers have
 exactly the same file.
 
 What did I miss? Can somebody give me a hint? Or did I misunderstand
 ocfs2?
 
 thanks
 Alexander
 
 
 --
 
 Alexander Nagel
 E-mail: alexan...@acwn.de
 Homepage:
 http://www.acwn.de/
 http://www.standspur-kadaver.de/
 


Re: [Ocfs2-users] some beginner questions

2010-07-14 Thread Ulf Zimmermann
As Sunil already wrote: iSCSI, FC attached SAN, Firewire (not really 
recommended for production).


 -Original Message-
 From: Alexander Nagel [mailto:alexan...@acwn.de]
 Sent: Wednesday, July 14, 2010 12:44 PM
 To: Ulf Zimmermann
 Cc: Ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] some beginner questions
 
 Hi,
 
 On 14.07.2010 21:08, Ulf Zimmermann wrote:
  OCFS2 requires shared storage. Is /dev/sdb a shared device?
 Thanks for your quick response; that clarifies the situation.
 /dev/sdb is a single hard disk in each server; it is not shared
 storage. I thought that ocfs2 would then work like a single disk.
 What type of storage must this shared device be?
 
 thanks
 Alexander
 
 
 
 --
 
 Alexander Nagel
 E-mail: alexan...@acwn.de
 Homepage:
 http://www.acwn.de/
 http://www.standspur-kadaver.de/



Re: [Ocfs2-users] df showing wrong size

2010-06-28 Thread Ulf Zimmermann
Make sure you don't have deleted files, which are still open. You can use lsof 
to find those.
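
For illustration, the same check can be sketched in Python without lsof. This
assumes a Linux /proc filesystem, where the symlink target of an open but
unlinked file ends in " (deleted)". The helper names are hypothetical, and the
scan only sees processes the caller is allowed to read (run it as root for a
full picture).

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_targets(fd_links):
    """Keep only the link targets that the kernel marks as deleted."""
    return sorted(
        target[: -len(DELETED_SUFFIX)]
        for target in fd_links
        if target.endswith(DELETED_SUFFIX)
    )


def open_fd_links(proc_root="/proc"):
    """Yield the symlink target of every file descriptor readable under /proc."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            for fd in os.listdir(fd_dir):
                yield os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # process exited or permission denied
```

Calling deleted_targets(open_fd_links()) lists the paths that still consume
space; the space is released once the processes holding them close the files
or exit.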


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Garcia, Raymundo
Sent: Sunday, June 27, 2010 11:18 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] df showing wrong size

Hello... it was brought to my attention that a partition on one of our 
production systems is displaying the wrong size with the df command: 123 GB, 
when in fact the size of all the files is a mere 15 GB. What is going on? 
Should we use fsck.ocfs2 to fix that? It is strange...

Thanks for any comment

Raymundo Garcia




Re: [Ocfs2-users] Info on Version Upgrade

2010-06-16 Thread Ulf Zimmermann
We have been running the -55 kernel for a long time on our Oracle servers. 
There is one known issue for us, a memory leak in the kernel in conjunction 
with the HP management agents, but otherwise -55 has been OK for us. There 
have been plenty of fixes in newer kernels, but also traps. When we tried to 
upgrade, we ran into kernel panics triggered by network drivers, and we rolled 
back.


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, June 16, 2010 8:15 AM
To: Martin Eddy
Cc: Ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Info on Version Upgrade

As far as ocfs2 is concerned, the current version of ocfs2 1.2
is ocfs2 1.2.9. You will find the packages for your kernel on
oss.oracle.com. The news section has the list of changes/bugs fixed.

asmlib also has some updates. You can review the fixes to see
whether an upgrade is warranted.

For all other qs, ping Oracle support and/or Red Hat support.

Sunil

On 06/16/2010 08:02 AM, Martin Eddy wrote:
I hope this question can be posted here, if not I do apologize.

I have inherited a production system running at a client site. It is a 2-node 
RAC cluster currently running on an HP EVA 8000 with:

 *   Oracle 10.2.0.3
 *   Red Hat 4.5, kernel 2.6.9-55 (updated to 4.8, but the kernel remains at 
     2.6.9-55)
 *   OCFS2
     *   OCFS2-tools-1.2.7-1.e14.i386
     *   OCFS2console-1.2.7-e14.i386
     *   OCFS2-2.6.9-55.ELsmp-1.2.8-2.e14.i386
 *   ASM
     *   oracleasm-support-2.0.3-1.i386
     *   oracleasm-2.6.9-55.ELsmp-2.0.3-1.i386
     *   oracleasmlib-2.0.2-1.i386

There was a kernel crash over the weekend, and I suspect there will be a 
request to update the kernel. There has been talk of upgrading Oracle to 
10.2.0.4 for some time, but the client has not been willing to move on it. I 
am looking for some professional insight into what can be upgraded versus what 
should be upgraded: just OCFS2 and ASM, or both of these as well as the RDBMS. 
The HBA drivers would also need to be upgraded. Any info or opinions would be 
greatly appreciated.
If any additional info is needed, please do not hesitate to ask.


Thanks



Re: [Ocfs2-users] Why OCFS2 with RAC

2010-06-16 Thread Ulf Zimmermann
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of David Johle
 Sent: Wednesday, June 16, 2010 10:37 AM
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] Why OCFS2 with RAC
 
 I have been a user of OCFS2 for quite some time now (2 years or so)
 and a user of Oracle RAC for several years as well.  My usage of
 these two is completely independent though, the cluster filesystem is
 for application level usage (web servers, etc.) only because RAC can
 manage its own shared storage directly.  And that brings me to my
 observation/question...

We have been running Oracle RAC on top of OCFS1 and OCFS2 for close to, or a 
bit over, 7 years now, I think.

Throughout that time we had major issues only twice, and in both cases they 
were quickly solved with the help of Sunil and other OCFS developers.

 
 I keep seeing a lot of messages on the list relating to the use of
 ocfs2 on systems running RAC--often with problems in the OCFS2 camp,
 but that's just the nature of this list.  What I'm trying to
 understand is why one would even consider adding OCFS2 to the systems
 running RAC.  To me that is just adding complexity to critical
 systems, and complexity typically results in decreased
 reliability/availability in the long term.

For us it was the only choice because, with the number of databases we have, 
our DBAs were not willing to go to raw storage (ASM wasn't available yet at 
the time, I believe). OCFS2 has worked pretty well for us.

 
 I have iSCSI targets, available on multiple paths, presented as block
 devices with device-mapper-multipath.  My data & flash recovery are
 then managed by ASM, which directly uses those block devices.  For my
 OCR & voting disks, one could do the same with block devices,
 although I know there were some issues with using block devices on
 certain versions, but raw devices are another option there.  In my
 case, I simply use libraw to present the iSCSI based block devices as
 raw devices, and that's what they use.
 
 Having RAC directly deal with the shared storage eliminates a lot of
 filesystem level overhead and removes a potential outside force that
 could unexpectedly bring DB nodes down (i.e. ocfs2 cluster stack
 fencing itself).  Not to mention the reduction in system
 administration resources for managing the filesystem.
 
 One place I could see a use for the OCFS2 is a shared ORACLE_HOME
 among nodes, but that has its own pros & cons which can be debated on
 some other mailing list :)
 
 
 So I'm curious, what benefits are there to having the OCFS2 available
 on the RAC system, more so related to using it for CRS and DB storage
 purposes?
 


[Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Ulf Zimmermann
We are setting up 2 new EL5 U4 machines to replace the current database 
servers running our demo environment. We use 3Par SANs and their snap clone 
options. The current production system we snap & clone from is EL4 U5 with 
ocfs2 1.2.9; the new servers have ocfs2 1.4.3 installed. Part of the refresh 
process is to run fsck.ocfs2 on the volume to recover it, but right now, as I 
am trying to run it on our 700GB volume, it shows a virtual memory size of 
21.9GB and a resident size of 10GB, and it is killing the machine with 
swapping (24GB physical memory).

Can anyone enlighten me as to what is going on?

Ulf.




Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Ulf Zimmermann
Correction: the kernel modules are 1.4.4; the tools and console are 1.4.3.




Re: [Ocfs2-users] fsck.ocfs2 using huge amount of memory?

2010-05-20 Thread Ulf Zimmermann
And upgrading to kernel modules 1.4.7, tools 1.4.4 didn't change the memory 
part:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
29532 root      18   0 21.9g  10g    4 D 21.1 45.0   0:15.24 fsck.ocfs2




Re: [Ocfs2-users] OCFS2 works like standalone

2010-05-05 Thread Ulf Zimmermann
OCFS needs shared storage; your /dev/sda sounds like local storage, not shared.
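
One rough way to spot this situation is to look at which nodes a full detect
(mounted.ocfs2 -f, as quoted below) lists for the device. The Python sketch
below assumes the three-column layout shown in the quoted output (device,
filesystem, nodes) and that several node names would be separated by commas or
spaces; the function names are hypothetical.

```python
def nodes_per_device(full_detect_output):
    """Map each ocfs2 device to the list of node names printed next to it."""
    devices = {}
    for line in full_detect_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header and anything that is not an ocfs2 device row.
        if len(fields) < 2 or fields[1] != "ocfs2":
            continue
        names = fields[2] if len(fields) == 3 else ""
        devices[fields[0]] = names.replace(",", " ").split()
    return devices


def missing_nodes(devices, device, expected_nodes):
    """Return the expected nodes that are not listed for the device."""
    seen = set(devices.get(device, []))
    return sorted(set(expected_nodes) - seen)
```

If a node that has the volume mounted never shows up in the list on the other
node, the two hosts are probably looking at different disks.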


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of v...@ghs.l.google.com
Sent: Thursday, March 18, 2010 11:16 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] OCFS2 works like standalone

I have installed OCFS2 on two SuSE 10 nodes.
At first sight everything seems to work nicely.
But /dev/sda (ocfs2) on rac1 is not being shared over the network (port ) with 
rac0.

On both nodes I have 500MB /dev/sda disks that are mounted (and are ocfs2), but 
they do not share their content (files and folders) with each other. When I 
create a file on one node I expect to see it on the other node, but it does 
not appear. How do I make OCFS2 share the same disk between both nodes? 
(mounted.ocfs2 -f shows only one node using this disk.)



1.  The two nodes are connected with 1Gb interconnect cards.

2.  netstat on both nodes shows that they are listening on 

3.  I can telnet from one node to the other on  port (the connection is 
established and then closed with the ^ character, so that works).

4.  Both nodes are configured correctly; see below (this is rac1; rac0 gives 
an analogous result).

5.  ocfs2console's "Configure Nodes" shows both nodes, Propagate was 
performed, and the device is mounted at the mount point.

6.  On both nodes I have 500MB /dev/sda disks that are mounted (and are 
ocfs2), but they do not share their content (files and folders). When I create 
a file on one node I expect to see it on the other node, but it does not 
appear. How do I make OCFS2 share the same disk between both nodes? 
(mounted.ocfs2 -f shows only one node using this disk.)


rac1:/var/log # modinfo ocfs2
filename:   /lib/modules/2.6.16.21-0.8-default/kernel/fs/ocfs2/ocfs2.ko
author: Oracle
license:GPL
description:OCFS2 1.2.1-SLES Tue Apr 25 14:46:36 PDT 2006 (build sles)
version:1.2.1-SLES
vermagic:   2.6.16.21-0.8-default 586 REGPARM gcc-4.1
supported:  yes
depends:ocfs2_nodemanager,ocfs2_dlm,jbd,configfs
srcversion: B45E2E0A0B86D1E2295CD6B
rac1:/var/log #


rac1:/var/log # vi /etc/ocfs/cluster.conf
node:
ip_port = 
ip_address = 192.168.56.121
number = 0
name = rac1
cluster = ocfs2
node:
ip_port = 
ip_address = 192.168.56.101
number = 1
name = rac0
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2


rac1:~ # netstat -anlp | grep 
tcp        0      0 0.0.0.0:                0.0.0.0:*               LISTEN      -
rac1:~ #


rac1:~ # /etc/rc.d/o2cb status
Module configfs: Loaded
Filesystem configfs: Mounted
Module ocfs2_nodemanager: Loaded
Module ocfs2_dlm: Loaded
Module ocfs2_dlmfs: Loaded
Filesystem ocfs2_dlmfs: Mounted
Checking cluster ocfs2: Online
Checking heartbeat: Active
rac1:~ #


rac1:~ # /etc/rc.d/ocfs2 status
Active OCFS2 mountpoints:  /mnt/u01
rac1:~ #


rac1:~ # mounted.ocfs2 -f
Device    FS     Nodes
/dev/sda  ocfs2  rac1

dmesg says:
ocfs2_dlm: Nodes in domain (6BC17BABF90444138BFD125263D82586): 0
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,0) on (node 0, slot 0)

SuSE Linux 10
#uname -r
2.6.16.21-0.8-defaults

Thanks in advance

Re: [Ocfs2-users] OCFS2 Multipath Configuration

2010-03-18 Thread Ulf Zimmermann
On our 3Par SAN we use an O2CB_HEARTBEAT_THRESHOLD of 76 (180 seconds). This is 
the time needed for the controller to fully recover in case of a crash or 
software upgrade. Multipath is configured with a polling_interval of 10 and a 
no_path_retry of 60. With these settings we are able to survive a SAN switch 
crash (has happened), a SAN controller crash (has happened) and SAN controller 
upgrades.
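
As a cross-check on such numbers, the OCFS2 FAQ describes the disk heartbeat
timeout as (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds. The Python sketch below
simply applies that rule; the 2-second interval is taken from the FAQ and the
function names are hypothetical. By this rule a threshold of 76 works out to
150 seconds and 180 seconds would need 91, so the figure quoted above may be
approximate.

```python
HEARTBEAT_INTERVAL_SECONDS = 2  # disk heartbeat interval, per the OCFS2 FAQ


def threshold_to_seconds(threshold):
    """Seconds of missed disk heartbeats before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def seconds_to_threshold(seconds):
    """Smallest threshold whose timeout is at least the requested seconds."""
    iterations = -(-seconds // HEARTBEAT_INTERVAL_SECONDS)  # ceiling division
    return iterations + 1
```

For example, threshold_to_seconds(31) gives the 60-second timeout at which the
quoted report below stopped seeing fences.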


 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Elliott Perrin
 Sent: Thursday, March 18, 2010 6:58 PM
 To: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] OCFS2 Multipath Configuration
 
 I have seen node fences occur on all of our OCFS2 systems when the
 O2CB_HEARTBEAT_THRESHOLD is set less than 31 on both FC and iSCSI
 connected systems. In some testing I have done I found that all nodes
 would panic with O2CB_HEARTBEAT_THRESHOLD set to any value less than 15
 and we would see single node fencing with values between 15 and 30.
 With the O2CB_HEARTBEAT_THRESHOLD set to 31 or greater we do not see a
 fence occur when fabric disruption occurs, again on both iSCSI and FC.
 
 This is with a multipath configuration that includes
 
 polling_interval 5
 no_path_retry 5
 features 1 queue_if_no_path
 failback immediate
 selector round-robin 0
 
 as some of the configuration variables used in the setup.
 
 As long as the output of mount (not mounted.ocfs2) shows that your file
 system mount is from a /dev/mapper created multipath device then OCFS2
 is using the multipath provided device.
 
 
  -Original Message-
  From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
  boun...@oss.oracle.com] On Behalf Of Sunil Mushran
  Sent: Thursday, March 18, 2010 2:11 PM
  To: David Johle
  Cc: ocfs2-users@oss.oracle.com
  Subject: Re: [Ocfs2-users] OCFS2 Multipath Configuration
 
  Yeah.. mounted is a bit dumb. In the next release, it will recognize
  /dev/mapper devices. We still need to teach it to handle multipathing
  fully.
 
  David Johle wrote:
   I'm not sure about why mounted.ocfs2 is showing both the dm and the
   sd devices for the same volume.  But this could all be very similar
   to a problem I've experienced with OCFS2  mount finding the right
   device with multipathing.
  
   See the following thread for some more insight:
   http://oss.oracle.com/pipermail/ocfs2-users/2009-March/003391.html
  
  
  
   At 04:12 PM 3/15/2010, you wrote:
  
   Date: Wed, 16 Dec 2009 14:20:45 -0500
   From: Nyburg, Daryl daryl.nyb...@utoledo.edu
   Subject: [Ocfs2-users] OCFS2 Multipath Configuration
   To: ocfs2-users@oss.oracle.com
  
   Hello All,
   I am having some problems configuring OCFS2 to use only the
   multipath device name.  We have been doing failover testing with our
   iSCSI SAN, and as soon as we unplug one NIC the following messages
   appear in /var/log/messages and the system reboots.
  
  
   Dec 16 12:39:03 mcprac01 kernel: (56,6):o2hb_write_timeout:172 ERROR:
   Heartbeat write timeout to device dm-34 after 12 milliseconds
   Dec 16 12:39:03 mcprac01 kernel: (56,6):o2hb_stop_all_regions:1967
   ERROR: stopping heartbeat on all active regions.
   Dec 16 12:39:03 mcprac01 kernel: ocfs2 is very sorry to be fencing this
   system by restarting
  
  
   RHEL 5.3 Kernel 2.6.18-128.el5
   OCFS2 Versions
 ocfs2console-1.4.3-1.el5
 ocfs2-2.6.18-128.el5-1.4.2-1.el5
 ocfs2-tools-1.4.3-1.el5
  
   Why does this show all device paths? Is there any way to tell OCFS2 to
   ignore the /dev/sd* devices?
   $ mounted.ocfs2 -d
   Device      FS     UUID                                  Label
   /dev/sdf1   ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
   /dev/dm-25  ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
   /dev/sdaj1  ocfs2  2eaddbd4-fac6-4c83-a86d-357215730b23
   /dev/sdd1   ocfs2  199f76a6-280a-46c6-812e-50712170a823
   /dev/dm-32  ocfs2  199f76a6-280a-46c6-812e-50712170a823
   /dev/sdai1  ocfs2  199f76a6-280a-46c6-812e-50712170a823
  
  


Re: [Ocfs2-users] Fencing options

2010-01-13 Thread Ulf Zimmermann
 -Original Message-
 
 Questions:
 Can we set up redundant heartbeat ip connections?  Can we also add a
 disk heartbeat?  If it truly is network connectivity, can we set the
 timeout to be more lenient? And can we change the fencing to something
 other than machine reset? Eg unmount the volume, change it to read
 only, etc?

There is both a network and a disk heartbeat, AFAIK. The timeouts are
controlled via /etc/sysconfig/o2cb (on Red Hat at least; not sure whether
SUSE does it the same way). In there you have:

  # O2CB_ENABELED: 'true' means to load the driver on boot.
  O2CB_ENABLED=true

  # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
  O2CB_BOOTCLUSTER=dbtest

  # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
  O2CB_HEARTBEAT_THRESHOLD=76

  # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
  O2CB_IDLE_TIMEOUT_MS=3

  # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
  O2CB_KEEPALIVE_DELAY_MS=2000

  # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
  O2CB_RECONNECT_DELAY_MS=2000

The above values are what we use on our clusters: we have three 2-node, one
4-node and one 6-node cluster.
These are all running Red Hat EL4 on HP hardware (DL360 G4, G5 or DL380 G5).
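
As a rough sanity check on the heartbeat threshold: as far as I understand
the OCFS2 documentation, the disk heartbeat timeout works out to
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, because the heartbeat thread
writes once every two seconds. A minimal sketch of that arithmetic follows;
the function names are made up for illustration and are not part of any
OCFS2 tool.

```python
# Sketch: convert between O2CB_HEARTBEAT_THRESHOLD and the disk heartbeat
# timeout, assuming the 2-second heartbeat interval.
HEARTBEAT_INTERVAL_S = 2  # assumption: one heartbeat write every 2 seconds


def disk_timeout_seconds(threshold):
    """Seconds of missed disk heartbeats before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def threshold_for_timeout(seconds):
    """Threshold needed to tolerate `seconds` of stalled disk I/O."""
    return seconds // HEARTBEAT_INTERVAL_S + 1


if __name__ == "__main__":
    print(disk_timeout_seconds(76))    # threshold from the config above
    print(threshold_for_timeout(120))  # threshold for a 2-minute tolerance
```

Under that assumption, the threshold of 76 shown above corresponds to 150
seconds before a node is declared dead.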

 
 Thanks...
 
 Angelo
 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Persistent lun problem

2009-10-21 Thread Ulf Zimmermann
Look at device-mapper-multipath

Regards, Ulf.

-
OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Pedro Figueira
 Sent: 10/01/2008 11:40
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] Persistent lun problem
 
 Dear all
 
 I'm investigating the possible solutions for persistent device names to
 mount OCFS2 disks, without using LVM2, which is not supported. In RH4, if
 you reboot the server or add or remove LUNs, the device files
 (traditionally /dev/sdX) can change, so you can mount the filesystem on
 the wrong directory.
 
 From what I can see there are some solutions, like multipath or udev,
 to create persistent device files. However, the solution can be simpler
 if the /etc/fstab file indicates the label of the OCFS2 volume and not
 the device file.
 
 I've tried this with OCFS1.2 with good results, but my question is whether
 it's supported (i.e. whether mounting an OCFS2 volume by label is
 supported) and whether there is any problem with this approach. Will it
 work on future OCFS2 releases?
 
 Best regards and thanks
 
 Pedro Figueira
 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] migration methods (ocfs - ocfs2)

2009-10-21 Thread Ulf Zimmermann
 
 No, the fscat tools can only read certain unmounted file systems.
 They cannot write. You can use it to copy data from ocfs
 to ocfs2 on a box running the 2.6 kernel (sles9/10, el4/5).

And when we tried to use fscat about a year ago to read our OCFS volumes on a 
RHEL4 machine, it wasn't working. The source code itself had to be updated and, 
as far as I know, that has not happened. I had email contact with Joel Becker 
and he said he would look into it, but it never happened.

 
 Mehmet Can ÖNAL wrote:
 
  Hi everyone;
 
 
 
  we have a production system with 6 nodes of RAC on the ocfs file
  system. We will upgrade our production system and would prefer (and
  also should) migrate our data to ocfs2. Both methods of migration,
  fscat and backup/restore, could be useful for us, but one thing that we
  could not find is:
 
  1)   If we use fscp as the migration method, can we use fscp to
  pass data from ocfs to ocfs2? Is fscp two-way, i.e. can it copy both
  ocfs to ocfs2 and also ocfs2 to ocfs?
 
  2)   For backup/restore the doubt is the same. We can restore an ocfs
  backup to ocfs2; could we also restore an ocfs2 backup to an ocfs
  volume?
 
  Thanx a lot

Ulf Zimmermann | Senior System Architect

OPENLANE
4600 Bohannon Drive, Suite 100
Menlo Park, CA 94025

O: 650-532-6382  M: (510) 396-1764  F: (510) 580-0929

Email: u...@openlane.com | Web: www.openlane.com

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 mount points not automatically mounting onserver reboot

2009-08-06 Thread Ulf Zimmermann
Isn't your problem that you are setting the filesystem type to ocfs2_oracw 
and ocfs2_oragrid? Unless this has changed after 1.2.9, it should be ocfs2.
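
A quick way to spot this kind of mistake is to scan the third field of each
fstab line. The sketch below is only an illustration: the helper name is
made up, the field layout follows fstab(5), and the third sample entry
(/u04) is a hypothetical correct line added for contrast.

```python
# Sketch: flag fstab entries whose filesystem type looks like a mistyped
# "ocfs2". Fields per fstab(5): device, mount point, type, options, dump, pass.

def suspicious_ocfs2_entries(fstab_text):
    """Return (mount_point, fs_type) pairs where the type starts with
    'ocfs2' but is not exactly 'ocfs2'."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete entry
        mount_point, fs_type = fields[1], fields[2]
        if fs_type.startswith("ocfs2") and fs_type != "ocfs2":
            flagged.append((mount_point, fs_type))
    return flagged


SAMPLE = """\
/dev/mapper/mpath0  /u02  ocfs2_oracw    datavolume,nointr,_netdev  0 0
/dev/mapper/mpath1  /u03  ocfs2_oragrid  datavolume,nointr,_netdev  0 0
/dev/mapper/mpath2  /u04  ocfs2          datavolume,nointr,_netdev  0 0
"""

if __name__ == "__main__":
    for mount_point, fs_type in suspicious_ocfs2_entries(SAMPLE):
        print(f"{mount_point}: type {fs_type!r} should probably be 'ocfs2'")
```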


From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of McKinley, Reid
Sent: Thursday, August 06, 2009 12:07 PM
To: McKinley, Reid; Srinivas Eeda
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting 
onserver reboot

I should mention that the OCFS2 mount points mount fine when the manual mount 
is done after the reboot.
Cmds for manual mounting:

mount -o datavolume,nointr,_netdev,noatime -t ocfs2 /dev/mapper/mpath0 /u02
mount -o datavolume,nointr,_netdev,noatime -t ocfs2 /dev/mapper/mpath1 /u03


They just do not auto mount when rebooting.

/etc/fstab entries look like this:

/dev/mapper/mpath0  /u02  ocfs2_oracw    datavolume,nointr,_netdev  0 0
/dev/mapper/mpath1  /u03  ocfs2_oragrid  datavolume,nointr,_netdev  0 0




From: ocfs2-users-boun...@oss.oracle.com 
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of McKinley, Reid
Sent: Thursday, August 06, 2009 2:47 PM
To: Srinivas Eeda
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting 
onserver reboot

Thanks, Srini.

We cannot locate any other applicable msgs in the dmesg for this.  Here is a 
portion.

SCSI device sds: drive cache: write back
 sds: unknown partition table
SCSI device sdt: 57688320 512-byte hdwr sectors (29536 MB)
sdt: Write Protect is off
sdt: Mode Sense: 6b 00 00 08
SCSI device sdt: drive cache: write back
 sdt: unknown partition table
SCSI device sdu: 57688320 512-byte hdwr sectors (29536 MB)
sdu: Write Protect is off
sdu: Mode Sense: 6b 00 00 08
SCSI device sdu: drive cache: write back
 sdu: unknown partition table
SCSI device sdv: 57688320 512-byte hdwr sectors (29536 MB)
sdv: Write Protect is off
sdv: Mode Sense: 6b 00 00 08
SCSI device sdv: drive cache: write back
 sdv: unknown partition table
SCSI device sdw: 57688320 512-byte hdwr sectors (29536 MB)
sdw: Write Protect is off
sdw: Mode Sense: 6b 00 00 08
SCSI device sdw: drive cache: write back
 sdw: unknown partition table
SCSI device sdx: 57688320 512-byte hdwr sectors (29536 MB)
sdx: Write Protect is off
sdx: Mode Sense: 6b 00 00 08
SCSI device sdx: drive cache: write back
 sdx: unknown partition table
Hangcheck: starting hangcheck timer 0.9.0 (tick is 1 seconds, margin is 10 seconds).
Hangcheck: Using monotonic_clock().

Here is the additional info.

[r...@servername01 rc6.d]# ls
K02avahi-daemon    K20oracleasm   K74ntpd         K89netplugd
K02avahi-dnsconfd  K25sshd        K75netfs        K89rdisc
K02haldaemon       K30sendmail    K80kdump        K90network
K03rhnsd           K35smb         K85mdmonitor    K92ip6tables
K03yum-updatesd    K35winbind     K85mdmpd        K92iptables
K05atd             K50netconsole  K85messagebus   K95kudzu
K05saslauthd       K50snmpd       K85rpcgssd      K96init.crs
K10cups            K50snmptrapd   K85rpcidmapd    K97sysstat
K10psacct          K50xinetd      K86nfslock      K99lvm2-monitor
K10Tivoli_lcf1     K60crond       K87irqbalance   K99microcode_ctl
K10xfs             K69rpcsvcgssd  K87mcstrans     S00killall
K15httpd           K72autofs      K87multipathd   S01reboot
K19ocfs2           K73ypbind      K87portmap
K20nfs             K74lm_sensors  K87restorecond
K20o2cb            K74nscd        K88syslog
[r...@chastoemgc01 rc6.d]#


From: Srinivas Eeda [mailto:srinivas.e...@oracle.com]
Sent: Thursday, August 06, 2009 1:52 PM
To: McKinley, Reid
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 mount points not automatically mounting on 
server reboot

Reid,

that error about stackglue module is harmless and can be ignored.
Are you seeing any other errors associated with mounts failing in dmesg? Is the 
storage up by then? Can you list ls /etc/rc<runlevel>.d/

thanks,
--Srini



McKinley, Reid wrote:
Our ocfs2 mount points will not mount on server reboot.
This is a critical issue because we are storing the Oracle Clusterware OCR and 
Voting files on a OCFS2 mount point.

We receive this error in the system log:

modprobe: FATAL: Module ocfs2_stackglue not found.

/etc/fstab entries look like this:

/dev/mapper/mpath0  /u02  ocfs2_oracw    datavolume,nointr,_netdev  0 0
/dev/mapper/mpath1  /u03  ocfs2_oragrid  datavolume,nointr,_netdev  0 0

We are using the following config:

[r...@servername02 ~]# uname -r
2.6.18-92.el5
[r...@servername02 ~]# rpm -qa|grep -i ocfs2
ocfs2-tools-devel-1.4.2-1.el5
ocfs2-2.6.18-92.el5-1.4.2-1.el5
ocfs2-tools-1.4.2-1.el5
ocfs2console-1.4.2-1.el5

We have the ocfs2 and o2cb services configured to restart:

[r...@servername02 ~]# chkconfig --list |grep ocfs
ocfs2   0:off   1:off   2:on    3:on    4:on    5:on    6:off
[r...@servername02 ~]# chkconfig 

Re: [Ocfs2-users] OCFS2 FS with BACKUP Tools/Vendors

2009-04-02 Thread Ulf Zimmermann
 From: ocfs2-users-boun...@oss.oracle.com 
 [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Daniel Keisling
 Sent: Thursday, April 02, 2009 1:22 PM
 To: Bumpass, Brian; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] OCFS2 FS with BACKUP Tools/Vendors

 I use HP Data Protector.  OCFS2 is supported in v6.0.

How do you do that? I don't see that support in either 6.0 or 6.1.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] mounting mpath devices

2009-02-27 Thread Ulf Zimmermann
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Masaryk Kevin D
 Sent: 02/24/2009 13:21
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] mounting mpath devices
 
 I'm seeing some strange behavior from OCFS2 while trying to mount
 mpath/dm-multipath devices under RHEL5. Sometimes I can mount the
 EMC-connected, dual-pathed volumes just fine and sometimes I get
 device busy errors. I've tried mounting by label and also by explicit
 /dev/mapper/mpathXpY name with the same unpredictable behavior. I've
 also noticed that sometimes when a device is successfully mounted on
 all nodes, each node may return different output from the mount command
 regarding the device mounted; e.g. /dev/mapper/mpath0p1 vs. /dev/dm-17.
 
 One consistent aspect I have noticed whenever I receive the device
 busy error is that the /dev/dm-X names don't match up on each node. I
 also see that ocfs2console refers to each device by the /dev/dm-X name
 instead of the /dev/mapper/XX name.
 
 I guess my question is simply: Are dm-multipath devices supported under
 OCFS2? Are multipathed devices not recommended with OCFS2? Any
 documentation available on this?

We are using OCFS2 with device-mapper-multipath (4 paths) and we use the
/dev/mapper/XX names.

The names match across machines, although we did not pay particular
attention to making them the same.

Regards, Ulf.

-
OPENLANE Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] strange node reboot in RAC environment

2009-02-03 Thread Ulf Zimmermann
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Pedro Figueira
 Sent: 02/03/2009 09:07
 To: ocfs2-users@oss.oracle.com
 Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
 Subject: [Ocfs2-users] strange node reboot in RAC environment
 
 Hi all
 
 We have a 4-node Oracle RAC with the following software versions:
 
 Oracle and clusterware version 10.2.0.4
 Red Hat Enterprise Linux AS release 4 with kernel version
 2.6.9-55.ELlargesmp
 ocfs2-tools-1.2.4-1
 ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
 ocfs2console-1.2.4-1
 timeout parameters:
   Heartbeat dead threshold: 31
   Network idle timeout: 1
   Network keepalive delay: 5000
   Network reconnect delay: 2000
 
 Until late last year the cluster was rock solid (hundreds). From
 January onward all the servers started to reboot in sync, but the
 strange thing is that there are no log messages in /var/log/messages,
 so we don't know if this is an ocfs2-related problem. These reboots seem
 to be related to the backup process (maybe extra load?). Other reboots
 only affect 2 out of 4 nodes.

As ocfs2 will print messages to the console and they might not get captured
by anything, I recommend setting up the virtual serial port of iLO and using
something like conserver to attach a console to that virtual serial port. I
do this for all our OCFS hosts and have a log of everything going on,
including the BIOS screen. If ocfs2 is fencing because of I/O issues, it
will show there.

 
 Last night we updated the firmware and drivers from HP of the DL580G4
 server and today we had another reboot (now with the following messages
 in /var/log/messages):
 
 NODE 1:
 --
 Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
 (num 3) at 10.0.2.52: has been idle for 10.0 seconds, shutting it
 down.
 Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
 some times that might help debug the situation: (tmr 1233670362.97595
 now 1233670372.96280 dr 1233670362.97580 adv
 1233670362.97604:1233670362.97604 func (c77ed98a:504)
 1233670067.138220:1233670067.138233)
 Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
 grid2db4 (num 3) at 10.0.2.52:
 Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
 Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded
 
 NODE 4:
 --
 Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
 Heartbeat write timeout to device sdl after 6 milliseconds
 Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
 blocking operations (cur = 18):
 Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
 Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
 
 Other reboots simply don't log any error message.
 
 So my question is whether it's possible these reboots are triggered by
 OCFS2, and how to debug this problem. Should I change the timeout
 parameters?
 
 We are also planning to upgrade to OCFS2 1.2.9-1 and OCFS2 Tools
 1.2.7-1 and the latest distro kernel, any catch?
 
 Best regards and thanks for any answer.
 
 Pedro Figueira
 Serviço de Estrangeiros e Fronteiras
 Direcção Central de Informática
 Departamento de Produção
 Telefone: + 351 217 115 153
 
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Sunil Mushran
 Sent: sábado, 31 de Janeiro de 2009 15:59
 To: Carl Benson
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] one node rejects connection from new node
 
 Nodes can be added to an online cluster. The instructions are listed
 in the user's guide.
 
 On Jan 31, 2009, at 7:53 AM, Carl Benson cben...@fhcrc.org wrote:
 
  Sunil,
 
  Thank you for responding. I will try o2cb_ctl on Monday, when I have
  physical access to hit Reset in case one or more nodes lock up.
 
  If there really is a requirement to restart the cluster on wilson1
  every time
  I add a new node (and I have five or six more nodes to add), that is
  too
  bad. Wilson1 is a 24x7 production system.
 
  --Carl Benson
 
  Sunil Mushran wrote:
  Could be that the cluster was already online on wilson1 when you
  propagated the cluster.conf to all nodes. If so, restart the cluster
  on that node.
 
  To add a node to an online cluster, you need to use the o2cb_ctl
  command. Details are in the 1.4 user's guide.
 
 
  Carl J. Benson wrote:
 
  Hello.
 
  I have three systems that share an ocfs2 filesystem, and I'm
  trying to add a fourth system.
 
  These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
  All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
 
  cluster.conf looks like this:
  node:
 ip_port = 
 ip_address = 140.107.170.116
 number = 0
 name = merlot1
 cluster = ocfs2
 
  node:
 ip_port = 
 ip_address = 140.107.158.54
 number = 1
 name = 

Re: [Ocfs2-users] ocfs2 hangs during webserver usage

2009-01-28 Thread Ulf Zimmermann
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of David Johle
 Sent: 01/28/2009 10:12
 To: jmose...@corp.xanadoo.com
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] ocfs2 hangs during webserver usage
 
 At 06:32 PM 1/27/2009, jmose...@corp.xanadoo.com wrote:
 As others have indicated, I don't think that's going to work very
 well.
 You've got two different nodes trying to write to the same file
 constantly.
 I would keep each server's log on a locally mounted file system, or
 simply
 keep the logs on the OCFS2 filesystem, but have each node write to
 different log files.
 
 Yeah, that makes parsing access_logs slightly more of a problem for
 producing hit reports, etc, but I think you'll notice performance
 improve.
 
 
 Yes, parsing logs is just one good reason for having unified log
 files -- one of the motivations for using OCFS2 even.  If our
 statistics program can handle multiple files, then at least having
 them in a shared directory would be useful.
 
 Another major area this would affect is web site issue
 troubleshooting which outputs to log files (not the access logs but
 others).  I can only imagine the complexity of having to deal with
 locating specific logging information for a site user who is having
 trouble by going to 5 different nodes to dig through locally stored
 log files.  Or worse yet, trying to correlate actions of multiple
 users who are each hitting different nodes!
 
 On that note, these other logs are written to by our applications
 running under Tomcat.  I really am not seeing any similar lags for
 those processes, only from apache.  The only big difference I can see
 between them is the I/O pattern -- apache is usually 1 line per
 request as they are serviced, java web apps are more bursts of
 numerous lines, but not every request.  There is still a non-trivial
 amount of logging happening for these java apps though, so I am
 surprised.  In fact, Tomcat itself is configured to log each request
 with the processing time (used to produce user response time
 statistics), but those shared logs don't seem to be a point of
 contention like the apache access logs.
 
 For informational purposes, here are some line counts for logs on our
 main web site yesterday:
1577860 access log
   1361 error log
4887437 web app log
 340164 processing time log
6806822 total
 
 So only about 20% of the requests are handled by Tomcat.  The web app
 log actually writes 3x as many lines, but overall it's less data
 (373M vs. 428M) and fewer actual write operations.  This could
 explain why it is not/less prone to these write delays.

1.5 million hits for an access log is not that much, and you should be able
to use separate files and then combine them into one before processing. The
tools are out there for that. Another option is to send the Apache logs to
syslog, which means you now have a single process receiving and writing the
log files.
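
If you go the separate-files route, combining them is a simple timestamp
merge. Here is a minimal sketch, assuming each node's log is already in time
order and uses the Apache common/combined format with the timestamp in
square brackets; the function names and sample lines are made up for
illustration.

```python
import heapq
from datetime import datetime

# Assumption: the timestamp sits between the first '[' and ']' of each
# line, e.g. [27/Jan/2009:18:32:01 -0600].
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def timestamp(line):
    """Extract the request time from one access log line."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], TIME_FORMAT)


def merge_logs(*streams):
    """Lazily merge per-node logs (each already sorted by time)."""
    return heapq.merge(*streams, key=timestamp)


if __name__ == "__main__":
    node1 = [
        '10.0.0.1 - - [27/Jan/2009:18:32:01 -0600] "GET /a HTTP/1.1" 200 512',
        '10.0.0.2 - - [27/Jan/2009:18:32:05 -0600] "GET /c HTTP/1.1" 200 128',
    ]
    node2 = [
        '10.0.0.3 - - [27/Jan/2009:18:32:03 -0600] "GET /b HTTP/1.1" 404 0',
    ]
    for line in merge_logs(node1, node2):
        print(line)
```

Because heapq.merge is lazy, open file objects can be passed in place of the
lists, so the per-node files never have to be loaded into memory at once.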


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Another node is heartbeating in ourslot! errorswith LUN removal/addition

2008-12-05 Thread Ulf Zimmermann
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Brian Kroth
 Sent: 12/05/2008 06:11
 To: Daniel Keisling
 Cc: ocfs2-users@oss.oracle.com; Joel Becker
 Subject: Re: [Ocfs2-users] Another node is heartbeating in ourslot!
 errorswith LUN removal/addition
 
 Just for clarity, can you post the proper sequence you're now using to
 take SAN based snapshots?  I'd like to try this on a new cluster I'm
 setting up.
 
 Thanks,
 Brian
 

Here is how we do backup and refresh of development databases from our
production database. The SANs involved in this are 3Par E200 and S400
using Rcopy and SnapClone.

Production database gets put into backup mode
Execute either Rcopy refresh or SnapClone Refresh on S400 (Rcopy for
Dev, SnapClone for Backup)
Take production database out of backup mode

For backups we continue with:

Run fsck to replay the journal
Mount and unmount the volume on the backup server to clear the dirty flag;
just running fsck will not do that
Reset the UUID and label of the volume at this point, so we do not run
into issues if we want to mount two different versions of the SnapClone
Run fsck one more time to ensure there are no errors
Mount volume
Recover database via log files
Clean shutdown of database
Back up database in cold state
Unmount volume

For development database refresh:

Rcopy above refreshed a master volume on the E200 SAN
Shut down development database X (we have several copies) and unmount
the volume on all nodes in the RAC cluster
Run SnapClone refresh command on SAN
Run fsck from one node to replay journal
Mount and unmount volume on one node to clear dirty flag
Reset UUID and label
Run fsck one more time
Mount volume on all RAC nodes again
Recover database from logs and change its name to the development name
At this point other scripts run to modify contents in the database
(email addresses, phone numbers, etc.)
And voilà, the database is ready for use by developers.



Ulf Zimmermann | Senior System Architect

OPENLANE
4600 Bohannon Drive, Suite 100
Menlo Park, CA 94025

O: 650-532-6382  M: (510) 396-1764  F: (510) 580-0929

Email: [EMAIL PROTECTED] | Web: www.openlane.com

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Node reboot during network outage

2008-07-07 Thread Ulf Zimmermann
We are running a bonded interface too, with two separate switches behind
it. We are using two Cisco 2960s for this interconnect; they run separate
VLANs for different Oracle clusters, but we do not run spanning tree on
them. A failover takes about 100 ms for us, and we simulated a switch
failure by turning one off. We also had an Ethernet port in a machine
itself fail, and the bond used the second port almost immediately.


Ulf Zimmermann | Senior System Architect

OPENLANE
4600 Bohannon Drive, Suite 100
Menlo Park, CA 94025

O: 650-532-6382  M: (510) 396-1764  F: (510) 580-0929

Email: [EMAIL PROTECTED] | Web: www.openlane.com


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Sunil Mushran
 Sent: 04/22/2008 10:20
 To: Mick Waters
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Node reboot during network outage
 
 The issue is not the time the switch takes to reboot. The issue is the
 amount of time the secondary switch takes to find a unique path.
 
 http://en.wikipedia.org/wiki/Spanning_tree_protocol
 
 Mick Waters wrote:
  Thanks Sunil,
 
  The network switch is brand new but has a fairly complex
 configuration due to us running a number of VLANs - however, we have
 found that it has always taken quite a while to reboot.
 
  I'll try increasing the idle timeout as suggested and let you know
 what happens.  However, surely this is only treating the symptoms of
 what is, after all, a contrived scenario.  Rebooting the switch is
 supposed to test what would happen if we had a real network outage.
 What if the switch were to stay down?
 
  My issue is that we have an alternative route via the other NIC in
 the bond and the other switch.  The affected nodes in the cluster
 shouldn't fence because they should still be able to see all of the
 other nodes in the cluster via this other route.
 
  Does this make sense?
 
  Regards,
 
  Mick.
 
  -Original Message-
  From: Sunil Mushran [mailto:[EMAIL PROTECTED]
  Sent: 22 April 2008 17:40
  To: Mick Waters
  Cc: ocfs2-users@oss.oracle.com
  Subject: Re: [Ocfs2-users] Node reboot during network outage
 
  The interface died at 14:25:44 and recovered at 14:27:43.
  That's two minutes.
 
  One solution is to increase o2cb_idle_timeout to  2mins.
 
  Better solution would be to look into your router setup to determine
 why it is taking 2 minutes for the router to reconfigure.
 
  Mick Waters wrote:
 
  Hi, my company is in the process of moving our web and database
  servers to new hardware.  We have an HP EVA 4100 SAN which is being
  used by two database servers running in an Oracle 10g cluster and that
  works fine.  We have gone to extreme lengths to ensure high
  availability.  The SAN has twin disk arrays, twin controllers, and all
  servers have dual fibre interfaces.  Networking is (should be)
  similarly redundant with bonded NICs connected in a two-switch
  configuration, two firewalls and so on.
 
  We also want to share regular Linux filesystems between our servers -
  HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) - and we
  chose OCFS2 (1.2.8) to manage the cluster.
 
  As stated, each server in the 4-node cluster has a bonded interface
  set up as bond0 in a two-switch configuration (each NIC in the bond is
  connected to a different switch).  Because this is a two-switch
  configuration, we are running the bond in active-standby mode and this
  works just fine.
 
  Our problem occurred when we were doing failover testing where we
  simulated the loss of one of the network switches by powering it off.
  The result was that the servers rebooted, and this makes a mockery of
  our attempts at an HA solution.
 
  Here is a short section from /var/log/messages following a reboot of
  one of the switches to simulate an outage:
 
 

 --
  Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface
  eth0 is now down
  Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
  Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2
  (num 1) at 10.1.3.50: has been idle for 30.0 seconds, shutting it down.
  Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here
  are some times that might help debug the situation: (tmr
  1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv
  1208870743.673433:1208870743.673434 func (97690d75:2)
  1208870697.670758:1208870697.670760)
  Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node
  mtkdb01p2 (num 1) at 10.1.3.50:
  Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps
  full duplex
  Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface
  eth0 is now up
  Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418
  ERROR: link to 1 went down!
  Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995
  ERROR: status = -107
  Apr 22 14:28:35 mtkws01p1

[Ocfs2-users] OCFS2 and Cloning

2008-02-25 Thread Ulf Zimmermann
I am currently working on cloning our production OCFS2 volumes to our
test environment on a regular basis. For the database (Oracle 10g R2
RAC) we put it into backup mode, then execute a SnapClone on our 3Par
SAN. Then we use RemoteCopy and SnapClone to our development 3Par SAN.

To recover the OCFS2 volume I go through the following steps:

Stop database
umount /export/<volume name>
Log into SAN to refresh SnapClone
fsck.ocfs2 -y /dev/mapper/<volume name>
mount /export/<volume name>
umount /export/<volume name>
tunefs.ocfs2 -U /dev/mapper/<volume name>
tunefs.ocfs2 -L /export/<volume name> /dev/mapper/<volume name>
mount /export/<volume name>
Go through steps to recover and rename database
Start database
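
For scripting this, the filesystem part of the sequence can be laid out as a
plain command list and reviewed before anything is run. The sketch below
only prints the commands and executes nothing. It assumes the ocfs2-tools
binary names fsck.ocfs2 and tunefs.ocfs2, that the mount point has an fstab
entry, and a hypothetical volume name "dbclone"; the database and SAN steps
are left out.

```python
import shlex

# Sketch: build the filesystem steps of the refresh as a command list.
# Nothing is executed here; the volume name passed in is a placeholder.


def refresh_commands(volume):
    device = f"/dev/mapper/{volume}"
    mount_point = f"/export/{volume}"
    return [
        ["umount", mount_point],                      # before the SAN refresh
        ["fsck.ocfs2", "-y", device],                 # replay the journal
        ["mount", mount_point],                       # mount/umount once so that
        ["umount", mount_point],                      # tunefs sees a clean volume
        ["tunefs.ocfs2", "-U", device],               # reset the UUID
        ["tunefs.ocfs2", "-L", mount_point, device],  # set the label
        ["mount", mount_point],
    ]


if __name__ == "__main__":
    for cmd in refresh_commands("dbclone"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

The SnapClone refresh on the SAN would happen between the first umount and
the fsck, as in the list above.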

This seems to work, although I am curious why I have to mount/umount
the volume in between fsck and tunefs. The fsck obviously will go
through and recover the journal, but unless I mount/umount the volume
once, tunefs will report a dirty file system.

Are there any other steps I should be doing, or does this sequence look
OK?

Regards, Ulf.

-
ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


RE: [Ocfs2-users] Anyone have an idea how to find file i/othroughput?

2008-02-19 Thread Ulf Zimmermann
I will look at it. In the meantime I found at least one of the standby
processes reading in bursts every 60-70 seconds, e.g. 400MB in 14.xx
seconds from control01.ctl, even though that file is only 94MB.

 -Original Message-
 From: Andrew Phillips [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 19, 2008 01:43
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: RE: [Ocfs2-users] Anyone have an idea how to find file
 i/othroughput?
 
 Ulf,
 
   Have you considered using systemtap? There is a recipe here that could
 be used to find out what's going on:
 
 http://sourceware.org/systemtap/wiki/WSDeviceMonitor?highlight=%28%28WarStories%29%29
 
 I'm not sure how well that would work with ocfs2. Unlike dtrace,
 systemtap can be more uneven in coverage. It's also something that
 requires a bit of fiddling (installing debuginfo packages).
 
 The recipe above traps vfs_read and vfs_write, so it should work as a
 first stab at identifying the process ID that's causing the I/O.
 
 I'd also advise some thought if it's to be used in a production
 environment. Having said that, I've used it on a production Oracle RAC
 database server and found it very valuable.
 
 I don't recall you mentioning the distribution, but RH, CentOS, and
 Oracle's version of CentOS should all work.
 
As always, read the instructions on the label, etc...
 
  Andy
 
 
 On Mon, 2008-02-18 at 23:14 -0800, Ulf Zimmermann wrote:
  Forgot to mention, this remote server is just Oracle. It has one
  standby database and one local database; the local one is supposed to
  be idle, i.e. nothing connecting to it, besides once in a while for an
  availability check.
 
  While the primary database of the standby was down, I saw less disk
  read access, but every 5 minutes for about 60 seconds I would see
  50-60MB/sec. After the primary came back up, read access is as high as
  160MB/sec.
 
  We are only seeing it on this single node of the remote standby. The
  local standby (on EXT3) is not doing the same thing.
 
   -Original Message-
   From: Sunil Mushran [mailto:[EMAIL PROTECTED]
   Sent: Monday, February 18, 2008 19:28
   To: Ulf Zimmermann
   Cc: ocfs2-users@oss.oracle.com
   Subject: Re: [Ocfs2-users] Anyone have an idea how to find file
i/o
   throughput?
  
   If a userspace process is behind the io surge, then strace should
  help.
   But determining the process may require a bit of trial and error.
  
   Ulf Zimmermann wrote:
    We got a remote Oracle 10g R2 standby running on OCFS2. Initially,
    when we started the standby, read I/O was  5MB/sec on average. Since
    then it has grown to over 40MB/sec (longer average, it peaks much
    higher). Here is a graph showing this:
   
    http://www.alameda.net/~ulf/dbphx01.png
   
    We also have a local standby running (on EXT3) which is not showing
    the same symptom. I am trying to find where all these reads are
    happening. Anyone have an idea how to figure that out on Linux?
   
Ulf.
   
   


[Ocfs2-users] OCFS2 DLM problems

2008-01-23 Thread Ulf Zimmermann
Hello everyone, once again.

We are running into a problem which has now shown up 2 times, possibly 3
(once the systems looked different).

The environment is 6 HP DL360/380 g5 servers with eth0 being the public
interface, and eth1 and bond0 (eth2 and eth3) used for clusterware, with
bond0 also used for OCFS2. The bond0 interface is in active/passive mode.
No network error counters are showing, and even during the problem
we can communicate via the bond0 interface. This setup has been running
for more than 2 months, but last Wednesday morning and again today we
had 2 nodes causing locking problems. The problem starts with messages
like this:

Jan 23 03:20:44 dbprd01 kernel: o2net: no longer connected to node dbprd02 (num 1) at 192.168.202.2:
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107

Jan 23 03:20:44 dbprd02 kernel: (5096,0):o2net_sendpage:868 ERROR: sendpage of size 24 to node dbprd01 (num 0) at 192.168.202.1: failed with -11
Jan 23 03:20:44 dbprd02 kernel: o2net: no longer connected to node dbprd01 (num 0) at 192.168.202.1:

After these there are plenty more messages, such as
dlm_wait_for_node_death and dlm_send_remote_convert_request on dbprd02,
and dlm_send_proxy_ast_msg and dlm_flush_asts on dbprd01.

We are currently running OCFS2 1.2.5, the kernel is EL4 Update 5 x86_64
(2.6.9-55.ELsmp).

I see there is one bug fixed in 1.2.6/1.2.7 related to DLM and I was
wondering if the above problem could be related to it or if this is
something different.
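
For readers decoding the status values in the log excerpt above: kernel functions report errors as negated errno numbers. A minimal sketch of the lookup follows, assuming the standard Linux numbering on x86/x86_64; the table covers only the two codes that appear in these logs, and the helper name is illustrative.

```python
# Linux errno numbers for the two codes seen in the logs (x86/x86_64 numbering).
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable; the caller should try again"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}


def decode_status(status):
    """Translate a negative kernel status such as -107 into (name, description)."""
    return LINUX_ERRNO.get(abs(status), ("UNKNOWN", "not in this small table"))
```

Read this way, the -11 from o2net_sendpage on dbprd02 is a "try again" condition, and the -107 errors on dbprd01 follow from the connection between the two nodes having been dropped.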


Ulf Zimmermann | Senior System Architect

ATC-Onlane, Inc.
4600 Bohannon Drive, Suite 100
Menlo Park, CA 94025

O: 650-532-6382  M: (510) 396-1764  F: (510) 580-0929

Email: [EMAIL PROTECTED] | Web: www.atc-onlane.com

DISCLAIMER:
This e-mail and any attachments are confidential and also may be
privileged. If you are not the named recipient, or have otherwise
received this communication in error, please delete it from your inbox,
notify the sender immediately, and do not disclose its contents to any
other person, use them for any purpose, or store or copy them in any
medium. Thank you for your cooperation.





RE: [Ocfs2-users] OCFS2 DLM problems

2008-01-23 Thread Ulf Zimmermann
Currently running 1.2.5-1 so we should upgrade. Is there any explanation
how this bug gets triggered? We are trying to understand why we are
suddenly hitting this bug, as this has been running for several months
without being triggered.

-Original Message-
From: Sunil Mushran [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 23, 2008 9:58 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 DLM problems

1.2.5-what?

If you are not on 1.2.5-6, upgrade to that. It could be you are hitting
the following issue addressed in that release.

r3033 tcp - Retry sendpage() if it returns EAGAIN (bugzilla#896)

No, don't upgrade to 1.2.7. We just discovered an issue in it and will
be releasing 1.2.8 shortly.



RE: [Ocfs2-users] OCFS2 DLM problems

2008-01-23 Thread Ulf Zimmermann
It looks like around 3:20am we had about 800 to 1,200 packets per second
coming in per node. But the packets were not large; it looks like less
than 1Mbit/sec. 4 of the nodes are connected to our front-end
application servers and they would be pretty much idle at 3am. Our first
customers usually do not log in until just about then (East Coast people
starting to get to the dealerships) and only in small numbers. We did
not have much batch processing happening on the 5th and 6th nodes.
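
As a quick sanity check on those figures, the packet rate and the bit rate together bound the average packet size. This is a small arithmetic sketch; it assumes 1 Mbit means 1,000,000 bits and treats 1 Mbit/sec as the upper limit mentioned above.

```python
def avg_packet_bytes(mbit_per_sec, packets_per_sec):
    """Average packet size in bytes implied by a bit rate and a packet rate."""
    return mbit_per_sec * 1_000_000 / 8 / packets_per_sec


# Upper bounds on the average packet size at 1 Mbit/sec.
at_800_pps = avg_packet_bytes(1, 800)
at_1200_pps = avg_packet_bytes(1, 1200)
```

At 800 packets per second this gives 156.25 bytes, and at 1,200 packets per second roughly 104 bytes, so the traffic consisted of small messages rather than bulk transfers.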

We are planning on upgrading to 1.2.5-6 tonight, but people here want to
know more about why it is suddenly happening now.

 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, January 23, 2008 1:07 PM
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] OCFS2 DLM problems
 
 Depends on the net traffic I guess. The error returned asks the user
 to retry and the older code wasn't doing so. AFAIR, we have never
 encountered this in our main test cluster.
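
The retry-on-EAGAIN idea can be illustrated in userspace. The following is only a sketch of the general pattern, not the kernel change referenced as r3033; the function name, the retry count, and the delay are assumptions made for illustration.

```python
import errno
import time


def send_with_retry(send, payload, attempts=5, delay=0.01):
    """Call send(payload), retrying while it fails with EAGAIN.

    Any other error, or EAGAIN on the final attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return send(payload)
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            time.sleep(delay)  # brief pause before trying again
```

Code without such a loop treats a transient "try again" as a hard failure, which matches the symptom in the logs: a sendpage failure with -11 followed immediately by the connection being dropped.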
 


RE: [Ocfs2-users] Missing something basic...

2007-10-17 Thread Ulf Zimmermann
You need shared storage to use OCFS, not local storage on each server. 

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Benjamin Smith
 Sent: Wednesday, October 17, 2007 18:00
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] Missing something basic...
 
 I'm stumped. I'm doing some research on clustered file systems to be
 deployed over winter break, and am testing on spare machines first.
 
 I have two identically configured computers, each with a 10 GB
 partition, /dev/hda2. I intend to combine these two LAN/RAID1 style to
 represent 10 GB of redundant cluster storage, so that if either machine
 fails, computing can resume with reasonable efficiency.
 
 These machines are called cluster1 and cluster2, and are currently on a
 local Gb LAN. They are running CentOS 4.4 (recompile of RHEL 4.4). I've
 set up SSH RSA keys so that I can ssh directly from either to the other
 without passwords, though I use a non-standard port, defined in
 ssh_config and sshd_config.
 
 I've installed the RPMs without incident. I've set up a cluster called
 ocfs2 with nodes cluster1 and cluster2, with the corresponding LAN IP
 addresses. I've confirmed that configuration changes populate to
 cluster2 when I push the appropriate button in the X11 ocfs2console on
 cluster1. I've checked the firewall(s) to allow inbound TCP to port  connections
 on both machines, and verified this with nmap. I've also tried turning
 off iptables completely. On cluster1, I've formatted and mounted
 partition oracle to /meda/cluster using the ocfs2console and I can r/w
 to this partition with other applications. There's about a 5-second
 delay when mounting/unmounting, and the FAQ reflects that this is
 normal. SELinux is completely off.
 
 Questions:
 
 1) How do I get this oracle partition to show/mount on host cluster2,
 and subsequent systems added to the cluster? Should I be expecting a
 /dev/* block device to mount, or is there some other program I should
 be using, similar to smbmount?
 
 2) How do I get this /dev/hda2 (aka oracle) on cluster1 to combine
 (RAID1 style) with /dev/hda2 on cluster2, so that if either host goes
 down I still have a complete FS to work from? Am I mis-understanding the
 abilities and intentions of OCFS2? Do I need to do something with NBD,
 GNBD, ENDB, or similar? If so, what's the recommended approach?
 
 Thanks,
 
 -Ben
 


RE: [Ocfs2-users] Cluster setup

2007-10-12 Thread Ulf Zimmermann
You have Oracle people telling us not to use bonding.

 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent: Thursday, October 11, 2007 15:28
 To: Ulf Zimmermann
 Cc: Randy Ramsdell; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Cluster setup
 
 How is this a fs problem?
 


RE: [Ocfs2-users] Cluster setup

2007-10-11 Thread Ulf Zimmermann
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Alexei_Roudnev
 Sent: Thursday, October 11, 2007 11:10
 To: Sunil Mushran; Randy Ramsdell
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Cluster setup
 
 I explained you:
 1 - single heartbeat interface IS A BUG for me.

I haven't really followed the whole discussion but that point above did
just come to my mind a few days ago when we replaced our HP ProCurve
4108gl used for 3 separate Interconnects on 10g, where only 1 also
carries the OCFS2 heartbeat. So if that switch dies, OCFS2 will go down
while Oracle 10g could survive (if OCFS2 wouldn't die).

I have to agree that is a bad design at this point. Heartbeat should
also be on at least 2 links for OCFS2.

Ulf.




RE: [Ocfs2-users] Cluster setup

2007-10-11 Thread Ulf Zimmermann
We don't, and when we were investigating why we had reassembly problems
on the ProCurve 4108gl, we were specifically asked whether we were doing
bonding or VLAN tagging (we were doing neither). It just looks like the
ProCurves are losing packets without reporting it. We have now switched
to Cisco 2960G-48s with jumbo frames and haven't had any reassembly
timeouts since then. Global Cache timeouts have gone down significantly.
Each interconnect for Oracle 10g has its own Cisco 2960G-48 now.

 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent: Thursday, October 11, 2007 15:13
 To: Ulf Zimmermann
 Cc: Randy Ramsdell; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Cluster setup
 
 Use network bonding.
 


RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-08-16 Thread Ulf Zimmermann
 -Original Message-
 From: Mark Fasheh [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 15, 2007 16:49
 To: Ulf Zimmermann
 Cc: Sunil Mushran; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
 
 On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:
  Index 22: took 10003 ms to do waiting for write completion
  *** ocfs2 is very sorry to be fencing this system by restarting ***
 
  There were no SCSI errors on the console or logs around the time of this
  reboot.
 
 It looks like the write took too long - as a first step, you might want to
 up the disk heartbeat timeouts on those systems. Run:
 
 $ /etc/init.d/o2cb configure
 
 on each node to do that. That won't hide any hardware problems, but if the
 problem is just a latency to get the write to disk, it'd help tune it away.
   --Mark

OK, we have now had 4 reboots, plus 2 more caused by my own actions, all
due to OCFS2 fencing. As mentioned in previous emails, we were seeing
some SCSI errors, and although device-mapper-multipath seems to take care
of them, sometimes the 10 seconds configured in multipath.conf and the
default o2cb timings collide.

On the two clusters where we have run into this, I have now replaced
several fibre cables, and it seems we also have 1 bad port on one of the
fibre channel switches. Swapped the cable first: still problems. Swapped
the SFP: still problems. Moved the node to the port the replacement SFP
had been taken from: 0 errors.

Now I am still concerned about the timing of device-mapper-multipath and
o2cb. O2cb is currently set to the default of:

Specify heartbeat dead threshold (=7) [7]: 
Specify network idle timeout in ms (=5000) [1]: 
Specify network keepalive delay in ms (=1000) [5000]: 
Specify network reconnect delay in ms (=2000) [2000]:

So the timeout I seem to hit is the 10,000 of the network idle timeout,
even though this timeout occurs on the disk? What values would you
recommend I set this to?
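
For reference, the fencing message from Aug 13 reports a heartbeat write timeout of 12000 milliseconds with the dead threshold at its default of 7, and the "msleep" entries in that dump are about 2 seconds each. That is consistent with the relationship timeout = (threshold - 1) * 2 seconds, which is how the OCFS2 documentation describes the disk heartbeat as far as I recall; treat the formula as an assumption to verify against the documentation for your release. A minimal sketch with illustrative function names:

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed interval between disk heartbeat writes


def heartbeat_timeout_secs(dead_threshold):
    """Seconds of missed disk heartbeats before a node is declared dead."""
    return (dead_threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs):
    """Smallest dead threshold whose timeout is at least timeout_secs."""
    intervals = -(-timeout_secs // HEARTBEAT_INTERVAL_SECS)  # ceiling division
    return intervals + 1
```

Under this assumption the default of 7 gives 12 seconds, and tolerating a storage stall of, say, 60 seconds would need a threshold of 31. The network idle timeout is a separate setting that covers the interconnect, not the disk.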

Another question, in case someone can answer this. If I get syslog
entries like:

Aug 16 00:44:33 dbprd01 kernel: SCSI error : 1 0 0 1 return code = 0x2
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452448
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452456
Aug 16 00:44:33 dbprd01 kernel: SCSI error : 1 0 1 1 return code = 0x2
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242384
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.
Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242392
Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2

Does this actually error out all the way, or does the request still go
to one of the remaining paths? If this request doesn't error out,
because it could still be fulfilled via the 2 remaining paths, then
it is really just the timing between device-mapper-multipath recovering
this request through the remaining paths and our o2cb settings. If not,
we might still have another problem. We have seen many such errors but
only had about 8 reboots, all of which I now think are attributable to
fencing.
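
One way to check whether a multipath map ever ran out of paths during such an episode is to scan the syslog for the lowest "remaining active paths" count per map. The sketch below assumes the multipathd message format shown in the excerpt above; the function name is illustrative.

```python
import re

# Matches lines such as "... multipathd: u01: remaining active paths: 2".
PATHS_RE = re.compile(
    r"multipathd: (?P<map>\S+): remaining active paths: (?P<count>\d+)"
)


def min_active_paths(log_lines):
    """Return {map_name: lowest 'remaining active paths' count seen}."""
    lowest = {}
    for line in log_lines:
        match = PATHS_RE.search(line)
        if match:
            name = match.group("map")
            count = int(match.group("count"))
            lowest[name] = min(count, lowest.get(name, count))
    return lowest
```

If the count for a map never reaches zero, there was always a path left to retry on, and the open question is only whether the retry completed before the o2cb heartbeat timeout expired.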

Regards, Ulf.





RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-08-15 Thread Ulf Zimmermann

The SAN is a 3Par E200, which does write into cache on its two
controllers, then acknowledges a write and then writes it actually to
disk. I have not found any reason for this delay yet, so for now I am
stumped as to why the write was delayed so long.

Ulf.




RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-08-15 Thread Ulf Zimmermann
 -Original Message-
 From: Mark Fasheh [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 15, 2007 17:50
 To: Ulf Zimmermann
 Cc: Sunil Mushran; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
 
 On Wed, Aug 15, 2007 at 05:43:14PM -0700, Ulf Zimmermann wrote:
  The SAN is a 3Par E200, which does write into cache on its two
  controllers, then acknowledges a write and then writes it actually to
  disk. I have not found any reason for this delay yet, so far I am
  stumped why it had such a long delay writing.
 
 Are you saying that the controllers are doing write-back caching? If
 they're in that sort of mode, you need to change it to write-through for a
 clustered environment.
   --Mark

The controller receiving the request mirrors it to the second
controller (in this case there is only 1; there can be up to 7 others).
Then it acknowledges the request and writes it to disk. Each controller
has double batteries to be able to finish any pending writes. If a
controller fails, the remaining one will only acknowledge a write after
it is physically on disk. This is part of how the 3Par operates. I have
submitted a request to 3Par to check the extensive logs they generate to
see if there is anything which can explain this write delay.

The previous reboots we had, for which we have no console logs, may have
been OCFS2 fencing or something else. All of them happened while the
cluster was pretty much idle, while this time there was activity
(an import). Monday's reboot was the first since the initial 4 reboots. I
wish OCFS2 would log to more than just the console, so we had
evidence of the other reboots.

Ulf.




RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-08-15 Thread Ulf Zimmermann
 -Original Message-
 From: Mark Fasheh [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 15, 2007 18:04
 To: Alexei_Roudnev
 Cc: Ulf Zimmermann; Sunil Mushran; ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
 
 On Wed, Aug 15, 2007 at 05:52:49PM -0700, Alexei_Roudnev wrote:
  ANY SCSI controller can quietly delay IO for 10 - 20 seconds, without
  errors and explanations. 10 seconds threshold in OCFSv2 will never work
  properly.
 
 That has nothing to do with what I'm asking him.
 
 Ulf described his controller thusly:
 
   does write into cache on its two controllers, then acknowledges a
   write and then writes it actually to disk.
 
 I'm keying in on the part where it acknowledges a write (presumably to the
 host os) and _then_ pushes that write out to the disk. In general, that's
 the wrong order ;)
 
 
 Anyway, getting back to the task of trying to fix someone's problem, I
 admit that I don't really know whether it's possible for a controller to do
 writeback caching, I'm just trying to clarify what's going on, that's all.
   --Mark

I primarily posted the messages as a follow-up for now. I am waiting for
3Par to tell me if they have anything in the logs before I decide on
further steps, i.e. raising the write timeout or not. The first 4
reboots we had, which may or may not have been OCFS2, happened on our
3Par S400, which has 16GB of cache per controller. The last reboot, for
which I do have the console messages (thanks HP for iLO and virtual
serial plus Conserver :-) ), happened on our E200, which has 8GB of
cache per controller.

We also have some SCSI errors on some nodes and I am currently awaiting
a maintenance window to replace two FC cables to see if that clears up
the errors.

As you can see, all kinds of things are unfortunately going on. And I am
officially on vacation right now too. Sigh.

Ulf.




RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-08-13 Thread Ulf Zimmermann
One node of our 4-node cluster rebooted last night:

(11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-1 after 12000 milliseconds
Heartbeat thread (11) printing last 24 blocking operations (cur = 22):
Heartbeat thread stuck at waiting for write completion, stuffing current time into that blocker (index 22)
Index 23: took 0 ms to do checking slots
Index 0: took 1 ms to do waiting for write completion
Index 1: took 1997 ms to do msleep
Index 2: took 0 ms to do allocating bios for read
Index 3: took 0 ms to do bio alloc read
Index 4: took 0 ms to do bio add page read
Index 5: took 0 ms to do submit_bio for read
Index 6: took 8 ms to do waiting for read completion
Index 7: took 0 ms to do bio alloc write
Index 8: took 0 ms to do bio add page write
Index 9: took 0 ms to do submit_bio for write
Index 10: took 0 ms to do checking slots
Index 11: took 0 ms to do waiting for write completion
Index 12: took 1992 ms to do msleep
Index 13: took 0 ms to do allocating bios for read
Index 14: took 0 ms to do bio alloc read
Index 15: took 0 ms to do bio add page read
Index 16: took 0 ms to do submit_bio for read
Index 17: took 7 ms to do waiting for read completion
Index 18: took 0 ms to do bio alloc write
Index 19: took 0 ms to do bio add page write
Index 20: took 0 ms to do submit_bio for write
Index 21: took 0 ms to do checking slots
Index 22: took 10003 ms to do waiting for write completion
*** ocfs2 is very sorry to be fencing this system by restarting ***

There were no SCSI errors on the console or logs around the time of this
reboot.
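
When several of these dumps need comparing, the slowest steps can be pulled out mechanically. This is a minimal sketch that assumes the "Index N: took X ms to do Y" line format shown above; the function name is illustrative.

```python
import re

# Matches lines such as "Index 22: took 10003 ms to do waiting for write completion".
INDEX_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def slowest_operations(dump_text, limit=3):
    """Return up to `limit` (index, milliseconds, operation) tuples, slowest first."""
    operations = [
        (int(ms), int(index), what.strip())
        for index, ms, what in INDEX_RE.findall(dump_text)
    ]
    operations.sort(reverse=True)
    return [(index, ms, what) for ms, index, what in operations[:limit]]
```

Applied to the dump above, the top entry is index 22, the 10003 ms wait for write completion, followed by the two msleep entries of about 2 seconds each, which are the normal pauses between heartbeat iterations rather than stalls.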

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Ulf Zimmermann
 Sent: Monday, July 30, 2007 11:11
 To: Sunil Mushran
 Cc: ocfs2-users@oss.oracle.com
 Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
 
 Too early to call. Management made the call: "This hardware seems to
 have been stable, let's use it."
 
  -Original Message-
  From: Sunil Mushran [mailto:[EMAIL PROTECTED]
  Sent: Monday, July 30, 2007 11:07
  To: Ulf Zimmermann
  Cc: ocfs2-users@oss.oracle.com
  Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
 
  So are you suggesting the reason was bad hardware?
  Or, is it too early to call?
 
  
   Ulf Zimmermann wrote:
  
   We just installed a new cluster with 6 HP DL380g5, dual single port
   Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a
   3Par S400. We are using the 3Par recommended config for the Qlogic
   driver and device-mapper-multipath giving us 4 paths to the SAN. We do
   see some SCSI errors where DM-MP is failing a path after getting a
   0x2000 error from the SAN controller, but the path gets put back in
   service in less than 10 seconds.

   This needs to be fixed but I don't think it is what is causing our
   reboots. 2 of the nodes rebooted once while being idle (ocfs2 and
   clusterware were running, no db) and one node rebooted while idle
   (another node was copying using fscat our 9i db from ocfs1 to the ocfs2
   data volume) and once while some load was put on it via the upgraded
   10g database. In all cases it is as if someone pressed a hardware reset
   button. No kernel panic (at least not one leading to a stop with a
   visible message); we can get a dirty write cache for the internal cciss
   controller.

   The only messages we get on the nodes are when the crashed node is
   already in reset and it missed its ocfs2 heartbeat (set to the default
   of 7), followed later by crs moving the vip.
  
   Any hints on troubleshooting this would be appreciated.
  
   Regards, Ulf.
  
  
   --
   Sent from my BlackBerry Wireless Handheld
  
  
  
  
  


  

___
Ocfs2-users mailing list
Ocfs2-users

RE: [Ocfs2-users] 6 node cluster with unexplained reboots

2007-07-30 Thread Ulf Zimmermann
I have serial console setup with logging via conserver but so far no
further crash. We also swapped hardware a bit around (another 4 node
cluster with DL360g5 was working without crash for several weeks, we
swapped those 4 nodes in for the first 4 in the 6 node cluster).

 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent: Monday, July 30, 2007 10:21
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
 
 Do you have a netconsole setup? If not, set it up. That will capture the
 real reason for the reset. Well, it typically does.
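
A netconsole target only needs something listening for UDP datagrams on the receiving host. The following is a minimal Python sketch of such a receiver, not the tooling used in this thread. The port 6666 (the usual netconsole target default, treated here as an assumption), the log file name, and the function names are all illustrative.

```python
import socket
from datetime import datetime, timezone


def format_entry(sender_ip: str, payload: bytes, when: datetime) -> str:
    """Turn one received datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{when.isoformat()} {sender_ip} {text}"


def listen(bind_ip: str = "0.0.0.0", port: int = 6666,
           log_path: str = "netconsole.log") -> None:
    """Append every datagram received on (bind_ip, port) to log_path.

    Runs until interrupted. Port and path are assumptions for illustration.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_ip, port))
        with open(log_path, "a", encoding="utf-8") as log:
            while True:
                payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
                now = datetime.now(timezone.utc)
                log.write(format_entry(sender_ip, payload, now) + "\n")
                # Flush per message: the sending node may reset at any moment.
                log.flush()
```

Calling `listen()` on a separate logging host would collect whatever the rebooting node manages to send before the reset. Because the transport is UDP, the last messages can still be lost, which matches the "typically" caveat above.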
 
 Ulf Zimmermann wrote:
  We just installed a new cluster with 6 HP DL380g5, dual single port
  Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a
  3Par S400. We are using the 3Par recommended config for the Qlogic
  driver and device-mapper-multipath giving us 4 paths to the SAN. We do
  see some SCSI errors where DM-MP is failing a path after getting a
  0x2000 error from the SAN controller, but the path gets put back in
  service in less than 10 seconds.

  This needs to be fixed but I don't think it is what is causing our
  reboots. 2 of the nodes rebooted once while being idle (ocfs2 and
  clusterware were running, no db) and one node rebooted while idle
  (another node was copying using fscat our 9i db from ocfs1 to the ocfs2
  data volume) and once while some load was put on it via the upgraded
  10g database. In all cases it is as if someone pressed a hardware reset
  button. No kernel panic (at least not one leading to a stop with a
  visible message); we can get a dirty write cache for the internal cciss
  controller.

  The only messages we get on the nodes are when the crashed node is
  already in reset and it missed its ocfs2 heartbeat (set to the default
  of 7), followed later by crs moving the vip.
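
The heartbeat setting of 7 mentioned above can be read as a timeout. Assuming the rule given in the OCFS2 FAQ, where the O2CB disk heartbeat threshold counts 2-second iterations and the timeout is (threshold - 1) * 2 seconds, the arithmetic can be sketched as follows. The function names are illustrative, and the 2-second interval comes from that FAQ rather than from this thread.

```python
# Assumption: the o2cb disk heartbeat runs once every 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2


def fence_timeout_secs(threshold: int) -> int:
    """Seconds a node may miss disk heartbeats before it is declared dead."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs: int) -> int:
    """Inverse mapping: the threshold needed for a desired timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1
```

Under that rule a threshold of 7 gives 12 seconds, which is shorter than many multipath failover windows, so it may be worth comparing against the path recovery time reported above.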
 
  Any hints on troubleshooting this would be appreciated.
 
  Regards, Ulf.
 
 
  --
  Sent from my BlackBerry Wireless Handheld
 
 
 

 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


RE: [Ocfs2-users] Adding new nodes to OCFS2?

2007-07-09 Thread Ulf Zimmermann
I actually did the below command (for node 2 and 3). It added them to
/etc/ocfs2/cluster.conf, but as far as I could tell, it didn't allow me to
actually mount them on node 2 or 3. But as I had to do some other work
(resize another volume) I ended up rebooting and got them added that
way.

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Nuno Fernandes
 Sent: Monday, July 09, 2007 02:58
 To: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Adding new nodes to OCFS2?
 
 Do:
 
 o2cb_ctl -C -i -n dbcl1n5 -t node -a number=4 -a ip_address=192.168.201.5 -a ip_port= -a cluster=dbcl1

 on all nodes. It automatically updates cluster.conf and adds the nodes
 on the fly.
 
 Check faq and make sure you understand all command line options.
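
Since the same o2cb_ctl invocation has to be repeated on every node, it can help to generate it from one place. The sketch below only builds the argument list in the shape of the command above; it does not run anything. The function name is hypothetical, and the port is a required parameter because its value is not given in this post.

```python
from typing import List, Union


def add_node_command(name: str, number: int, ip_address: str,
                     ip_port: Union[int, str], cluster: str) -> List[str]:
    """Build the o2cb_ctl argument list for adding one node to a live cluster.

    Mirrors the flags in the post: -C (create), -i (apply to the running
    cluster as well as cluster.conf), -n (name), -t (object type), -a (attribute).
    """
    attributes = [
        ("number", number),
        ("ip_address", ip_address),
        ("ip_port", ip_port),  # must be supplied by the caller
        ("cluster", cluster),
    ]
    argv = ["o2cb_ctl", "-C", "-i", "-n", name, "-t", "node"]
    for key, value in attributes:
        argv += ["-a", f"{key}={value}"]
    return argv
```

The returned list could be passed to `subprocess.run(argv, check=True)` on each node, using whatever port the existing stanzas in cluster.conf already specify.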
 
 Rgds
 ./npf
 
 On Sunday 08 July 2007 04:18:51 Ulf Zimmermann wrote:
  I looked around, found older post which seems not applicable anymore. I
  have a cluster of 2 nodes right now, which has 3 OCFS2 file systems. All
  the file systems were formatted with 4 node slots. I added the two new
  nodes (by hand, by ocfs2console and o2cb_ctl), so my
  /etc/ocfs2/cluster.conf looks right:
 
  node:
  ip_port = 
  ip_address = 192.168.201.1
  number = 0
  name = dbcl1n1
  cluster = dbcl1
 
  node:
  ip_port = 
  ip_address = 192.168.201.2
  number = 1
  name = dbcl1n2
  cluster = dbcl1
 
  node:
  ip_port = 
  ip_address = 192.168.201.3
  number = 2
  name = dbcl1n3
  cluster = dbcl1
 
  node:
  ip_port = 
  ip_address = 192.168.201.4
  number = 3
  name = dbcl1n4
  cluster = dbcl1
 
  cluster:
  node_count = 4
  name = dbcl1
 
  But is there a way to get node 0 and 1 to dynamically accept the
  addition of node 2 and 3? Everything I find seems to indicate I have to
  unmount, run /etc/init.d/ocfs2 stop, /etc/init.d/o2cb restart and then
  /etc/init.d/ocfs2 start. Is there no way of telling o2cb there are two
  new nodes? Like a /etc/init.d/o2cb reconfigure?
 
 
-
  ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
  4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
 
-
 
 
 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


[Ocfs2-users] Adding new nodes to OCFS2?

2007-07-07 Thread Ulf Zimmermann
I looked around, found older post which seems not applicable anymore. I
have a cluster of 2 nodes right now, which has 3 OCFS2 file systems. All
the file systems were formatted with 4 node slots. I added the two new
nodes (by hand, by ocfs2console and o2cb_ctl), so my
/etc/ocfs2/cluster.conf looks right:

node:
ip_port = 
ip_address = 192.168.201.1
number = 0
name = dbcl1n1
cluster = dbcl1

node:
ip_port = 
ip_address = 192.168.201.2
number = 1
name = dbcl1n2
cluster = dbcl1

node:
ip_port = 
ip_address = 192.168.201.3
number = 2
name = dbcl1n3
cluster = dbcl1

node:
ip_port = 
ip_address = 192.168.201.4
number = 3
name = dbcl1n4
cluster = dbcl1

cluster:
node_count = 4
name = dbcl1

But is there a way to get node 0 and 1 to dynamically accept the
addition of node 2 and 3? Everything I find seems to indicate I have to
unmount, run /etc/init.d/ocfs2 stop, /etc/init.d/o2cb restart and then
/etc/init.d/ocfs2 start. Is there no way of telling o2cb there are two
new nodes? Like a /etc/init.d/o2cb reconfigure?
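
Whether the file "looks right" can at least be checked mechanically before worrying about the running cluster. The following is a rough sketch that parses stanzas in the layout shown above and compares node_count against the nodes actually listed. The function names are hypothetical, and it says nothing about what the live o2cb stack has loaded.

```python
from typing import Dict, List, Tuple

Stanza = Tuple[str, Dict[str, str]]


def parse_cluster_conf(text: str) -> List[Stanza]:
    """Parse cluster.conf text into (stanza_type, attributes) pairs."""
    stanzas: List[Stanza] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "node:" or "cluster:" starts a new stanza.
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, _, value = line.partition("=")
            stanzas[-1][1][key.strip()] = value.strip()
    return stanzas


def check_cluster(stanzas: List[Stanza]) -> List[str]:
    """Return a list of problems found; an empty list means consistent."""
    problems: List[str] = []
    nodes = [attrs for kind, attrs in stanzas if kind == "node"]
    for kind, cluster in stanzas:
        if kind != "cluster":
            continue
        name = cluster.get("name", "")
        members = [n for n in nodes if n.get("cluster") == name]
        declared = int(cluster.get("node_count", "0"))
        if declared != len(members):
            problems.append(
                f"{name}: node_count={declared} but {len(members)} nodes listed")
        numbers = [n.get("number") for n in members]
        if len(set(numbers)) != len(numbers):
            problems.append(f"{name}: duplicate node numbers")
    return problems
```

Running this on each node's copy and comparing the results would catch a file that differs between nodes, which is one common reason a new node cannot join.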

-
ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-

 

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


RE: [Ocfs2-users] Hi

2007-05-07 Thread Ulf Zimmermann

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:ocfs2-users-
 [EMAIL PROTECTED] On Behalf Of Sunil Mushran
 Sent: 05/07/2007 10:47
 To: Alexei_Roudnev
 Cc: Ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Hi
 
 None of what you have written allows you to use our resources to
 spread your opinions as official recommendation.
 
 Alexei_Roudnev wrote:
 
  Oracle itself have not a SINGLE opinion (to be curious, I hear a strong
  recommendation against OCFSv2 from oracle support, which I can not agree
  with), so we can't treat your recommendations as official as well - you
  are interested in OCFSv2 while users are not (users are interested in
  making our data centers run smoothly). The only _official_ thing is
  _certification matrix_.
 
  - Original Message -
  Alexei,
  While you are free to use this forum to share your opinions, do not
  couch these opinions as official recommendations. When push comes to
  shove, we are helping users not you. We develop, build, distribute
  the software, not you. So it may serve to community better if you
  let us offer the official recommendations and not you.
 
  Sunil

Just to add some comments from a user of Oracle 9i with OCFSv1 on RedHat
AS2.1 who tried to upgrade to EL4 and OCFSv2 and failed miserably:

Oracle support pretty much told us the problems we were running into are
problems of OCFSv2 and they weren't really willing to help us. The
feeling we were getting was that two Oracle departments (the one writing
the Database RAC engine and the one writing OCFSv2) are fighting with
each other.

In general I have a very low opinion of Oracle and their quality of code
and tools. Like patch revision numbering? Does not exist. Patch tools
supposed to patch all machines in clusters? You wish. Decent error
messages? They never heard about that.

We ended up with staying on AS2.1 and OCFSv1 for now and just migrating
our data to a new SAN.

Regards, Ulf.

-
ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


RE: [Ocfs2-users] Some questions about ocfs2

2007-04-25 Thread Ulf Zimmermann
 -Original Message-
 From: Sunil Mushran [mailto:[EMAIL PROTECTED]
 Sent: 04/25/2007 12:16
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Some questions about ocfs2
 
 What's the blocksize?

Block Size Bits: 12   Cluster Size Bits: 14

 
 Ulf Zimmermann wrote:
  -Original Message-
  From: Sunil Mushran [mailto:[EMAIL PROTECTED]
  Sent: 04/25/2007 10:31
  To: Ulf Zimmermann
  Cc: ocfs2-users@oss.oracle.com
  Subject: Re: [Ocfs2-users] Some questions about ocfs2
 
  # debugfs.ocfs2 -R "stats -h" /dev/sdy2 | grep "Cluster Size"
  Block Size Bits: 12   Cluster Size Bits: 17
  12 = 4K
  17 = 128K
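
The "bits" values are base-2 exponents, so the size in bytes is simply 2 raised to that number. A small sketch of the conversion, with illustrative function names:

```python
def size_from_bits(bits: int) -> int:
    """Convert a 'Size Bits' value from debugfs.ocfs2 into bytes."""
    return 1 << bits


def human_size(num_bytes: int) -> str:
    """Format a power-of-two byte count as e.g. '4K' or '128K'."""
    for unit in ("", "K", "M", "G"):
        if num_bytes < 1024 or unit == "G":
            return f"{num_bytes}{unit}"
        num_bytes //= 1024
    return f"{num_bytes}G"  # not reached; keeps the return type total
```

By the same rule, the Cluster Size Bits value of 14 reported for the volume in this thread corresponds to a 16K cluster size.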
 
  Have you tried stracing the process?
  # strace -tt -T  -o /tmp/strace.out  ...
 
 
  Yes, strace shows most time is spent in lstat64 (99%), where
  average execution time on ext3 is 60 usecs/call while on the ocfs2
  volume it is 500 usecs/call.
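
To put those per-call figures in perspective, they can be multiplied out over the roughly 9 million files mentioned in the original question. This is a back-of-the-envelope sketch that assumes one lstat64 call per file per rsync run, which is a simplification; the function name is illustrative.

```python
def total_stat_seconds(file_count: int, usecs_per_call: float,
                       calls_per_file: int = 1) -> float:
    """Total wall time spent in stat calls, in seconds."""
    return file_count * calls_per_file * usecs_per_call / 1_000_000


FILES = 9_000_000  # "around 9 million image files" from the question

ext3_minutes = total_stat_seconds(FILES, 60) / 60    # 540 s  -> 9 minutes
ocfs2_minutes = total_stat_seconds(FILES, 500) / 60  # 4500 s -> 75 minutes
```

Under that assumption the stat pass alone accounts for about 75 minutes on ocfs2 versus about 9 on ext3, so it explains only part of the ~6 hour run; the rest would have to come from the writes and I/O wait.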
 
 
  Ulf Zimmermann wrote:
 
  Is there a way to see how a file system was formatted, i.e. the block
  size and cluster size? I currently have a 2TB file system, of which
  about 840GB are in use by around 9 million image files. Average size of
  these images is 60-100KB. Currently our production servers still have
  separate file systems on ext3 and we are doing nightly rsync from there
  to this ocfs2 volume. This currently takes ~6 hours, which seems a tad
  slow. The system spends most time during writing files which have
  changed on the production servers, with high I/O wait.

  The SAN this ocfs2 volume is on is pretty much idle, I only see up to
  about 20MB/sec traffic and the two nodes which have this volume mounted
  have a private GigE interconnect setup for cluster.conf.

  Any tips on how to debug where this slowness comes from? Or even
  suggestion to use another cluster file system for a scenario like this.

Regards, Ulf.

-
ATC-Onlane Inc., T: 650-532-6382, F: 650-532-6441
4600 Bohannon Drive, Suite 100, Menlo Park, CA 94025
-

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users