Re: [Ocfs2-users] [Ocfs2-devel] size increase

2015-03-17 Thread Sunil Mushran
This is because you are specifying a 128k cluster size. Refer to man
mkfs.ocfs2 for more.
On Mar 17, 2015 8:04 PM, Umarzuki Mochlis umarz...@gmail.com wrote:

 Hi,

 What I meant by total size is output of 'du -hs'

 I can see output of fdisk on mpath1 of ocfs2 LUN similar to logical
 volume of ext4 partition (255 head  63 sectors)

 It is a 2 nodes ocfs cluster.

 2015-03-18 10:50 GMT+08:00 Xue jiufei xuejiu...@huawei.com:
  Hi Umarzuki,
  What is the meaning of total size, file size or disk usage?
  If you means the disk usage, I think maybe the difference of
  cluster size(the minimum allocation unit) is the case.
  Have you notice the cluster size or block size of your ocfs2
  and ext4 filesystem?
 
  Thanks,
  Xuejiufei
 

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 “Heartbeat generation mismatch on device” error when mounting iscsi target

2015-02-09 Thread Sunil Mushran
If ps aux|grep o2hb does not return anything, means you are using local
heartbeat.

That means you have mismatching ocfs2.conf file. And I suspect the node
where this is failing is the one that has the bad ocfs2.conf file. Compare
the config files from all the nodes and ensure it is the same. Or you could
simply replace the one on the failing node from another node. The file
should be the same everywhere. Remember to restart the cluster on that node.

On Mon, Feb 9, 2015 at 2:27 PM, Danijel Krmar 
danijel.kr...@activecollab.com wrote:

 No, nothing there:
 $ ps aux | grep o2hb
 root  5724  0.0  0.0   8320   888 pts/0S+   22:30
 http://airmail.calendar/2015-02-09%2022:30:00%20GMT+1   0:00
 http://airmail.calendar/2015-02-09%2000:00:00%20GMT+1 grep --color o2hb

 Still the same error if i try to mount the iSCSI disk:
 o2hb_check_own_slot:590 ERROR: Heartbeat generation mismatch on device
 (sdb): expected(2:0x2f32486d4c54730a, 0x54d926d7),
 ondisk(2:0xb016e6a72676a791, 0x54d926d7)

 As said, there are no such problems on other machines, just this one. I
 can’t get my head around this Heartbeat generation mismatch” error
 message.

 --
 Danijel Krmar
 A51 D.O.O.
 Novi Sad
 https://www.activecollab.com/

 On February 9, 2015 at 8:09:06 PM, Sunil Mushran (sunil.mush...@gmail.com)
 wrote:

 On node 2, do:
 ps aux | grep o2hb

 I suspect you have multiple o2hb threads running. If so, restart the o2cb
 cluster on that node.

 On Mon, Feb 9, 2015 at 10:08 AM, Danijel Krmar 
 danijel.kr...@activecollab.com wrote:

   As said in the title, when I want to mount a iSCSI target on one
 machine I get the following error:

 (o2hb-3F92114867,7826,3):o2hb_check_own_slot:590 ERROR: Heartbeat generation 
 mismatch on device (sdb): expected(2:0xa0cf28215b4b1ed3, 0x54d8a036), 
 ondisk(2:0xb016e6a72676a791, 0x54d8a037)

  The same iSCSI target is working on other machines.

 Any idea what this error means?

  --
 Danijel Krmar
  A51 D.O.O.
  Novi Sad
  https://www.activecollab.com/

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] How to unlock a bloked resource? Thanks

2014-09-10 Thread Sunil Mushran
What is the output of the commands? The protocol is supposed to do the
unlocking on its own. See what is it blocked on. It could be that the node
that has the lock cannot unlock it because it cannot flush the journal to
disk.

On Tue, Sep 9, 2014 at 7:55 PM, Guozhonghua guozhong...@h3c.com wrote:

  Hi All:



 As we test with two node in one OCFS2 cluster.

 The cluster is hang up may be for dead lock.

 We use the debugfs.ocfs tool founding that one resource is holding by one
 node who has it for long time and another node can still wait for the
 resource.

 So the cluster is hang up.



 debugfs.ocfs2 -R fs_locks -B /dev/dm-0

 debugfs.ocfs2 -R dlm_locks LOCKID_XXX /dev/dm-0



 How to unlock the lock held by the node? Is there some commands to unlock
 the resource?



 Thanks.

 -
 本邮件及其附件含有杭州华三通信技术有限公司的保密信息,仅限于发送给上面地址中列出
 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
 邮件!
 This e-mail and its attachments contain confidential information from H3C,
 which is
 intended only for the person or entity whose address is listed above. Any
 use of the
 information contained herein in any way (including, but not limited to,
 total or partial
 disclosure, reproduction, or dissemination) by persons other than the
 intended
 recipient(s) is prohibited. If you receive this e-mail in error, please
 notify the sender
 by phone or email immediately and delete it!

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] FSCK may be failing and corrupting my disk???

2014-03-24 Thread Sunil Mushran
fsck cannot determine which of the two inodes is incorrect. In such cases,
fsck makes a copy of one of the inodes (with data) and asks the user to
delete the bad file after mounting.


On Sun, Mar 23, 2014 at 7:18 AM, Eric Raskin eras...@paslists.com wrote:

  I did some more research by running a fsck -fn.  Basically it is one
 inode that is wrong and needs to be cleared.  Is there a way to do that via
 debugfs?  If I can delete that one inode, then all the doubly-linked
 clusters will not be doubly linked any more and all of the errors will go
 away.

 Isn't that quicker than cloning a bad inode?


 On 03/22/2014 09:40 PM, Sunil Mushran wrote:

 Cloning the inode means inode + data. Let it finish.


 On Sat, Mar 22, 2014 at 3:44 PM, Eric Raskin eras...@paslists.com wrote:

  Hi:

 I am running a two-node Oracle VM Server 2.2.2 installation.   We were
 having some strange problems creating new virtual machines, so I shut down
 the systems and unmounted the OVS Repository (ocfs2 file system on
 Equallogic equipment).

 I ran a fsck -y first, which replayed the logs and said all was clean.
 But, I am pretty sure there are other issues, so I started an fsck -fy

 One of the messages I got was:

 Cluster 161213953 is claimed by the following inodes:
   76289548
   /running_pool/450_gebidb/System.img
 [DUP_CLUSTERS_CLONE] Inode (null) may be cloned or deleted to break the
 claim it has on its clusters. Clone inode (null) to break claims on
 clusters it shares with other inodes? y

 I then watched with an strace -p fsck process to see what was
 happening, since it was taking a long time with no messages.  I see:

 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0+\3H\26O}\306\374\0\0\0\0\0..., 4096,
 10465599488) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0/\3H\26O}\302\374\0\0\0\0\0..., 4096,
 10465583104) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4Q\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 INODE01\0H\26O}\0\0L\0\0\0\0\0\24\346\17\0\0\0\0\0\0\0\0\0..., 4096,
 2686701568) = 4096
 pwrite64(3, GROUP01\0\300\17\0~\3\0#\0H\26O}\0\0\0\0\0n\0\1\0\0\0\0...,
 4096, 100940120064) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\7\0\0\0\0\0\0\6\0\30\0\0\0\0\0\0\0\0..., 4096,
 45056) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0\272\2H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4O\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096

 This is going on and on.  It looks like it is writing lots of entries to
 fix one duplicate inode???

 At this point, I have aborted the fsck, as I am worried that it is
 completely trashing our OVS repository disk.

 Can anybody shed some light on this before I restart the fsck?  We need
 to be back up and running ASAP!

 Thanks in advance!
 --

 ---
   Eric H. Raskin 914-765-0500 x120 914-765-0500%20x120  Professional
 Advertising Systems Inc. 914-765-0503 fax  200 Business Park Dr Suite 304
 eras...@paslists.com  Armonk, NY 10504 http://www.paslists.com

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users



 --

 ---
   Eric H. Raskin 914-765-0500 x120  Professional Advertising Systems Inc.
 914-765-0503 fax  200 Business Park Dr Suite 304 eras...@paslists.com  Armonk,
 NY 10504 http://www.paslists.com

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] FSCK may be failing and corrupting my disk???

2014-03-22 Thread Sunil Mushran
Cloning the inode means inode + data. Let it finish.


On Sat, Mar 22, 2014 at 3:44 PM, Eric Raskin eras...@paslists.com wrote:

  Hi:

 I am running a two-node Oracle VM Server 2.2.2 installation.   We were
 having some strange problems creating new virtual machines, so I shut down
 the systems and unmounted the OVS Repository (ocfs2 file system on
 Equallogic equipment).

 I ran a fsck -y first, which replayed the logs and said all was clean.
 But, I am pretty sure there are other issues, so I started an fsck -fy

 One of the messages I got was:

 Cluster 161213953 is claimed by the following inodes:
   76289548
   /running_pool/450_gebidb/System.img
 [DUP_CLUSTERS_CLONE] Inode (null) may be cloned or deleted to break the
 claim it has on its clusters. Clone inode (null) to break claims on
 clusters it shares with other inodes? y

 I then watched with an strace -p fsck process to see what was happening,
 since it was taking a long time with no messages.  I see:

 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0+\3H\26O}\306\374\0\0\0\0\0..., 4096,
 10465599488) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0/\3H\26O}\302\374\0\0\0\0\0..., 4096,
 10465583104) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4Q\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 INODE01\0H\26O}\0\0L\0\0\0\0\0\24\346\17\0\0\0\0\0\0\0\0\0..., 4096,
 2686701568) = 4096
 pwrite64(3, GROUP01\0\300\17\0~\3\0#\0H\26O}\0\0\0\0\0n\0\1\0\0\0\0...,
 4096, 100940120064) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\7\0\0\0\0\0\0\6\0\30\0\0\0\0\0\0\0\0..., 4096,
 45056) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0\272\2H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096,
 10465558528) = 4096
 pwrite64(3,
 GROUP01\0\300\17\0\4O\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096,
 10462699520) = 4096
 pwrite64(3,
 INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096,
 90112) = 4096

 This is going on and on.  It looks like it is writing lots of entries to
 fix one duplicate inode???

 At this point, I have aborted the fsck, as I am worried that it is
 completely trashing our OVS repository disk.

 Can anybody shed some light on this before I restart the fsck?  We need to
 be back up and running ASAP!

 Thanks in advance!
 --

 ---
   Eric H. Raskin 914-765-0500 x120  Professional Advertising Systems Inc.
 914-765-0503 fax  200 Business Park Dr Suite 304 eras...@paslists.com  Armonk,
 NY 10504 http://www.paslists.com

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] How to break out the unstop loop in the recovery thread? Thanks a lot.

2013-11-01 Thread Sunil Mushran
It is encountering scsi errrors reading the device. Fixing that will fix
the issue.

If you want to stop the logging, I don't believe there is a method right
now. But i could be trivially added.
Allow user to disable mlog(ML_ERROR) logging.



On Thu, Oct 31, 2013 at 7:38 PM, Guozhonghua guozhong...@h3c.com wrote:

  Hi everyone,



 I have one OCFS2 issue.

 The OS is Ubuntu, using linux kernel is 3.2.50.

 There are three node in the OCFS2 cluster, and all the node is using the
 iSCSI SAN of HP 4330 as the storage.

 As the storage restarted, there were two node restarted for fence without
 heartbeating writting on to the storage.

 But the last one does not restart, and it still write error message into
 syslog as below:



 Oct 30 02:01:01 server177 kernel: [25786.227598]
 (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5

 Oct 30 02:01:01 server177 kernel: [25786.227615]
 (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5

 Oct 30 02:01:01 server177 kernel: [25786.227631]
 (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5

 Oct 30 02:01:01 server177 kernel: [25786.227648]
 (ocfs2rec,14787,13):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering
 node 2 on device (8,32)!

 Oct 30 02:01:01 server177 kernel: [25786.227670]
 (ocfs2rec,14787,13):__ocfs2_recovery_thread:1359 ERROR: Volume requires
 unmount.

 Oct 30 02:01:01 server177 kernel: [25786.227696] sd 4:0:0:0: [sdc]
 Unhandled error code

 Oct 30 02:01:01 server177 kernel: [25786.227707] sd 4:0:0:0: [sdc]
 Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK

 Oct 30 02:01:01 server177 kernel: [25786.227726] sd 4:0:0:0: [sdc] CDB:
 Read(10): 28 00 00 00 13 40 00 00 08 00

 Oct 30 02:01:01 server177 kernel: [25786.227792] end_request: recoverable
 transport error, dev sdc, sector 4928

 Oct 30 02:01:01 server177 kernel: [25786.227812]
 (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5

 Oct 30 02:01:01 server177 kernel: [25786.227830]
 (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5

 Oct 30 02:01:01 server177 kernel: [25786.227848]
 (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5


 ...

 Oct 30 06:48:41 server177 kernel: [43009.457816] sd 4:0:0:0: [sdc]
 Unhandled error code

 Oct 30 06:48:41 server177 kernel: [43009.457826] sd 4:0:0:0: [sdc]
 Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK

 Oct 30 06:48:41 server177 kernel: [43009.457843] sd 4:0:0:0: [sdc] CDB:
 Read(10): 28 00 00 00 13 40 00 00 08 00

 Oct 30 06:48:41 server177 kernel: [43009.457911] end_request: recoverable
 transport error, dev sdc, sector 4928

 Oct 30 06:48:41 server177 kernel: [43009.457930]
 (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5

 Oct 30 06:48:41 server177 kernel: [43009.457946]
 (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5

 Oct 30 06:48:41 server177 kernel: [43009.457960]
 (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5

 Oct 30 06:48:41 server177 kernel: [43009.457975]
 (ocfs2rec,14787,9):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering
 node 2 on device (8,32)!

 Oct 30 06:48:41 server177 kernel: [43009.457996]
 (ocfs2rec,14787,9):__ocfs2_recovery_thread:1359 ERROR: Volume requires
 unmount.

 Oct 30 06:48:41 server177 kernel: [43009.458021] sd 4:0:0:0: [sdc]
 Unhandled error code

 Oct 30 06:48:41 server177 kernel: [43009.458031] sd 4:0:0:0: [sdc]
 Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK

 Oct 30 06:48:41 server177 kernel: [43009.458049] sd 4:0:0:0: [sdc] CDB:
 Read(10): 28 00 00 00 13 40 00 00 08 00

 Oct 30 06:48:41 server177 kernel: [43009.458117] end_request: recoverable
 transport error, dev sdc, sector 4928

 Oct 30 06:48:41 server177 kernel: [43009.458137]
 (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5

 Oct 30 06:48:41 server177 kernel: [43009.458153]
 (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5

 Oct 30 06:48:41 server177 kernel: [43009.458168]
 (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5


 .

 .. The same log message as before, and the syslog is very large, it
 can occupy all the capacity remains on the disk...



 So as the syslog file size increases quikly, and is very large and it
 occupy all the capacity of the system directory / remains.

 So the host is blocked and not any response.



 According to the log as before, In the function __ocfs2_recovery_thread,
 there may be an un-stop loop which result in the super-large syslog file.

 __ocfs2_recovery_thread

 {

 

 while (rm-rm_used) {

………

status = ocfs2_recover_node(osb, node_num, slot_num);

 skip_recovery:

 if (!status) {

 

Re: [Ocfs2-users] How do I check fragmentation amount?

2013-11-01 Thread Sunil Mushran
debugfs.ocfs2 -R frag filespec DEVICE will show you the fragmentation
level on an inode basis. You could run that for all inodes and figure out
the value for the entire volume.


On Fri, Nov 1, 2013 at 3:00 PM, Andy ary...@allantgroup.com wrote:

 How can I check the amount on fragmentation on an OCFS2 volume?

 Thanks,

 Andy

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 tuning, fragmentation and localalloc option. Cluster hanging during mix read+write workloads

2013-08-06 Thread Sunil Mushran
If the storage connectivity is not stable, then dlm issues are to be
expected.
In this case, the processes are all trying to take the readlock. One
possible
scenario is that the node holding the writelock is not able to relinquish
the lock
because it cannot flush the updated inodes to disk. I would suggest you look
into load balancing and how it affects the iscsi connectivity from the
hosts.


On Tue, Aug 6, 2013 at 2:51 PM, Gavin Jones gjo...@where2getit.com wrote:

 Hello Goldwyn,

 Thanks for taking a look at this.  So, then, it does seem to be DLM
 related.  We were running fine for a few weeks and then it came up
 again this morning and has been going on throughout the day.

 Regarding the DLM debugging, I allowed debugging for DLM_GLUE,
 DLM_THREAD, DLM_MASTER and DLM_RECOVERY.  However, I don't see any DLM
 logging output in dmesg or syslog --is there perhaps another way to
 get at the actual DLM log?  I've searched around a bit but didn't find
 anything that made it clear.

 As for OCFS2 and iSCSI communications, they use the same physical
 network interface but different VLANs on that interface.  The
 connectionX:0 errors, then, seem to indicate an issue with the ISCSI
 connection.  The system logs and monitoring software don't show any
 warnings or errors about the interface going down, so the only thing I
 can think of is the connection load balancing on the SAN, though
 that's merely a hunch.  Maybe I should mail the list and see if anyone
 has a similar setup.

 If you could please point me in the right direction to make use of the
 DLM debugging via debugs.ocfs2, I would appreciate it.

 Thanks again,

 Gavin W. Jones
 Where 2 Get It, Inc.

 On Tue, Aug 6, 2013 at 4:16 PM, Goldwyn Rodrigues rgold...@suse.de
 wrote:
  Hi Gavin,
 
 
  On 08/06/2013 01:59 PM, Gavin Jones wrote:
 
  Hi Goldwyn,
 
  Apologies for the delayed reply.
 
  The hung Apache process / OCFS issue cropped up again, so I thought
  I'd pass along the contents of /proc/pid/stack of a few affected
  processes:
 
  gjones@slipapp02:~ sudo cat /proc/27521/stack
  gjones's password:
  [811663b4] poll_schedule_timeout+0x44/0x60
  [81166d56] do_select+0x5a6/0x670
  [81166fbe] core_sys_select+0x19e/0x2d0
  [811671a5] sys_select+0xb5/0x110
  [815429bd] system_call_fastpath+0x1a/0x1f
  [7f394bdd5f23] 0x7f394bdd5f23
  [] 0x
  gjones@slipapp02:~ sudo cat /proc/27530/stack
  [81249721] sys_semtimedop+0x5a1/0x8b0
  [815429bd] system_call_fastpath+0x1a/0x1f
  [7f394bdddb77] 0x7f394bdddb77
  [] 0x
  gjones@slipapp02:~ sudo cat /proc/27462/stack
  [81249721] sys_semtimedop+0x5a1/0x8b0
  [815429bd] system_call_fastpath+0x1a/0x1f
  [7f394bdddb77] 0x7f394bdddb77
  [] 0x
  gjones@slipapp02:~ sudo cat /proc/27526/stack
  [81249721] sys_semtimedop+0x5a1/0x8b0
  [815429bd] system_call_fastpath+0x1a/0x1f
  [7f394bdddb77] 0x7f394bdddb77
  [] 0x
 
 
  Additionally, in dmesg I see, for example,
 
  [774981.361149] (/usr/sbin/httpd,8266,3):ocfs2_unlink:951 ERROR: status
 =
  -2
  [775896.135467]
  (/usr/sbin/httpd,8435,3):ocfs2_check_dir_for_entry:2119 ERROR: status
  = -17
  [775896.135474] (/usr/sbin/httpd,8435,3):ocfs2_mknod:459 ERROR: status =
  -17
  [775896.135477] (/usr/sbin/httpd,8435,3):ocfs2_create:629 ERROR: status
 =
  -17
  [788406.624126] connection1:0: ping timeout of 5 secs expired, recv
  timeout 5, last rx 4491991450, last ping 4491992701, now 4491993952
  [788406.624138] connection1:0: detected conn error (1011)
  [788406.640132] connection2:0: ping timeout of 5 secs expired, recv
  timeout 5, last rx 4491991451, last ping 4491992702, now 4491993956
  [788406.640142] connection2:0: detected conn error (1011)
  [788406.928134] connection4:0: ping timeout of 5 secs expired, recv
  timeout 5, last rx 4491991524, last ping 4491992775, now 4491994028
  [788406.928150] connection4:0: detected conn error (1011)
  [788406.944147] connection5:0: ping timeout of 5 secs expired, recv
  timeout 5, last rx 4491991528, last ping 4491992779, now 4491994032
  [788406.944165] connection5:0: detected conn error (1011)
  [788408.640123] connection3:0: ping timeout of 5 secs expired, recv
  timeout 5, last rx 4491991954, last ping 4491993205, now 4491994456
  [788408.640134] connection3:0: detected conn error (1011)
  [788409.907968] connection1:0: detected conn error (1020)
  [788409.908280] connection2:0: detected conn error (1020)
  [788409.912683] connection4:0: detected conn error (1020)
  [788409.913152] connection5:0: detected conn error (1020)
  [788411.491818] connection3:0: detected conn error (1020)
 
 
  that repeats for a bit and then I see
 
  [1952161.012214] INFO: task /usr/sbin/httpd:27491 blocked for more
  than 480 seconds.
  [1952161.012219] echo 0  /proc/sys/kernel/hung_task_timeout_secs
  disables this 

Re: [Ocfs2-users] High inodes usage

2013-07-03 Thread Sunil Mushran
Hoe did you figure this out? Also, which version of the kernel are you
using?


On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel
be.nicolas.mic...@gmail.comwrote:

 Hello guys,

 I'm using OCFS2 for a shared storage (on SAN). I just saw that the inode
 usage is really high although these filesystems are used for Oracle DATA
 storage. So there are really a few big files.

 I don't understand why the inode usage is so high with such few big files
 (As an example : one of the filesystem have 16 files and directories but
 the ~26 million of inodes are almost used!)

 My questions :
 - is the inode usage can be a problem in such a situation
 - if it is : how can I reduce their number used? Or increase the pool of
 available inodes?
 - why so many inodes are used with such a few files? I was sure that there
 were traditionaly one inode used for one file or one directory.

 --
 Nicolas MICHEL

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] High inodes usage

2013-07-03 Thread Sunil Mushran
That number is typically calculated. So it just could be bad arithmetic.
But that should not affect the other ops.


On Wed, Jul 3, 2013 at 12:40 PM, Nicolas Michel be.nicolas.mic...@gmail.com
 wrote:

 I don't know if it's the root cause of my problems or if it causes any
 problem at all. But I have some stability issues on the cluster so I'm
 investigating anything that could be suspect. My question is : is it a
 normal behavior to get inode usage with df -i showing high percentage like
 98, 99 or 100%? (a touch on the filesystem with 100% inode usage still
 create a file so I suppose it is not causing any problem but I found it
 weird).


 2013/7/3 Sunil Mushran sunil.mush...@gmail.com

 That is old. It just could be a minor bug is that release. Is it causing
 you any problems?


 On Wed, Jul 3, 2013 at 12:31 PM, Nicolas Michel 
 be.nicolas.mic...@gmail.com wrote:

 Hello Sunil,

 I checked the inode usage with df -i
 I can't check the kernel version running on the system now because I'm
 not at work but it's a SLES 10 SP2, so a pretty old kernel I suppose.

 Nicolas


 2013/7/3 Sunil Mushran sunil.mush...@gmail.com

 Hoe did you figure this out? Also, which version of the kernel are you
 using?


 On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel 
 be.nicolas.mic...@gmail.com wrote:

 Hello guys,

 I'm using OCFS2 for a shared storage (on SAN). I just saw that the
 inode usage is really high although these filesystems are used for Oracle
 DATA storage. So there are really a few big files.

 I don't understand why the inode usage is so high with such few big
 files (As an example : one of the filesystem have 16 files and directories
 but the ~26 million of inodes are almost used!)

 My questions :
 - is the inode usage can be a problem in such a situation
 - if it is : how can I reduce their number used? Or increase the pool
 of available inodes?
 - why so many inodes are used with such a few files? I was sure that
 there were traditionaly one inode used for one file or one directory.

 --
 Nicolas MICHEL

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users





 --
 Nicolas MICHEL





 --
 Nicolas MICHEL
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6

2013-06-21 Thread Sunil Mushran
Can you dump the following using the 1.8 binary.
debugfs.ocfs2 -R stats /dev/mapper/.


On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:

  We have a production cluster of 6 nodes, which are currently running
 RHEL 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple
 destinations, one of them is a RHEL4 machine with OCFS2 1.2.9. Because of
 that the volumes are set so that we can read them there.

 ** **

 We are now trying to bring up a new server, this one has OEL 6.3 on it and
 it comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2
 –cloned-volume to reset the UUID, but when I try to change the label I get:
 

 ** **

 [root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP
 /dev/mapper/aucp_data_bk_2_x

 tunefs.ocfs2: Invalid name for a cluster while opening device
 /dev/mapper/aucp_data_bk_2_x

 ** **

 fsck.ocfs2 core dumps with the following, I also filed a bug on Bugzilla
 for that:

 ** **

 [root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x ** **

 fsck.ocfs2 1.8.0

 *** glibc detected *** fsck.ocfs2: double free or corruption (fasttop):
 0x0197f320 ***

 === Backtrace: =

 /lib64/libc.so.6[0x3656475366]

 fsck.ocfs2[0x434c31]

 fsck.ocfs2[0x403bc2]

 /lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]

 fsck.ocfs2[0x402879]

 === Memory map: 

 0040-0045 r-xp  fc:00 12489
 /sbin/fsck.ocfs2

 0064f000-00651000 rw-p 0004f000 fc:00 12489
 /sbin/fsck.ocfs2

 00651000-00652000 rw-p  00:00 0 

 0085-00851000 rw-p 0005 fc:00 12489
 /sbin/fsck.ocfs2

 0197e000-0199f000 rw-p  00:00 0
 [heap]

 3655c0-3655c2 r-xp  fc:00 8797
 /lib64/ld-2.12.so

 3655e1f000-3655e2 r--p 0001f000 fc:00 8797
 /lib64/ld-2.12.so

 3655e2-3655e21000 rw-p 0002 fc:00 8797
   /lib64/ld-2.12.so

 3655e21000-3655e22000 rw-p  00:00 0 

 365640-3656589000 r-xp  fc:00 8798
 /lib64/libc-2.12.so

 3656589000-3656788000 ---p 00189000 fc:00 8798
 /lib64/libc-2.12.so

 3656788000-365678c000 r--p 00188000 fc:00 8798
 /lib64/libc-2.12.so

 365678c000-365678d000 rw-p 0018c000 fc:00 8798
 /lib64/libc-2.12.so

 365678d000-3656792000 rw-p  00:00 0 

 3659c0-3659c16000 r-xp  fc:00 8802
 /lib64/libgcc_s-4.4.6-20120305.so.1

 3659c16000-3659e15000 ---p 00016000 fc:00 8802
 /lib64/libgcc_s-4.4.6-20120305.so.1

 3659e15000-3659e16000 rw-p 00015000 fc:00 8802
 /lib64/libgcc_s-4.4.6-20120305.so.1

 3d3e80-3d3e817000 r-xp  fc:00 12028
 /lib64/libpthread-2.12.so

 3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028
  /lib64/libpthread-2.12.so

 3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028
 /lib64/libpthread-2.12.so

 3d3ea18000-3d3ea19000 rw-p 00018000 fc:00 12028
 /lib64/libpthread-2.12.so

 3d3ea19000-3d3ea1d000 rw-p  00:00 0 

 3e2660-3e26603000 r-xp  fc:00 426
 /lib64/libcom_err.so.2.1

 3e26603000-3e26802000 ---p 3000 fc:00 426
 /lib64/libcom_err.so.2.1

 3e26802000-3e26803000 r--p 2000 fc:00 426
 /lib64/libcom_err.so.2.1

 3e26803000-3e26804000 rw-p 3000 fc:00 426
 /lib64/libcom_err.so.2.1

 7fb063711000-7fb063714000 rw-p  00:00 0 

 7fb06371d000-7fb06372 rw-p  00:00 0 

 7fffd5b95000-7fffd5bb6000 rw-p  00:00 0
 [stack]

 7fffd5bc5000-7fffd5bc6000 r-xp  00:00 0
 [vdso]

 ff60-ff601000 r-xp  00:00 0
 [vsyscall]

 Abort (core dumped)

 ** **

 I think one of the main question is what is the “Invalid name for a
 cluster while trying to join the group” or “Invalid name for a cluster
 while opening device”. I am pretty sure that /etc/sysconfig/o2cb and
 /etc/ocfs2/cluster.conf is correct.

 ** **

 Ulf.

 ** **

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Unable to set the o2cb heartbeat to global

2013-06-04 Thread Sunil Mushran
Support for global heartbeat was added in ocfs2-tools-1.8.


On Tue, Jun 4, 2013 at 8:31 AM, Vineeth Thampi vineeth.tha...@gmail.comwrote:

 Hi,

 I have added heartbeat mode as global, but when I do a mkfs and mount, and
 then check the mount, it says I am in local mode. Even
 /sys/kernel/config/cluster/ocfs2/heartbeat/mode says local. I am running
 CentOS with 3.x kernel, with ocfs2-tools-1.6.4-1118.

 mkfs -t ocfs2 -b 4K -C 1M -N 16 --cluster-stack=o2cb  /dev/sdb
 mount -t ocfs2 /dev/sdb /mnt -o
 noatime,data=writeback,nointr,commit=60,coherency=buffered

 ==
 node:
 ip_port = 
 ip_address = 10.81.2.108
 number = 1
 name = cam-st08
 cluster = ocfs2

 cluster:
 node_count = 2
 heartbeat_mode = global
 name = ocfs2
 ==

 root@cam-st07 log # mount | grep sdb
 /dev/sdb on /mnt type ocfs2
 (rw,_netdev,noatime,data=writeback,nointr,commit=60,coherency=buffered,heartbeat=local)

 Any help would be much appreciated.

 Thanks,

 Vineeth

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] What is the overhead/disk loss of formatting an ocfs2 filesystem?

2013-04-15 Thread Sunil Mushran
-N 16 means 16 journals. I think it defaults to 256M journals. So that's
4G. Do you plan to mount it on 16 nodes? If not, reduce that. Other options
is a smaller journal. But you have to be careful as a small journal could
limit your write thruput.


On Mon, Apr 15, 2013 at 1:37 PM, Jerry Smith jds...@sandia.gov wrote:

 Good afternoon,

 I have an OEL 6.3 box with a few ocfs2 mounts mounted locally, and was
 wondering what I should expect to lose via formatting etc from a disk
 usage standpoint.

 -bash-4.1$ df -h | grep ocfs2
 /dev/dm-15 12G  1.3G   11G  11% /ocfs2/redo0
 /dev/dm-13120G  4.2G  116G   4% /ocfs2/software-master
 /dev/dm-10 48G  4.1G   44G   9% /ocfs2/arch0
 /dev/dm-142.5T  6.7G  2.5T   1% /ocfs2/ora01
 /dev/dm-111.5T  5.7G  1.5T   1% /ocfs2/ora02
 /dev/dm-17100G  4.2G   96G   5% /ocfs2/ora03
 /dev/dm-12200G  4.3G  196G   3% /ocfs2/ora04
 /dev/dm-163.0T  7.3G  3.0T   1% /ocfs2/orabak01
 -bash-4.1$


 For example ora04 is 196GB total, but with zero usage it shows 4.3GB used:

 [root@oeldb10 ~]#df -h /ocfs2/ora04
 FilesystemSize  Used Avail Use% Mounted on
 /dev/dm-12200G  4.3G  196G   3% /ocfs2/ora04
 [root@oeldb10 ~]#find /ocfs2/ora04/ | wc -l
 3
 [root@oeldb10 ~]#find /ocfs2/ora04/ -exec du -sh {} \;
 0/ocfs2/ora04/
 0/ocfs2/ora04/lost+found
 0/ocfs2/ora04/db66snlux


 Filesystems formatted via

 mkfs -t ocfs2 -N 16 --fs-features=xattr,local -L ${device} ${device}

 Mount options

 [root@oeldb10 ~]#mount |grep ora04
 /dev/dm-12 on /ocfs2/ora04 type ocfs2
 (rw,_netdev,nointr,user_xattr,heartbeat=none)

 Thanks,

 --Jerry



 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Significant Slowdown when writing and deleting files at the same time

2013-03-29 Thread Sunil Mushran
Are you mounting -o writeback?


On Fri, Mar 29, 2013 at 12:28 PM, Andy ary...@allantgroup.com wrote:

 I have been having performance issues from time to time on our
 production ocfs2 volumes, so I set up a test system to try to reproduce
 what I was seeing on the production systems.  This is what I found out:

 I have a 2 node test system sharing a 2TB volume with a journal size of
 256MB.  I can easily trigger the slowdown by starting to processes to
 write a 10GB file each, then I delete a different large file (7GB+)
 while the other processes are writing.  The slowdown is significant and
 very disruptive.  Not only did it take over 3 minutes to delete the
 file, every else with pause when entering that directory too.  A du
 command with stop and nfs access to that file system will think the
 server is not responding.  Under heavier amounts of writes, I have had a
 delete takes 13mins for a 8GB file, and NFS mounts return I/O errors.
 We often deal with large files, so this situation above is fairly common.

 I would like any ideas that would provide smoother performance of the
 OCFS2 volume and somehow eliminate the long pauses during deletes.

 Thanks,

 Andy

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] [OCFS2] Crash at o2net_shutdown_sc()

2013-03-01 Thread Sunil Mushran
 [ 1481.620253] o2hb: Unable to stabilize heartbeart on region
1352E2692E704EEB8040E5B8FF560997 (vdb)

What this means is that the device is suspect. o2hb writes are not hitting
the disk. vdb is accepting and
acknowledging the write but spitting out something else during the next
read. Heartbeat detects this and
aborts, as it should.

Then we hit a race during socket close that triggers the oops. Yes, that
needs to be fixed. But you also
need to fix vdb... what appears to be a virtual device.


On Fri, Mar 1, 2013 at 1:25 PM, richard -rw- weinberger 
richard.weinber...@gmail.com wrote:

 Hi!

 Using 3.8.1 OCFS2 crashes while joining nodes to the cluster.
 The cluster consists of 10 nodes, while node3 joins the kernel on node3
 crashes.
 (Somtimes later...)
 See dmesg below.
 Is this a known issue? I didn't test older kernels so far.

 node1:
 [ 1471.881922] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997
 ( 0 ) 1 nodes
 [ 1471.919522] JBD2: Ignoring recovery information on journal
 [ 1471.947027] ocfs2: Mounting device (253,16) on (node 0, slot 0)
 with ordered data mode.
 [ 1475.802497] o2net: Accepted connection from node node2 (num 1) at
 192.168.66.2:
 [ 1481.814048] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 8
 [ 1481.814955] o2net: No longer connected to node node2 (num 1) at
 192.168.66.2:
 [ 1482.468827] o2net: Accepted connection from node node3 (num 2) at
 192.168.66.3:
 [ 1511.904100] o2net: No connection established with node 1 after 30.0
 seconds, giving up.
 [ 1514.472995] o2net: Connection to node node3 (num 2) at
 192.168.66.3: shutdown, state 8
 [ 1514.473960] o2net: No longer connected to node node3 (num 2) at
 192.168.66.3:
 [ 1516.076044] o2net: Accepted connection from node node2 (num 1) at
 192.168.66.2:
 [ 1520.181430] o2dlm: Node 1 joins domain
 1352E2692E704EEB8040E5B8FF560997 ( 0 1 ) 2 nodes
 [ 1544.544030] o2net: No connection established with node 2 after 30.0
 seconds, giving up.
 [ 1574.624029] o2net: No connection established with node 2 after 30.0
 seconds, giving up.

 node2:
 [ 1475.613170] o2net: Connected to node node1 (num 0) at 192.168.66.1:
 [ 1481.620253] o2hb: Unable to stabilize heartbeart on region
 1352E2692E704EEB8040E5B8FF560997 (vdb)
 [ 1481.622489] o2net: No longer connected to node node1 (num 0) at
 192.168.66.1:
 [ 1515.886605] o2net: Connected to node node1 (num 0) at 192.168.66.1:
 [ 1519.992766] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997
 ( 0 1 ) 2 nodes
 [ 1520.017054] JBD2: Ignoring recovery information on journal
 [ 1520.07] ocfs2: Mounting device (253,16) on (node 1, slot 1)
 with ordered data mode.
 [ 1520.159590] mount.ocfs2 (2186) used greatest stack depth: 2568 bytes
 left

 node3:
 [ 1482.836865] o2net: Connected to node node1 (num 0) at 192.168.66.1:
 [ 1482.837542] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1484.840952] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1486.844994] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1488.848952] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1490.853052] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1492.857046] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1494.861042] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1496.865024] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1498.869021] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1500.873016] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1502.877056] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1504.881042] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1506.885040] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1508.888991] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1510.893077] o2net: Connection to node node2 (num 1) at
 192.168.66.2: shutdown, state 7
 [ 1512.843172] (mount.ocfs2,2179,0):dlm_request_join:1477 ERROR: Error
 -107 when sending message 510 (key 0x666c6172) to node 1
 [ 1512.845580] (mount.ocfs2,2179,0):dlm_try_to_join_domain:1653 ERROR:
 status = -107
 [ 1512.847778] (mount.ocfs2,2179,0):dlm_join_domain:1955 ERROR: status =
 -107
 [ 1512.849334] (mount.ocfs2,2179,0):dlm_register_domain:2214 ERROR:
 status = -107
 [ 1512.850921] (mount.ocfs2,2179,0):o2cb_cluster_connect:368 ERROR:
 status = -107
 [ 1512.852511] (mount.ocfs2,2179,0):ocfs2_dlm_init:3004 ERROR: status =
 -107
 [ 1512.854090] (mount.ocfs2,2179,0):ocfs2_mount_volume:1881 ERROR: status
 = -107
 [ 1512.855476] ocfs2: Unmounting device (253,16) on (node 0)
 [ 

Re: [Ocfs2-users] OCFS ..Inode contains a hole at offset...

2013-02-20 Thread Sunil Mushran
This is probably a directory. debugs.ocfs2 -R 'stat 52663' /dev/ will
dump the inode.

Are you sure fsck is fixing it? Does the output show this block getting
fixed?
If not, you may want to run fsck.ocfs2 v1.8. I think a fix code was added
for it.


On Wed, Feb 20, 2013 at 1:01 AM, Fiorenza Meini fme...@esseweb.eu wrote:

 Hi there,
 I have a partition formatted with ocfs2 (1.6.3) on a 2.6.37 Linux Kernel
 system. This partition is managed by a cluster (corosync/pacemaker).
 The backend of this ocfs2 partition is drbd on Lvm.

 I see this line in the messages log file:
 ocfs2_read_virt_blocks:871 ERROR: Inode #52663 contains a hole at offset
 69632

 The error is reported more than once and the offset is the same..

 When I do a check on this partition, errors are found and resolved, but
 in a short time the problems appears again.
 I can't understand at what level is the problem:
 * kernel ?
 * hardware ?
 * lvm + drbd ?

 There are tools that can be used to understand ?
 Any suggestion?

 Thanks and regards.

 Fiorenza
 --

 Fiorenza Meini
 Spazio Web S.r.l.

 V. Dante Alighieri, 10 - 13900 Biella
 Tel.: 015.2431982 - 015.9526066
 Fax: 015.2522600
 Reg. Imprese, CF e P.I.: 02414430021
 Iscr. REA: BI - 188936
 Iscr. CCIAA: Biella - 188936
 Cap. Soc.: 30.000,00 Euro i.v.

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ocfs cluster node keeps rebooting

2013-01-14 Thread Sunil Mushran
1.2.5 is 6+ year old release. You may want to use something more current.


On Mon, Jan 14, 2013 at 12:06 PM, Bill Zha lfl200...@yahoo.com wrote:

 Hi Sunil and All,

 We have a 10 Redhat4.2-node OCFS cluster running on version 1.2.5-6.  One
 of the node started to rebooted almost everyday since last week.  The
 entire cluster had been stable for the past 1 year or so.  I captured the
 following console output, can you or someone had the similar issue let me
 know what the possible cause of these reboots?

 (25271,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156758.101016 now 1358156788.97593 dr
 1358156758.101008 adv 1358156758.101022:1358156758.101024 func
 (5d21e188:507) 1357953447.247097:1357953447.247100)
 (25267,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156758.666788 now 1358156788.663604 dr
 1358156760.666794 adv 1358156758.666793:1358156758.666795 func
 (5d21e188:505) 1357953453.107343:1357953453.107349)
 (25267,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156758.848933 now 1358156788.953367 dr
 1358156760.847939 adv 1358156758.848939:1358156758.848941 func
 (0e6eb1eb:505) 1357965605.352156:1357965605.352162)
 (25267,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156759.108373 now 1358156789.243003 dr
 1358156761.108392 adv 1358156759.108376:1358156759.108378 func
 (af22ae1f:502) 1357914301.741127:1357914301.741130)
 (25275,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156759.626366 now 1358156789.623629 dr
 1358156789.622319 adv 1358156759.626369:1358156759.626371 func
 (abd851aa:505) 1357965605.363679:1357965605.363685)
 (25275,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156759.656350 now 1358156789.913330 dr
 1358156761.656039 adv 1358156759.656354:1358156759.656355 func
 (0e6eb1eb:502) 1357907401.318584:1357907401.318587)
 (25275,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156759.663467 now 1358156790.203323 dr
 1358156761.662745 adv 1358156759.663470:1358156759.663472 func
 (7dcded64:502) 1357875986.764566:1357875986.764568)
 (25275,4):o2net_idle_timer:1426 here are some times that might help debug
 the situation: (tmr 1358156759.987324 now 1358156790.493342 dr
 1358156761.987117 adv 1358156759.987327:1358156759.987329 func
 (6bcd2bc6:502) 1357875995.47:1357875995.55)
 (25,7):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
 dm-14 after 18 milliseconds
 Heartbeat thread (25) printing last 24 blocking operations (cur = 11):
 Heartbeat thread stuck at msleep, stuffing current time into that blocker
 (index 11)
 Index 12: took 0 ms to do allocating bios for read
 Index 13: took 0 ms to do bio alloc read
 Index 14: took 0 ms to do bio add page read
 Index 15: took 0 ms to do bio add page read
 Index 16: took 0 ms to do submit_bio for read
 Index 17: took 0 ms to do waiting for read completion
 Index 18: took 0 ms to do bio alloc write
 Index 19: took 0 ms to do bio add page write
 Index 20: took 0 ms to do submit_bio for write
 Index 21: took 0 ms to do checking slots
 Index 22: took 0 ms to do waiting for write completion
 Index 23: took 100897 ms to do msleep
 Index 0: took 0 ms to do allocating bios for read
 Index 1: took 0 ms to do bio alloc read
 Index 2: took 0 ms to do bio add page read
 Index 3: took 0 ms to do bio add page read
 Index 4: took 0 ms to do submit_bio for read
 Index 5: took 0 ms to do waiting for read completion
 Index 6: took 0 ms to do bio alloc write
 Index 7: took 0 ms to do bio add page write
 Index 8: took 0 ms to do submit_bio for write
 Index 9: took 0 ms to do checking slots
 Index 10: took 0 ms to do waiting for write completion
 Index 11: took 313 ms to do msleep
 *** ocfs2 is very sorry to be fencing this system by restarting ***


 Thank you so much for your help!


 Bill

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] asynchronous hwclocks

2013-01-03 Thread Sunil Mushran
The fs does not care about time. It should have no effect on the cluster. 
However the apps may care and may behave erratically. 

On Jan 3, 2013, at 3:13 PM, Medienpark, Jakob Rößler 
roess...@medienpark.net wrote:

 Hello list,
 
 today I noticed huge differences between the hardware clocks in our cluster.
 Some details:
 
 root@www01:~# hwclock;date
 Do 03 Jan 2013 09:32:09 CET  -0.626096 seconds
 Do 3. Jan 09:34:54 CET 2013
 
 root@www02:~# hwclock;date
 Do 03 Jan 2013 09:32:09 CET  -0.626091 seconds
 Do 3. Jan 09:34:54 CET 2013
 
 root@www03:~# hwclock;date
 Do 03 Jan 2013 09:34:54 CET  -0.625820 seconds
 Do 3. Jan 09:34:54 CET 2013
 
 root@storage:~# hwclock;date
 Do 03 Jan 2013 08:34:54 CET  -0.641532 seconds
 Do 3. Jan 09:34:54 CET 2013
 
 The server 'storage' is the server which provides the iscsi device to 
 www01-03.
 Because the cluster was very unstable during load peaks, I want to ask 
 you what kind of effects it will have to ocfs2 if the hwclocks are 
 asynchronous like shown above.
 
 Thanks in advance
 
 Jakob
 
 
 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Is this a valid configuration?

2012-12-05 Thread Sunil Mushran
This is normal. My only concern is the use of very old kernel/fs versions.


On Wed, Dec 5, 2012 at 3:08 AM, Neil campbell.n...@hotmail.com wrote:

 Anyone?

 

 On 2012-11-28 00:47:56 + neil campbell campbell.n...@hotmail.com
 wrote:

 
 
  Hi list,
 
  I am running OCFS2 1.2.9-9.bug13439173 on RHEL 4 Kernel 2.6.9-89
 
  # modinfo ocfs2
 
  filename:   /lib/modules/2.6.9-89.0.26.ELsmp/kernel/fs/ocfs2/ocfs2.ko
  license:GPL
  author: Oracle
  version:1.2.9 CF6A7A44EA2581415F3D612
  description:OCFS2 1.2.9 Mon Dec  5 14:27:38 EST 2011 (build
  e5c3135c8cbf75f2620ff4c782d634f1)
  depends:ocfs2_nodemanager,ocfs2_dlm,jbd,debugfs
  vermagic:   2.6.9-89.0.26.ELsmp SMP gcc-3.4
 
  #
 
  I just have some reservations about whether the following configuration,
  where I have mount points of different file system types over an initial
  mount point (/d0) would cause any issues?
 
  LUN1LUN2LUN3  LUN4
  ||   | |
  ||   | |
  /d0 (ext3)   /d0/app (ext3)  /d0/ocfs (ocfs2)  /d0/app/html (ocfs2)
 
 
  Many thanks,
  Neil
 
 
  ___
  Ocfs2-users mailing list
  Ocfs2-users@oss.oracle.com
  https://oss.oracle.com/mailman/listinfo/ocfs2-users
 


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ls taking ages on a directory containing 900000 files

2012-12-04 Thread Sunil Mushran
strace -p PID -ttt -T

Attach and get some timings. The simplest guess is that the system lacks
memory to cache all the inodes
and thus has to hit disk (and more importantly take cluster locks) for the
same inode repeatedly. The user
guide has a section in NOTES explaining this.



On Tue, Dec 4, 2012 at 8:54 AM, Amaury Francois
amaury.franc...@digora.comwrote:

  Hello,

 ** **

 We are running OCFS2 1.8 and on a kernel UEK2. An ls on a directory
 containing approx. 1 million of files  is very long (1H). The features we
 have activated on the filesystem are the following : 

 ** **

 [root@pa-oca-app10 ~]# debugfs.ocfs2 -R stats /dev/sdb1

 Revision: 0.90

 Mount Count: 0   Max Mount Count: 20

 State: 0   Errors: 0

 Check Interval: 0   Last Check: Fri Nov 30 19:30:17 2012

 Creator OS: 0

 Feature Compat: 3 backup-super strict-journal-super

 Feature Incompat: 32592 sparse extended-slotmap inline-data
 metaecc xattr indexed-dirs refcount discontig-bg clusterinfo

 Tunefs Incomplete: 0

 Feature RO compat: 1 unwritten

 Root Blknum: 5   System Dir Blknum: 6

 First Cluster Group Blknum: 3

 Block Size Bits: 12   Cluster Size Bits: 12

 Max Node Slots: 8

 Extended Attributes Inline Size: 256

 Label: exchange2

 UUID: 2375EAF4E4954C4ABB984BDE27AC93D5

 Hash: 2880301520 (0xabade9d0)

 DX Seeds: 1678175851 1096448356 79406012 (0x6406ee6b 0x415a7964
 0x04bba3bc)

 Cluster stack: o2cb

 Cluster name: appcluster

 Cluster flags: 1 Globalheartbeat

 Inode: 2   Mode: 00   Generation: 3567595533 (0xd4a5300d)

 FS Generation: 3567595533 (0xd4a5300d)

 CRC32: 0c996202   ECC: 0819

 Type: Unknown   Attr: 0x0   Flags: Valid System Superblock

 Dynamic Features: (0x0)

 User: 0 (root)   Group: 0 (root)   Size: 0

 Links: 0   Clusters: 5242635

 ctime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012

 atime: 0x0 0x0 -- Thu Jan  1 01:00:00.0 1970

 mtime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012

 dtime: 0x0 -- Thu Jan  1 01:00:00 1970

 Refcount Block: 0

 Last Extblk: 0   Orphan Slot: 0

 Sub Alloc Slot: Global   Sub Alloc Bit: 65535

 ** **

 ** **

 May inline-data or xattr be the source of the problem ?

 ** **

 Thank you. 

 ** **

 [image: Description : Description : Description :
 cid:image001.png@01CD01F3.35091200]

 * *

 *Amaury FRANCOIS*   •  *Ingénieur*

 Mobile +33 (0)6 88  12 62 54

 *amaury.franc...@digora.com *

 * *

 *Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG*

 Tél : 0 820 200 217 - +33 (0)3 88 10 49 20 

 [image: Description : test]

 ** **

 ** **

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

image002.jpgimage001.png___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ls taking ages on a directory containing 900000 files

2012-12-04 Thread Sunil Mushran
1.5 ms per inode. Times 900K files equals 22 mins.

Large dirs are a problem is all file systems. The degree of problem
depends on the overhead. An easy solution around is to shard the
files into multilevel dirs. Like a 2 level structure of a 1000 files in
1000 dirs. Or, a 3 level structure with even fewer files per dir.

Or you could use the other approach suggested. Avoids stat()
by disabling color-ls. Or just use plain find.


On Tue, Dec 4, 2012 at 3:16 PM, Erik Schwartz schwartz.eri...@gmail.comwrote:

 Amaury, you can see in strace output that it's performing a stat on
 every file.

 Try simply:

   $ /bin/ls

 My guess is you're using a system where ls is aliased to use options
 that are more expensive.

 Best regards -

 Erik


 On 12/4/12 5:12 PM, Amaury Francois wrote:
  The strace looks like this (on all files) :
 
 
 
  1354662591.755319
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P069_F01589.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001389
 
  1354662591.756775
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P035_F01592.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001532
 
  1354662591.758376
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P085_F01559.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001429
 
  1354662591.759873
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P027_F01569.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001377
 
  1354662591.761317
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P002_F01581.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001420
 
  1354662591.762804
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P050_F01568.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001345
 
  1354662591.764216
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01567.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001541
 
  1354662591.765828
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P010_F01594.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001358
 
  1354662591.767252
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P045_F01569.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001396
 
  1354662591.768715
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P036_F01592.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.002072
 
  1354662591.770854
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01568.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001722
 
  1354662591.772643
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P009_F01600.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001281
 
  1354662591.773992
  lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P022_F01583.txt,
  {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001413
 
 
 
  We are using a 32 bits architecture, can it be the cause of the kernel
  not having enough memory ? Any possibility to change this behavior ?
 
 
 
  Description : Description : Description :
 cid:image001.png@01CD01F3.35091200
 
 
 
  * *
 
  *Amaury FRANCOIS*   •  *Ingénieur*
 
  Mobile +33 (0)6 88 12 62 54
 
  *amaury.franc...@digora.com mailto:amaury.franc...@digora.com***
 
  * *
 
  *Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG*
 
  Tél : 0 820 200 217 - +33 (0)3 88 10 49 20
 
 
 
  Description : test
 
 
 
 
 
  *De :*Sunil Mushran [mailto:sunil.mush...@gmail.com]
  *Envoyé :* mardi 4 décembre 2012 18:29
  *À :* Amaury Francois
  *Cc :* ocfs2-users@oss.oracle.com
  *Objet :* Re: [Ocfs2-users] ls taking ages on a directory containing
  90 files
 
 
 
  strace -p PID -ttt -T
 
 
 
  Attach and get some timings. The simplest guess is that the system lacks
  memory to cache all the inodes
 
  and thus has to hit disk (and more importantly take cluster locks) for
  the same inode repeatedly. The user
 
  guide has a section in NOTES explaining this.
 
 
 
 
 
  On Tue, Dec 4, 2012 at 8:54 AM, Amaury Francois
  amaury.franc...@digora.com mailto:amaury.franc...@digora.com wrote:
 
  Hello,
 
 
 
  We are running OCFS2 1.8 and on a kernel UEK2. An ls on a directory
  containing approx. 1 million of files  is very long (1H). The features
  we have activated on the filesystem are the following :
 
 
 
  [root@pa-oca-app10 ~]# debugfs.ocfs2 -R stats /dev/sdb1
 
  Revision: 0.90
 
  Mount Count: 0   Max Mount Count: 20
 
  State: 0   Errors: 0
 
  Check Interval: 0   Last Check: Fri Nov 30 19:30:17 2012
 
  Creator OS: 0
 
  Feature Compat: 3 backup-super strict-journal-super
 
  Feature Incompat: 32592 sparse extended-slotmap inline-data
  metaecc xattr indexed-dirs refcount discontig-bg clusterinfo
 
  Tunefs Incomplete: 0
 
  Feature RO compat: 1 unwritten
 
  Root Blknum: 5   System Dir Blknum: 6
 
  First Cluster Group Blknum: 3
 
  Block Size Bits: 12   Cluster Size Bits: 12
 
  Max Node Slots: 8
 
  Extended Attributes Inline Size: 256
 
  Label: exchange2
 
  UUID: 2375EAF4E4954C4ABB984BDE27AC93D5
 
  Hash: 2880301520

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
IO error on channel means the system cannot talk to the block device. The
problem
is in the block layer. Maybe a loose cable or a setup problem.
dmesg should show errors.


On Fri, Nov 9, 2012 at 10:46 AM, Laurentiu Gosu l...@easic.ro wrote:

  Hi,
 I'm using ocfs2 cluster in a production environment since almost 1 year.
 During this time i had to run a fsck.ocfs2 few months ago due to some
 errors but they were fixed.
 Now i have a big problem: I'm not able to mount the volume on any of the
 nodes. I stopped all nodes except one. Some output bellow:
 *mount /mnt/ocfs2**
 **mount.ocfs2: I/O error on channel while trying to determine heartbeat
 information**
 **
 **fsck.ocfs2 /dev/mapper/volgr1-lvol0**
 **fsck.ocfs2 1.6.3**
 **fsck.ocfs2: I/O error on channel while initializing the DLM**
 **
 **fsck.ocfs2 -n /dev/mapper/volgr1-lvol0**
 **fsck.ocfs2 1.6.3**
 **Checking OCFS2 filesystem in /dev/mapper/volgr1-lvol0:**
 **  Label:  SAN**
 **  UUID:   B4CF8D4667AF43118F3324567B90A987**
 **  Number of blocks:   2901788672**
 **  Block size: 4096**
 **  Number of clusters: 45340448**
 **  Cluster size:   262144**
 **  Number of slots:10**
 **
 **journal recovery: I/O error on channel while looking up the journal
 inode for slot 0**
 **fsck encountered unrecoverable errors while replaying the journals and
 will not continue*


 Can you give me some hints on how to debug the problem?

 Thank you,
 Laurentiu.

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
If global bitmap is gone. then the fs is unusable. But you can extract data
using
the rdump command in debugfs.ocfs. The success depends on how much of the
device is still usable.


On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban mar...@easic.ro wrote:

  I tried hacking the fsck.ocfs2 source code by not considering metaecc
 flag. Then I ran into

 journal recovery: Bad magic number in inode while looking up the journal
 inode for slot 0

 fsck encountered unrecoverable errors while replaying the journals and
 will not continue

 After bypassing journal replay function, I got

 Pass 0a: Checking cluster allocation chains
 pass0: Bad magic number in inode while looking up the global bitmap inode
 fsck.ocfs2: Bad magic number in inode while performing pass 0


 Does it mean the filesystem is destroyed completely?




 On 10.11.2012 02:54, Marian Serban wrote:

 That's the kernel:

 Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET
 2011 x86_64 x86_64 x86_64 GNU/Linux

 Anyway, I tried disabling the metaecc feature, no luck.

 [root@ro02xsrv003 ~]# tunefs.ocfs2 --fs-features=nometaecc
 /dev/mapper/volgr1-lvol0
 tunefs.ocfs2: I/O error on channel while opening device
 /dev/mapper/volgr1-lvol0

 These are the last lines of strace corresponding to the tunefs.ocfs
 command:



 open(/sys/fs/ocfs2/cluster_stack, O_RDONLY) = 4
 fstat(4, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
 0x7f54aad05000
 read(4, o2cb\n, 4096) = 5
 close(4)= 0
 munmap(0x7f54aad05000, 4096)= 0
 open(/sys/fs/o2cb/interface_revision, O_RDONLY) = 4
 read(4, 5\n, 15)  = 2
 read(4, , 13) = 0
 close(4)= 0
 stat(/sys/kernel/config, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
 statfs(/sys/kernel/config, {f_type=0x62656570, f_bsize=4096, f_blocks=0,
 f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255,
 f_frsize=4096}) = 0
 open(/dev/mapper/volgr1-lvol0, O_RDONLY) = 4
 ioctl(4, BLKSSZGET, 0x7fffce711454) = 0
 close(4)= 0
 pread(3, 
 \0\0\v\25\37\1\200\200\202@\21\2\30\26\0\0\0,\17\272\241\4\340\210\311\377\17\300\327\332\373\17...,
 4096, 532480) = 4096
 close(3)= 0
 write(2, tunefs.ocfs2, 12tunefs.ocfs2)= 12
 write(2, : , 2: )   = 2
 write(2, I/O error on channel, 20I/O error on channel)= 20
 write(2,  , 1 )= 1
 write(2, while opening device \/dev/mappe..., 47while opening device
 /dev/mapper/volgr1-lvol0) = 47
 write(2, \r\n, 2





 On 10.11.2012 02:06, Sunil Mushran wrote:

 It's either that or a check sum problem. Disable metaecc. Not sure which
 kernel you are running.
 We had fixed few problems few years ago around this. If your kernel is
 older, then it could be
 a known issue.


 On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban mar...@easic.ro wrote:

 Hi Sunil,

 Thank you for answering. Unfortunately, it doesn't seem like it's a
 hardware problem. There's no way a cable can be loose because it's iSCSI
 over 1G Ethernet (copper wires) environment. Also I performed dd
 if=/dev/ of=/dev/null and first 16GB or so are fine. Dmesg shows no
 errors.


 Also tried with debugfs.ocfs2:


 [root@ro02xsrv003 ~]# debugfs.ocfs2  /dev/mapper/volgr1-lvol0
 debugfs.ocfs2 1.6.3
 debugfs: ls
 ls: Bad magic number in inode '.'
 debugfs: slotmap
 slotmap: Bad magic number in inode while reading slotmap system file
 debugfs: stats
 Revision: 0.90
 Mount Count: 0   Max Mount Count: 20
 State: 0   Errors: 0
 Check Interval: 0   Last Check: Fri Nov  9 14:35:53 2012
 Creator OS: 0
 Feature Compat: 3 backup-super strict-journal-super
 Feature Incompat: 16208 sparse extended-slotmap inline-data
 metaecc xattr indexed-dirs refcount discontig-bg
 Tunefs Incomplete: 0
 Feature RO compat: 7 unwritten usrquota grpquota
 Root Blknum: 129   System Dir Blknum: 130
 First Cluster Group Blknum: 64
 Block Size Bits: 12   Cluster Size Bits: 18
 Max Node Slots: 10
 Extended Attributes Inline Size: 256
 Label: SAN
 UUID: B4CF8D4667AF43118F3324567B90A987
 Hash: 3698209293 (0xdc6e320d)
 DX Seed[0]: 0x9f4a2bb7
 DX Seed[1]: 0x501ddac0
 DX Seed[2]: 0x6034bfe8
 Cluster stack: classic o2cb
 Inode: 2   Mode: 00   Generation: 1093568923 (0x412e899b)
 FS Generation: 1093568923 (0x412e899b)
 CRC32: 46f2d360   ECC: 04d4
 Type: Unknown   Attr: 0x0   Flags: Valid System Superblock
 Dynamic Features: (0x0)
 User: 0 (root)   Group: 0 (root)   Size: 0
 Links: 0   Clusters: 45340448
 ctime: 0x4ee67f67 -- Tue Dec 13 00:25:43 2011
 atime: 0x0 -- Thu Jan  1 02:00:00 1970
 mtime

Re: [Ocfs2-users] Huge Problem ocfs2

2012-11-09 Thread Sunil Mushran
Yes that should be enough for that. But that won't help if the real problem
is device related.

What does debugfs.ocfs2 -R ls -l / return? If that errors, means the root
dir is gone. Maybe
best to look into your backups.


On Fri, Nov 9, 2012 at 6:01 PM, Marian Serban mar...@easic.ro wrote:

  Nope, rdump doesn't work either.

 debugfs: rdump -v / /tmp
 Copying to /tmp/
 rdump: Bad magic number in inode while reading inode 129
 rdump: Bad magic number in inode while recursively dumping inode 129


 Could you please confirm that it's enough to just force the return value
 of 0 at ocfs2_validate_meta_ecc in order to bypass the ECC checks?




 On 10.11.2012 03:55, Sunil Mushran wrote:

 If global bitmap is gone. then the fs is unusable. But you can extract
 data using
 the rdump command in debugfs.ocfs. The success depends on how much of the
 device is still usable.


 On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban mar...@easic.ro wrote:

  I tried hacking the fsck.ocfs2 source code by not considering metaecc
 flag. Then I ran into

 journal recovery: Bad magic number in inode while looking up the journal
 inode for slot 0

 fsck encountered unrecoverable errors while replaying the journals and
 will not continue

  After bypassing journal replay function, I got

 Pass 0a: Checking cluster allocation chains
 pass0: Bad magic number in inode while looking up the global bitmap inode
 fsck.ocfs2: Bad magic number in inode while performing pass 0


 Does it mean the filesystem is destroyed completely?




 On 10.11.2012 02:54, Marian Serban wrote:

 That's the kernel:

 Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET
 2011 x86_64 x86_64 x86_64 GNU/Linux

 Anyway, I tried disabling the metaecc feature, no luck.

 [root@ro02xsrv003 ~]# tunefs.ocfs2 --fs-features=nometaecc
 /dev/mapper/volgr1-lvol0
 tunefs.ocfs2: I/O error on channel while opening device
 /dev/mapper/volgr1-lvol0

 These are the last lines of strace corresponding to the tunefs.ocfs
 command:



 open(/sys/fs/ocfs2/cluster_stack, O_RDONLY) = 4
 fstat(4, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
 = 0x7f54aad05000
 read(4, o2cb\n, 4096) = 5
 close(4)= 0
 munmap(0x7f54aad05000, 4096)= 0
 open(/sys/fs/o2cb/interface_revision, O_RDONLY) = 4
 read(4, 5\n, 15)  = 2
 read(4, , 13) = 0
 close(4)= 0
 stat(/sys/kernel/config, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
 statfs(/sys/kernel/config, {f_type=0x62656570, f_bsize=4096,
 f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0},
 f_namelen=255, f_frsize=4096}) = 0
 open(/dev/mapper/volgr1-lvol0, O_RDONLY) = 4
 ioctl(4, BLKSSZGET, 0x7fffce711454) = 0
 close(4)= 0
 pread(3, 
 \0\0\v\25\37\1\200\200\202@\21\2\30\26\0\0\0,\17\272\241\4\340\210\311\377\17\300\327\332\373\17...,
 4096, 532480) = 4096
 close(3)= 0
 write(2, tunefs.ocfs2, 12tunefs.ocfs2)= 12
 write(2, : , 2: )   = 2
 write(2, I/O error on channel, 20I/O error on channel)= 20
 write(2,  , 1 )= 1
 write(2, while opening device \/dev/mappe..., 47while opening device
 /dev/mapper/volgr1-lvol0) = 47
 write(2, \r\n, 2





 On 10.11.2012 02:06, Sunil Mushran wrote:

 It's either that or a check sum problem. Disable metaecc. Not sure which
 kernel you are running.
 We had fixed few problems few years ago around this. If your kernel is
 older, then it could be
 a known issue.


 On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban mar...@easic.ro wrote:

 Hi Sunil,

 Thank you for answering. Unfortunately, it doesn't seem like it's a
 hardware problem. There's no way a cable can be loose because it's iSCSI
 over 1G Ethernet (copper wires) environment. Also I performed dd
 if=/dev/ of=/dev/null and first 16GB or so are fine. Dmesg shows no
 errors.


 Also tried with debugfs.ocfs2:


 [root@ro02xsrv003 ~]# debugfs.ocfs2  /dev/mapper/volgr1-lvol0
 debugfs.ocfs2 1.6.3
 debugfs: ls
 ls: Bad magic number in inode '.'
 debugfs: slotmap
 slotmap: Bad magic number in inode while reading slotmap system file
 debugfs: stats
 Revision: 0.90
 Mount Count: 0   Max Mount Count: 20
 State: 0   Errors: 0
 Check Interval: 0   Last Check: Fri Nov  9 14:35:53 2012
 Creator OS: 0
 Feature Compat: 3 backup-super strict-journal-super
 Feature Incompat: 16208 sparse extended-slotmap inline-data
 metaecc xattr indexed-dirs refcount discontig-bg
 Tunefs Incomplete: 0
 Feature RO compat: 7 unwritten usrquota grpquota
 Root Blknum: 129   System Dir Blknum: 130
 First Cluster Group Blknum: 64
 Block Size Bits: 12   Cluster Size Bits: 18
 Max Node Slots: 10
 Extended Attributes Inline Size: 256

Re: [Ocfs2-users] HA-OCFS2?

2012-09-13 Thread Sunil Mushran
cfs != storage

You need to get a highly available storage that is concurrently accessible
from multiple nodes.

ocfs2 will allow multiple nodes to concurrently access the same storage.
With posix semantics.
If a node dies, the remaining nodes will pause to recover and then continue
functioning. The
dead node can then restart and rejoin the cluster.

On Thu, Sep 13, 2012 at 5:02 PM, Eric epretori...@yahoo.com wrote:

 Is it possible to create a highly-available OCFS2 cluster (i.e., A storage
 cluster that mitigates the single point of failure [SPoF] created by
 storing an OCFS2 volume on a single LUN)?

 The OCFS2 Project Page makes this claim...

  OCFS2 is a general-purpose shared-disk cluster file system for Linux
 capable of providing both *high performance* and *high availability*.

 ...but without backing-up the claim of high availability storage (at
 either the HDD- or the node-level).

 I've found a couple of articles hinting at using Linux Multipathing or
 DRBD but very little detailed information about either.

 TIA,
 Eric Pretorious
 Truckee, CA

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Ocfs2-users Digest, Vol 105, Issue 4

2012-09-12 Thread Sunil Mushran
On Wed, Sep 12, 2012 at 9:45 AM, Asanka Gunasekera 
asanka_gunasek...@yahoo.co.uk wrote:

 Load O2CB driver on boot (y/n) [y]:
 Cluster stack backing O2CB [o2cb]:
 Cluster to start on boot (Enter none to clear) [ocfs2]:
 Specify heartbeat dead threshold (=7) [31]:
 Specify network idle timeout in ms (=5000) [3]:
 Specify network keepalive delay in ms (=1000) [2000]:
 Specify network reconnect delay in ms (=2000) [2000]:
 Writing O2CB configuration: OK
 Loading filesystem configfs: OK
 Mounting configfs filesystem at /sys/kernel/config: OK
 Loading filesystem ocfs2_dlmfs: OK
 Mounting ocfs2_dlmfs filesystem at /dlm: OK
 Starting O2CB cluster ocfs2: Failed
 Cluster ocfs2 created
 Node ocfsn1 added
 o2cb_ctl: Internal logic failure while adding node ocfsn2

 Stopping O2CB cluster ocfs2: OK



Something wrong with your cluster.conf. Overlapping node numbers, maybe.



 abd in the messages I time to time get below and I saw in a post that I
 can ignore this.

 modprobe: FATAL: Module ocfs2_stackglue not found.



Yes, this is harmless.
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] test inode bit failed -5

2012-08-31 Thread Sunil Mushran
nfsd encountered an error reading the device. So something in the io path
below the
fs encountered a problem. If it just happened once, then you can ignore it.

On Fri, Aug 31, 2012 at 2:23 AM, Hideyasu Kojima hid.koj...@ms.scsk.jpwrote:

 Hi
 I using ocfs2 cluster as NFS Server.

 Only once,I got a bellow error,and write error from NFS Client.
 What happend?

 kernel: (nfsd,12870,0):ocfs2_get_suballoc_slot_bit:2096 ERROR: read
 block 24993224 failed -5
 kernel: (nfsd,12870,0):ocfs2_test_inode_bit:2207 ERROR: get alloc slot
 and bit failed -5
 kernel: (nfsd,12870,0):ocfs2_get_dentry:96 ERROR: test inode bit failed -5

 I currently use kernel 2.6.18-164.el5
 OCFS2 : 1.4.7
 ocfs2-tool: 1.4.4

 Thanks.
 --


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Issue with files and folder ownership

2012-08-29 Thread Sunil Mushran
I would recommend pacemaker if the distribution you are using has all the
bits.
Manual building gets messy. Suse based distros have all the bits required
for ocfs2+pacemaker.

On Tue, Aug 28, 2012 at 10:40 PM, Emilien Macchi 
emilien.mac...@stackops.com wrote:

 Hi,

 On Wed, Aug 29, 2012 at 7:25 AM, Sunil Mushran sunil.mush...@gmail.comwrote:

 Isn't the mount point is local to the machine?


 I use iSCSI for the Block device and I mount the device (/dev/sdc1) at
 /var/lib/nova/instances.

 I've formated /dev/sdc1 in OCFS2 FS.

 Should I use Pacemaker to manage OCFS2 ?

 Thanks,

 -Emilien



 On Tue, Aug 28, 2012 at 10:14 PM, Emilien Macchi 
 emilien.mac...@stackops.com wrote:

 Hi,

 On Wed, Aug 29, 2012 at 12:36 AM, Sunil Mushran sunil.mush...@gmail.com
  wrote:

 Permissions on the mount point should be local to a machine.


 That's unthinkable if you consider that's a cluster FS which respects
 POSIX rules.


 -Emilien



 AFAIK.

 On Mon, Aug 27, 2012 at 3:08 AM, Emilien Macchi 
 emilien.mac...@stackops.com wrote:

 Hi,


 I'm working on a two nodes cluster with the goal to store virtual
 machines managed by OpenStack services and KVM Hypervisor. I also use 
 iSCSI
 Multi-Pathing for the block device.

 My cluster is running and I can mount the device (/dev/sdd1).

 I'm having some problems with POSIX rights :

- *chmod* on a file or folder is working.
- *chown* on a file or folder is not working as I want : I'm
trying to change the ownership of */var/lib/nova/instances* which
is my mount point, but when I do that, the ownership setting is not 
 applied
on the second node.

 I can't use yet OpenStack + KVM because the mount point should have
 the nova user as POSIX owner.

 Here is my *cluster.conf* :
 http://paste.openstack.org/show/oPQR5pjZETz7xSAR04so/
 And my mount point :
 */dev/sdd1 on /var/lib/nova/instances type ocfs2
 (rw,_netdev,heartbeat=local)*


 In advance thank you for your help.


 Best regards

 --
 Emilien Macchi
 *System Engineer*
 *www.stackops.com

 | *emilien.mac...@stackops.com**  *|* skype:emilien.macchi*
 * http://www.stackops.com
 *

 *

  ADVERTENCIA LEGAL 
 Le informamos, como destinatario de este mensaje, que el correo
 electrónico y las comunicaciones por medio de Internet no permiten 
 asegurar
 ni garantizar la confidencialidad de los mensajes transmitidos, así como
 tampoco su integridad o su correcta recepción, por lo que STACKOPS
 TECHNOLOGIES S.L. no asume responsabilidad alguna por tales 
 circunstancias.
 Si no consintiese en la utilización del correo electrónico o de las
 comunicaciones vía Internet le rogamos nos lo comunique y ponga en nuestro
 conocimiento de manera inmediata. Este mensaje va dirigido, de manera
 exclusiva, a su destinatario y contiene información confidencial y sujeta
 al secreto profesional, cuya divulgación no está permitida por la ley. En
 caso de haber recibido este mensaje por error, le rogamos que, de forma
 inmediata, nos lo comunique mediante correo electrónico remitido a nuestra
 atención y proceda a su eliminación, así como a la de cualquier documento
 adjunto al mismo. Asimismo, le comunicamos que la distribución, copia o
 utilización de este mensaje, o de cualquier documento adjunto al mismo,
 cualquiera que fuera su finalidad, están prohibidas por la ley.

 * PRIVILEGED AND CONFIDENTIAL 
 We hereby inform you, as addressee of this message, that e-mail and
 Internet do not guarantee the confidentiality, nor the completeness or
 proper reception of the messages sent and, thus, STACKOPS TECHNOLOGIES 
 S.L.
 does not assume any liability for those circumstances. Should you not 
 agree
 to the use of e-mail or to communications via Internet, you are kindly
 requested to notify us immediately. This message is intended exclusively
 for the person to whom it is addressed and contains privileged and
 confidential information protected from disclosure by law. If you are not
 the addressee indicated in this message, you should immediately delete it
 and any attachments and notify the sender by reply e-mail. In such case,
 you are hereby notified that any dissemination, distribution, copying or
 use of this message or any attachments, for any purpose, is strictly
 prohibited by law.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users





 --
 Emilien Macchi
 *System Engineer*
 *www.stackops.com

 | *emilien.mac...@stackops.com**  *|* skype:emilien.macchi*
 * http://www.stackops.com
 *

 *

  ADVERTENCIA LEGAL 
 Le informamos, como destinatario de este mensaje, que el correo
 electrónico y las comunicaciones por medio de Internet no permiten asegurar
 ni garantizar la confidencialidad de los mensajes transmitidos, así como
 tampoco su integridad o su correcta recepción, por lo que STACKOPS
 TECHNOLOGIES

Re: [Ocfs2-users] Issue with OCFS2 mount

2012-08-29 Thread Sunil Mushran
Forgot to add that this issue is limited to metaecc. So you could avoid the
issue in your
same setup by not enabling metaecc on the volume. And last I checked mkfs
did not
enable it by default.

On Mon, Aug 27, 2012 at 10:35 AM, Sunil Mushran sunil.mush...@gmail.comwrote:

 So you are running into a bug that has been fixed in 2.6.36. Upgrade to
 that version,
 if not something more current.

 $ git describe --tags 13ceef09
 v2.6.35-rc3-14-g13ceef0

 commit 13ceef099edd2b70c5a6f3a9ef5d6d97cda2e096
 Author: Jan Kara j...@suse.cz
 Date:   Wed Jul 14 07:56:33 2010 +0200

 jbd2/ocfs2: Fix block checksumming when a buffer is used in several
 transactions

 OCFS2 uses t_commit trigger to compute and store checksum of the just
 committed blocks. When a buffer has b_frozen_data, checksum is computed
 for it instead of b_data but this can result in an old checksum being
 written to the filesystem in the following scenario:

 1) transaction1 is opened
 2) handle1 is opened
 3) journal_access(handle1, bh)
 - This sets jh-b_transaction to transaction1
 4) modify(bh)
 5) journal_dirty(handle1, bh)
 6) handle1 is closed
 7) start committing transaction1, opening transaction2
 8) handle2 is opened
 9) journal_access(handle2, bh)
 - This copies off b_frozen_data to make it safe for transaction1
 to commit.
   jh-b_next_transaction is set to transaction2.
 10) jbd2_journal_write_metadata() checksums b_frozen_data
 11) the journal correctly writes b_frozen_data to the disk journal
 12) handle2 is closed
 - There was no dirty call for the bh on handle2, so it is never
 queued for
   any more journal operation
 13) Checkpointing finally happens, and it just spools the bh via
 normal buffer
 writeback.  This will write b_data, which was never triggered on and
 thus
 contains a wrong (old) checksum.

 This patch fixes the problem by calling the trigger at the moment data
 is
 frozen for journal commit - i.e., either when b_frozen_data is created
 by
 do_get_write_access or just before we write a buffer to the log if
 b_frozen_data does not exist. We also rename the trigger to t_frozen as
 that better describes when it is called.

 Signed-off-by: Jan Kara j...@suse.cz
 Signed-off-by: Mark Fasheh mfas...@suse.com
 Signed-off-by: Joel Becker joel.bec...@oracle.com


 On Mon, Aug 27, 2012 at 5:10 AM, Rory Kilkenny 
 rory.kilke...@ticoon.comwrote:

  # uname -a
 Linux FILEt1 2.6.34.7-0.7-desktop #1 SMP PREEMPT 2010-12-13 11:13:53
 +0100 x86_64 x86_64 x86_64 GNU/Linux

 # modinfo ocfs2
 filename:   /lib/modules/2.6.34.7-0.7-desktop/kernel/fs/ocfs2/ocfs2.ko
 license:GPL
 author: Oracle
 version:1.5.0
 description:OCFS2 1.5.0
 srcversion: B13569B35F99D43FA80D129
 depends:jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
 vermagic:   2.6.34.7-0.7-desktop SMP preempt mod_unload modversions

 # mkfs.ocfs2 --version
 mkfs.ocfs2 1.4.3




 On 12-08-24 5:44 PM, Sunil Mushran sunil.mush...@gmail.com wrote:

 What is the version of the kernel, ocfs2 and ocfs2 tools?

 uname -a
 modinfo ocfs2
 mkfs.ocfs2 --version

 On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny rory.kilke...@ticoon.com
 wrote:

 We have an HP P2000 G3 Storage array, fiber connected.  The storage array
 has a RAID5 array broken into 2 physical OCFS2 volumes (A  B).

 A  B are both mounted and formatted as NTFS.

 One of the volumes is NFS mounted.

 Every couple of months or so we start getting tons of errors on the NFS
 mounted volume:


 Aug 24 09:48:13 FILEt2 kernel: [2234285.848940]
 (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed:
 stored: 0, computed 1467126086.  Applying ECC.
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849252]
 (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32
 failed: stored: 0, computed 3828104806
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849256]
 (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed
 for extent block 1169089
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849261]
 (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849264]
 (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849267]
 (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849270]
 (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849274]
 (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849280]
 (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849284]
 (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849287]
 (ocfs2_wq,13844,7):ocfs2_delete_inode:1067

Re: [Ocfs2-users] Issue with OCFS2 mount

2012-08-24 Thread Sunil Mushran
What is the version of the kernel, ocfs2 and ocfs2 tools?

uname -a
modinfo ocfs2
mkfs.ocfs2 --version

On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny rory.kilke...@ticoon.comwrote:

  We have an HP P2000 G3 Storage array, fiber connected.  The storage
 array has a RAID5 array broken into 2 physical OCFS2 volumes (A  B).

 A  B are both mounted and formatted as NTFS.

 One of the volumes is NFS mounted.

 Every couple of months or so we start getting tons of errors on the NFS
 mounted volume:


 Aug 24 09:48:13 FILEt2 kernel: [2234285.848940]
 (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed:
 stored: 0, computed 1467126086.  Applying ECC.
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849252]
 (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32
 failed: stored: 0, computed 3828104806
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849256]
 (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed
 for extent block 1169089
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849261]
 (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849264]
 (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849267]
 (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849270]
 (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849274]
 (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849280]
 (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849284]
 (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
 Aug 24 09:48:13 FILEt2 kernel: [2234285.849287]
 (ocfs2_wq,13844,7):ocfs2_delete_inode:1067 ERROR: status = -5


 If we pull all the data off, destroy the volume, rebuilt it, and copy our
 data back, all works fine; for a while.

 This issue does not happen on the non NFS mounted volume. I am currently
 assuming the issue is with NFS and how we have it configured (which to the
 best of my knowledge is default).

 Has anyone had a similar experience and be able to share some insight and
 knowledge on any tricks with NFS and OCFS2 volumes?

 Thanks in advance.



 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 and util_file

2012-08-23 Thread Sunil Mushran
You are probably mounting the volume with the datavolume option. Instead
use the
init.ora param, filesystemio_options for force odirect and mount the volume
without
the datavolume option. This is documented in the user's guide.

On Thu, Aug 23, 2012 at 8:14 AM, Maki, Nancy nancy.m...@suny.edu wrote:

 We are getting an error ORA-29284 when using utl_file.get_line to read an
 OCFS2 file of larger than 3896 characters. Has anyone encountered this
 before?  We are at OCFS2 2.6 running on OEL 5.6.

 ** **

 Thanks,

 Nancy

 ** **

 [image: circle] http://www.suny.edu/**

 *Nancy Maki*
 *Manager of Database Services*

 Office of Information  Technology
 The State University of New York
 State University Plaza - Albany, New York 12246
 Tel: 518.320.1213   Fax: 518.320.1550

 eMail:  nancy.m...@suny.edu
 *Be a part of Generation SUNY: 
 **Facebook*http://www.facebook.com/generationsuny
 * - **Twitter* http://www.twitter.com/generationsuny* - 
 **YouTube*http://www.youtube.com/generationsuny
 

 ** **

 ** **

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

image001.gif___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] OCFS2 and util_file

2012-08-23 Thread Sunil Mushran
On Thu, Aug 23, 2012 at 10:58 AM, Maki, Nancy nancy.m...@suny.edu wrote:

 By default we mount all our OCFS2 volumes with datavolume.  To be more
 specific, the volume that we are having the issue with is not a database
 volume but a shared drive for developers to read and write other types of
 files.  Would it be appropriate to remove the datavolume mount option from
 this particular volume only and leave it on our database volumes?



Yes. datavolume was only meant for db volumes. Other volumes have never
needed it.
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] null pointer dereference

2012-08-21 Thread Sunil Mushran
You may want to run a full fsck on the fs.

fsck.ocfs2 -fy /dev/

On Tue, Aug 21, 2012 at 12:49 AM, Pawel pzl...@mp.pl wrote:

 Hi,
 After upgrading ocfs2 my cluster is instable.

 At least ones per week I can see:
 kernel panic: Null pointer dereference  at 00048
 o2dlm_blocking_ast_wrapper + 0x8/0x20 [ocfs2_stack_o2cb]
 stack:
 dlm_do_local_bast [ocfs2_dlm]
 dlm_lookup_lockers [ocfs2_dlm]
 dlm_proxy_ast_handler
 add_timer
 ..

 After that sometimes deadlock happens on another nodes. Entire cluster
 restart solve the issue.
 I see in log:
 (dlm_thread,7227,3):dlm_send_proxy_ast_msg:484 ERROR:
 ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9,
 error -107 send AST to node 4
 (dlm_thread,7227,3):dlm_flush_asts:605 ERROR: status = -107
 o2net: No connection established with node 4 after 10.0 seconds, giving up.
 o2net: No connection established with node 4 after 10.0 seconds, giving up.
 o2net: No connection established with node 4 after 10.0 seconds, giving up.
 (dlm_thread,7227,4):dlm_send_proxy_ast_msg:484 ERROR:
 ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9,
 error -107 send AST to node 4
 (dlm_thread,7227,4):dlm_flush_asts:605 ERROR: status = -107
 o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
 o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
 o2dlm: Begin recovery on domain ECB9442E19A94EAC896641BFADD55E4B for node 4
 o2dlm: Node 5 (he) is the Recovery Master for the dead node 4 in domain
 ECB9442E19A94EAC896641BFADD55E4B
 o2dlm: End recovery on domain ECB9442E19A94EAC896641BFADD55E4B


 Additionaly ~4 times per day I see:

 ocfs2_check_dir_for_entry:2119 ERROR: status = -17
 ocfs2_mknod:459 ERROR: status = -17
 ocfs2_create:629 ERROR: status = -17


 I currently use kernel 3.4.2
 my filesystem has been created with:
 -N 8-b 4096 -C 32768 --fs-features

 backup-super,strict-journal-super,sparse,extended-slotmap,inline-data,metaecc,xattr,indexed-dirs,refcount,discontig-bg,unwritten,usrquota,grpquota

 Could you tell me what could make my system instable? Which feature ?

 Thanks for any  help

 Pawel


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 https://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ocfs2 problem journal size

2012-08-02 Thread Sunil Mushran
The 4 journal inodes got zeroed out. Do you know how/why?

Have you tried running fsck with -fy (enable writes).

fsck.ocfs2 does have a check for bad journals that it will regenerate.

JOURNAL_FILE_INVALID
OCFS2 uses JDB for journalling and some journal files exist in the system
directory. Fsck has found some journal files that are invalid.
Answering yes to this question will regenerate the invalid journal files.

But that may still not work as fsck is currently bailing out during journal
recovery
that happens much earlier on.

Try with -fy. If that does not work, we'll have to reconstruct empty inodes
as
placeholders to allow fsck to complete journal recovery followed by journal
recreation.

On Wed, Aug 1, 2012 at 6:41 PM, Christophe BOUDER christophe.bou...@lip6.fr
 wrote:

 Hello,

 i use ocfs2 1.6.3 kernel 3.4.4 on debian testing
 i had problem on my infortrend device
 media error on a disk
 the result i can't mount my ocfs2 file but
 i can read the files with debugfs.ocfs2

 and my question is
 can i recover or recreate the journal size for node 8 9 10 11 ?

 thank for your help
 here's some log :

 # mount /data
 mount.ocfs2: Internal logic failure while trying to join the group


 # fsck.ocfs2 -n /dev/sdc1
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/sdc1:
   Label:  data
   UUID:   9B655B51E6874480BBC1309DCA048A39
   Number of blocks:   4027690992
   Block size: 4096
   Number of clusters: 251730687
   Cluster size:   65536
   Number of slots:32

 journal recovery: I/O error on channel while reading cached inode 112 for
 slot 8's journal
 fsck encountered unrecoverable errors while replaying the journals and
 will not continue

 # echo ls -l // | debugfs.ocfs2 /dev/sdc1 |grep journal
 debugfs.ocfs2 1.6.3
 55  -rw-r--r--   1 0 0   268435456
 23-Jun-2007
 21:30 journal:
 56  -rw-r--r--   1 0 0   268435456
 23-Jun-2007
 21:30 journal:0001
 57  -rw-r--r--   1 0 0   268435456
 23-Jun-2007
 21:30 journal:0002
 58  -rw-r--r--   1 0 0   268435456
 23-Jun-2007
 21:30 journal:0003
 59  -rw-r--r--   1 0 0   268435456
 23-Jun-2007
 21:31 journal:0004
 79  -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:45 journal:0005
 80  -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:45 journal:0006
 81  -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:45 journal:0007
 112 --   0 0 0   0
  1-Jan-1970
 01:00 journal:0008
 113 --   0 0 0   0
  1-Jan-1970
 01:00 journal:0009
 114 --   0 0 0   0
  1-Jan-1970
 01:00 journal:0010
 115 --   0 0 0   0
  1-Jan-1970
 01:00 journal:0011
 116 -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:46 journal:0012
 117 -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:47 journal:0013
 118 -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:47 journal:0014
 119 -rw-r--r--   1 0 0   268435456
 31-Aug-2007
 00:47 journal:0015
 142 -rw-r--r--   1 0 0   268435456
 29-May-2009
 22:53 journal:0016
 143 -rw-r--r--   1 0 0   268435456
 29-May-2009
 22:54 journal:0017
 166 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:36 journal:0018
 167 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:36 journal:0019
 168 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:37 journal:0020
 169 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:37 journal:0021
 170 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:38 journal:0022
 171 -rw-r--r--   1 0 0   268435456
 31-Jan-2010
 15:38 journal:0023
 208 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:35 journal:0024
 209 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:35 journal:0025
 210 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:36 journal:0026
 211 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:36 journal:0027
 212 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:36 journal:0028
 213 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:36 journal:0029
 214 -rw-r--r--   1 0 0   268435456
 21-Nov-2010
 19:37 journal:0030
 215 -rw-r--r--   1 0 0   268435456
 

Re: [Ocfs2-users] ocfs2 problem journal size

2012-08-02 Thread Sunil Mushran
oh crap. The dlm lock needs to lock the journals. So you need to recreate
the
journal inodes with i_size 0.

dd a good journal inode and edit it using binary editor. Change the inode
num
to the block number, zero out the i_size and next_free_extent. Repeat for
the
4 inodes.

Hopefully some one on the list has the time to help you further.

On Thu, Aug 2, 2012 at 10:50 AM, Christophe BOUDER 
christophe.bou...@lip6.fr wrote:

 hello,

  The 4 journal inodes got zeroed out. Do you know how/why?

 raid6 with 2 bad disk
 and a third who got problem
 reinsert it in the device it appears good
 but it also crash the device not recognize by the system.

 
  Have you tried running fsck with -fy (enable writes).

 yes but without success
 #fsck.ocfs2 -fy /dev/sdc1
 fsck.ocfs2 1.6.3
 fsck.ocfs2: Internal logic failure while initializing the DLM

  Try with -fy. If that does not work, we'll have to reconstruct empty
  inodes
  as
  placeholders to allow fsck to complete journal recovery followed by
  journal
  recreation.

 ok how can i do that ?


 --
 Christophe Bouder,


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] ocfs2-tools git: broken after commit deb5ade9145f8809f1fde19cf53bdfdf1fb7963e

2012-07-26 Thread Sunil Mushran
On Thu, Jul 26, 2012 at 6:37 AM, Dzianis Kahanovich
maha...@bspu.unibel.bywrote:

 ocfs2-tools git wrong commit: deb5ade9145f8809f1fde19cf53bdfdf1fb7963e.

 After cleanup unused variable:
 -else
 -tmp = g_list_append(elem, cfs);

 o2cb_ctl starts to ignore 1 node. Good commit must be:
 else
 -tmp = g_list_append(elem, cfs);
 +g_list_append(elem, cfs);

 Attached patch.


Thanks.

Acked-by: Sunil Mushran sunil.mush...@gmail.com
___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Removing a node from cluster.conf (on a specific node)

2012-04-29 Thread Sunil Mushran
Online add/remove of nodes and of global heartbeat devices has been in mainline 
for over a year. I think 2.6.38+ and tools 1.8. The ocfs2-tools tree hosted on 
oss.oracle.com/git has a 1.8.2 tag that can be used safely. It has been fully 
tested. The user's guide has been moved to man pages bundled with the tools. Do 
man ocfs2 after building and installing the tools.

On Apr 29, 2012, at 1:21 PM, Sébastien Riccio s...@swisscenter.com wrote:

 Hi dear list,
 
 I think the subjet might already been discussed, but I can only found 
 old threads about removing a node from the cluster.
 
 I was hoping that in 2012 it would be possible to dynamically add/remove 
 nodes from a shared filesystem but this evening I had this problem:
 
 I wanted to add a node to our ocfs2 cluster, node named xen-blade11 with 
 ip 10.111.10.111
 
 So on every other node I ran this command:
 
 o2cb_ctl -C -i -n xen-blade11 -t node -a number=5 -a 
 ip_address=10.111.10.111 -a ip_port= -a cluster=ocfs2
 
 Which successfully added the node to every cluster node, except on 
 xen-server16
 
 On every node the original cluster.conf was:
 
 node:
 ip_port = 
 ip_address = 10.111.10.116
 number = 0
 name = xen-blade16
 cluster = ocfs2
 
 node:
 ip_port = 
 ip_address = 10.111.10.115
 number = 1
 name = xen-blade15
 cluster = ocfs2
 
 node:
 ip_port = 
 ip_address = 10.111.10.114
 number = 2
 name = xen-blade14
 cluster = ocfs2
 
 node:
 ip_port = 
 ip_address = 10.111.10.113
 number = 3
 name = xen-blade13
 cluster = ocfs2
 
 node:
 ip_port = 
 ip_address = 10.111.10.112
 number = 4
 name = xen-blade12
 cluster = ocfs2
 
 cluster:
 node_count = 5
 name = ocfs2
 
 
 After adding the node, on every cluster.conf I can see that this was added:
 
 node:
 ip_port = 
 ip_address = 10.111.10.111
 number = 5
 name = xen-blade11
 cluster = ocfs2
 
 cluster:
 node_count = 6
 name = ocfs2
 
 EXCEPT on xen-blade16
 
 It added like this:
 
 node:
 ip_port = 
 ip_address = 10.111.10.111
 number = 6
 name = xen-blade11
 cluster = ocfs2
 
 cluster:
 node_count = 6
 name = ocfs2
 
 (Notice the number = 6 instead of number = 5)
 
 So now when i'm trying to connect the xen-blade11 every host accept the 
 connection except the xen-blade16, and the cluster joining is being 
 rejected.
 
 as we can see in the kernel messages on xen-blade11
 
 [ 1852.729539] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1852.729892] o2net: Connected to node xen-blade12 (num 4) at 
 10.111.10.112:
 [ 1852.737122] o2net: Connected to node xen-blade14 (num 2) at 
 10.111.10.114:
 [ 1852.741408] o2net: Connected to node xen-blade15 (num 1) at 
 10.111.10.115:
 [ 1854.733759] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1856.737129] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1856.764520] OCFS2 1.5.0
 [ 1858.740877] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1860.744847] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1862.748919] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1864.752929] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1866.756825] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1868.760809] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1870.764937] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1872.768905] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1874.772947] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1876.776928] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1878.780828] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1880.784974] o2net: Connection to node xen-blade16 (num 0) at 
 10.111.10.116: shutdown, state 7
 [ 1882.784529] o2net: No connection established with node 0 after 30.0 
 seconds, giving up.
 [ 1912.864531] o2net: No connection established with node 0 after 30.0 
 seconds, giving up.
 [ 1917.028531] o2cb: This node could not connect to nodes: 0.
 [ 1917.028684] o2cb: Cluster check failed. Fix errors before retrying.
 [ 1917.028758] (mount.ocfs2,4238,4):ocfs2_dlm_init:3001 ERROR: status = -107
 [ 1917.028880] (mount.ocfs2,4238,4):ocfs2_mount_volume:1879 ERROR: 

Re: [Ocfs2-users] Permission denied on ocfs2 cluster

2012-03-16 Thread Sunil Mushran
Could be selinux related. I mean it is a permission issue. So you have to look 
at all the security regimes. rwx, posix acl, selinux, etc. 

On Mar 16, 2012, at 8:00 AM, зоррыч zo...@megatrone.ru wrote:

 Any idea?
 
 
 
 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com
 [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of зоррыч
 Sent: Thursday, March 15, 2012 11:26 PM
 To: 'Sunil Mushran'
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Permission denied on ocfs2 cluster
 
 [root@noc-1-synt /]# ls -lh | grep ocfs
 drwxr-xr-x.   3 root root 3.9K Mar 15 02:20 ocfs
 [root@noc-1-synt /]# chmod -R gou+rwx ./ocfs/ [root@noc-1-synt /]# ls -lh |
 grep ocfs
 drwxrwxrwx.   3 root root 3.9K Mar 15 02:20 ocfs
 [root@noc-1-synt /]# cd ./ocfs/
 [root@noc-1-synt ocfs]# mkdir 1233
 mkdir: cannot create directory `1233': Permission denied [root@noc-1-synt
 ocfs]#
 Strace:
 [root@noc-1-synt ocfs]# strace mkdir 1233 execve(/bin/mkdir, [mkdir,
 1233], [/* 28 vars */]) = 0
 brk(0)  = 0x2132000
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
 0x7fbd67514000
 access(/etc/ld.so.preload, R_OK)  = -1 ENOENT (No such file or
 directory)
 open(/etc/ld.so.cache, O_RDONLY)  = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=45938, ...}) = 0 mmap(NULL, 45938,
 PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd67508000
 close(3)= 0
 open(/lib64/libselinux.so.1, O_RDONLY) = 3 read(3,
 \177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0PX\0D2\0\0\0...,
 832) = 832
 fstat(3, {st_mode=S_IFREG|0755, st_size=124624, ...}) = 0 mmap(0x324400,
 2221912, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) =
 0x324400 mprotect(0x324401d000, 2093056, PROT_NONE) = 0
 mmap(0x324421c000, 8192, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c000) = 0x324421c000
 mmap(0x324421e000, 1880, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x324421e000
 close(3)= 0
 open(/lib64/libc.so.6, O_RDONLY)  = 3
 read(3,
 \177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\360\355\201B2\0\0\0...,
 832) = 832
 fstat(3, {st_mode=S_IFREG|0755, st_size=1979000, ...}) = 0
 mmap(0x324280, 3803304, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE,
 3, 0) = 0x324280 mprotect(0x3242997000, 2097152, PROT_NONE) = 0
 mmap(0x3242b97000, 20480, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x197000) = 0x3242b97000
 mmap(0x3242b9c000, 18600, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3242b9c000
 close(3)= 0
 open(/lib64/libdl.so.2, O_RDONLY) = 3
 read(3,
 \177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\340\r\300B2\0\0\0..., 832)
 = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=22536, ...}) = 0 mmap(NULL,
 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
 0x7fbd67507000 mmap(0x3242c0, 2109696, PROT_READ|PROT_EXEC,
 MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3242c0 mprotect(0x3242c02000,
 2097152, PROT_NONE) = 0 mmap(0x3242e02000, 8192, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3242e02000
 close(3)= 0
 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
 0x7fbd67505000 arch_prctl(ARCH_SET_FS, 0x7fbd675057a0) = 0
 mprotect(0x324421c000, 4096, PROT_READ) = 0 mprotect(0x3242b97000, 16384,
 PROT_READ) = 0 mprotect(0x3242e02000, 4096, PROT_READ) = 0
 mprotect(0x324261f000, 4096, PROT_READ) = 0
 munmap(0x7fbd67508000, 45938)   = 0
 statfs(/selinux, {f_type=0xf97cff8c, f_bsize=4096, f_blocks=0, f_bfree=0,
 f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255,
 f_frsize=4096}) = 0
 brk(0)  = 0x2132000
 brk(0x2153000)  = 0x2153000
 open(/usr/lib/locale/locale-archive, O_RDONLY) = 3 fstat(3,
 {st_mode=S_IFREG|0644, st_size=99158704, ...}) = 0 mmap(NULL, 99158704,
 PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd61674000
 close(3)= 0
 mkdir(1233, 0777) = -1 EACCES (Permission denied)
 open(/usr/share/locale/locale.alias, O_RDONLY) = 3 fstat(3,
 {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0 mmap(NULL, 4096,
 PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67513000
 read(3, # Locale name alias data base.\n#..., 4096) = 2512
 read(3, , 4096)   = 0
 close(3)= 0
 munmap(0x7fbd67513000, 4096)= 0
 open(/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo, O_RDONLY) =
 -1 ENOENT (No such file or directory)
 open(/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1
 ENOENT (No such file or directory)
 open(/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1
 ENOENT (No such file or directory)
 open(/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1
 ENOENT (No such file or directory

Re: [Ocfs2-users] Permission denied on ocfs2 cluster

2012-03-15 Thread Sunil Mushran
strace may show more. I would first confirm that my perms are correct.

On 03/15/2012 07:58 AM, ?? wrote:
 I am testing the scheme of drbd and ocfs2

 If you attempt to write to the cluster error:

 [root@noc-1-m77 share]# mkdir 12

 mkdir: cannot create directory `12': Permission denied

 [root@noc-1-m77 share]#

 Config:

 [root@noc-1-m77 /]# cat /etc/ocfs2/cluster.conf

 cluster:

 node_count = 2

 name = cluster-ocfs2

 node:

 ip_port = 

 ip_address = 10.1.20.10

 number = 0

 name = noc-1-synt.rutube.ru

 cluster = cluster-ocfs2

 node:

 ip_port = 

 ip_address = 10.2.20.9

 number = 1

 name = noc-1-m77.rutube.ru

 cluster = cluster-ocfs2

 logs:

 Mar 15 05:42:04 noc-1-synt kernel: OCFS2 1.5.0

 Mar 15 05:42:04 noc-1-synt kernel: o2dlm: Nodes in domain
 5426CCF9AC414CD59E78F3AE48B9DE2C: 1

 Mar 15 05:42:04 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
 (node 1, slot 0) with ordered data mode.

 Mar 15 05:42:07 noc-1-synt kernel: o2net: accepted connection from node
 noc-1-m77.rutube.ru (num 2) at 10.2.20.9:

 Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Node 2 joins domain
 5426CCF9AC414CD59E78F3AE48B9DE2C

 Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Nodes in domain
 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2

 Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Node 2 leaves domain
 5426CCF9AC414CD59E78F3AE48B9DE2C

 Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Nodes in domain
 5426CCF9AC414CD59E78F3AE48B9DE2C: 1

 Mar 15 05:50:56 noc-1-synt kernel: o2net: connection to node
 noc-1-m77.rutube.ru (num 2) at 10.2.20.9: shutdown, state 8

 Mar 15 05:50:56 noc-1-synt kernel: o2net: no longer connected to node
 noc-1-m77.rutube.ru (num 2) at 10.2.20.9:

 Mar 15 05:51:12 noc-1-synt kernel: ocfs2: Unmounting device (147,0) on
 (node 1)

 Mar 15 05:51:45 noc-1-synt kernel: o2net: accepted connection from node
 noc-1-m77.rutube.ru (num 2) at 10.2.20.9:

 Mar 15 05:51:47 noc-1-synt kernel: o2dlm: Nodes in domain
 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2

 Mar 15 05:51:47 noc-1-synt kernel: ocfs2: Mounting device (147,0) on
 (node 1, slot 1) with ordered data mode.

 How do I fix this?



 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] ocfs2-1.4.7 is not binding in scientific linux 6.2

2012-03-12 Thread Sunil Mushran
ocfs2 1.4 will not build with 2.6.32. A better solution is to
just enable ocfs2 in the 2.6.32 kernel src tree and build.

On 03/11/2012 07:37 AM, зоррыч wrote:
 Hi.

 I use scientific linux 6.2:

 [root@noc-1-m77 ocfs2-1.4.7]# cat /etc/redhat-release

 Scientific Linux release 6.2 (Carbon)

 [root@noc-1-m77 ocfs2-1.4.7]# uname -r

 2.6.32-220.4.1.el6.x86_64

 Does not compile:

 [root@noc-1-m77 ocfs2-1.4.7]# ./configure
 --with-kernel=/usr/src/kernels/2.6.32-220.7.1.el6.x86_64

 checking build system type... x86_64-unknown-linux-gnu

 checking host system type... x86_64-unknown-linux-gnu

 checking for gcc... gcc

 checking for C compiler default output file name... a.out

 checking whether the C compiler works... yes

 checking whether we are cross compiling... no

 checking for suffix of executables...

 checking for suffix of object files... o

 checking whether we are using the GNU C compiler... yes

 checking whether gcc accepts -g... yes

 checking for gcc option to accept ANSI C... none needed

 checking how to run the C preprocessor... gcc -E

 checking for a BSD-compatible install... /usr/bin/install -c

 checking whether ln -s works... yes

 checking for egrep... grep -E

 checking for ANSI C header files... yes

 checking for an ANSI C-conforming const... yes

 checking for vendor... not found

 checking for vendor kernel... not supported

 checking for debugging... no

 checking for directory with kernel build tree...
 /usr/src/kernels/2.6.32-220.7.1.el6.x86_64

 checking for kernel version... 2.6.32-220.7.1.el6.x86_64

 checking for directory with kernel sources...
 /usr/src/kernels/2.6.32-220.7.1.el6.x86_64

 checking for kernel source version... 2.6.32-220.7.1.el6.x86_64

 checking for struct delayed_work in workqueue.h... yes

 checking for uninitialized_var() in compiler-gcc4.h... yes

 checking for zero_user_page() in highmem.h... no

 checking for do_sync_mapping_range() in fs.h... yes

 checking for fault() in struct vm_operations_struct in mm.h... yes

 checking for f_path in fs.h... yes

 checking for enum umh_wait in kmod.h... yes

 checking for inc_nlink() in fs.h... yes

 checking for drop_nlink() in fs.h... yes

 checking for kmem_cache_create() with dtor arg in slab.h... no

 checking for kmem_cache_zalloc in slab.h... yes

 checking for flag FS_RENAME_DOES_D_MOVE in fs.h... yes

 checking for enum FS_OCFS2 in sysctl.h... yes

 checking for configfs_depend_item() in configfs.h... yes

 checking for register_sysctl() with two args in sysctl.h... no

 checking for su_mutex in struct configfs_subsystem in configfs.h... yes

 checking for struct subsystem in kobject.h... no

 checking for is_owner_or_cap() in fs.h... yes

 checking for fallocate() in fs.h... yes

 checking for struct splice_desc in splice.h... yes

 checking for MNT_RELATIME in mount.h... yes

 checking for should_remove_suid() in fs.h... no

 checking for generic_segment_checks() in fs.h... no

 checking for s_op declared as const in struct super_block in fs.h... yes

 checking for i_op declared as const in struct inode in fs.h... yes

 checking for f_op declared as const in struct file in fs.h... yes

 checking for a_ops declared as const in struct address_space in fs.h... yes

 checking for aio_read() in struct file_operations using iovec in fs.h... yes

 checking for __splice_from_pipe() in splice.h... yes

 checking for old bio_end_io_t in bio.h... no

 checking for b_size is u32 struct buffer_head in buffer_head.h... no

 checking for exportfs.h... yes

 checking for linux/lockdep.h... yes

 checking for mandatory_lock() in fs.h... yes

 checking for range prefix in struct writeback_control... yes

 checking for SYNC_FILE_RANGE flags... yes

 checking for blkcnt_t in types.h... yes

 checking for i_private in struct inode... yes

 checking for page_mkwrite in struct vm_operations_struct... no

 checking for get_sb_bdev() with 5 arguments in fs.h... no

 checking for read_mapping_page in pagemap.h... yes

 checking for ino_t in filldir_t in fs.h... no

 checking for invalidatepage returning int in fs.h... no

 checking for get_blocks_t type... no

 checking for linux/uaccess.h... yes

 checking for system_utsname in utsname.h... no

 checking for MS_LOOP_NO_AOPS flag defined... no

 checking for fops-sendfile() in fs.h... no

 checking for task_pid_nr in sched.h... yes

 checking for confirm() in struct pipe_buf_operations in pipe_fs_i.h... yes

 checking for mutex_lock_nested() in mutex.h... yes

 checking for inode_double_lock) in fs.h... no

 checking for splice_read() in fs.h... yes

 checking for sops-statfs takes struct super_block * in fs.h... no

 checking for le16_add_cpu() in byteorder/generic.h... yes

 checking for le32_add_cpu() in byteorder/generic.h... yes

 checking for le64_add_cpu() in byteorder/generic.h... yes

 checking for be32_add_cpu() in byteorder/generic.h... yes

 checking for clear_nlink() in fs.h... yes

 configure: creating ./config.status

 config.status: creating Config.make

 

Re: [Ocfs2-users] ocfs2console hangs on startup

2012-03-10 Thread Sunil Mushran
ocfs2console has been obsoleted. Just use the utilities directly.
To detect ocfs2 volumes, use blkid. You can use it to restrict
the lookup paths. Refer its manpage.

On 03/09/2012 06:15 PM, John Major wrote:
 Hi,

 Hope this is the right place to ask this.

 I have set up 2 ubuntu lts machines with an IBM iscsi san. I have set up
 multipathd and ocfs2 and it seems to be working.

 The problem is that when I run up ocfs2console it hangs (the console
 app, not the system). Using strace, I can see that it is running through
 all the /dev/sdx devices and loops trying to access the first one in
 'ghost' state per 'multipath -ll'.

 Is there a way to restrict which devices the app looks at as it starts
 to say  /dev/mapper/mpath* since I don't actually want it to access any
 of the /dev/sd.. devices directly?

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 1.2/1.6

2012-03-02 Thread Sunil Mushran
The file system on-disk image has not changed. So the 1.6 file system
software can mount the volume created with 1.2 mkfs. What you cannot do
is concurrently mount the same volume with nodes running 1.2 and 1.6 
versions of the file system software.

It is not mixed mode. The 1.6 fs software will read the on-disk features
on the 1.2 volume and limit the functioning on that volume to just that.
Perfectly normal.

Yes, you can add the tablespace on the 1.2 volume.

For the 1.2 volume to be able to use 1.6 features, the said features
will have to be enabled. Once you do enable those features, the volume
will not be mountable on the older RHEL4 boxes unless those features
are disabled. There is a whole section in the users' guide that explains
this in more detail.

On 03/02/2012 08:09 AM, Maki, Nancy wrote:
 We are in the process of migrating to new database servers. Our current
 RAC clusters are running OCFS2 1.2.9 on RHEL 4. Our new servers are
 running OCFS2 1.6 OEL5. If possible, we would like to minimize the
 amount of data that needs to move as we migrate to the new servers. We
 have the following questions:

 1.Can we mount an existing OCFS2 1.2 volume on a servers running OCFS2 1.6?

 2.Are there any negative implications of being in a mixed mode?

 3.If we need to add a OCFS2 1.6 volume to increase a tablespace size,
 can we have one datafile be OCFS2 1.2 and another

 be OCFS2 1.6 for the same tablespace?

 4.Can we use OCFS2 1.6 features against an OCFS2 1.2 volume mounted on
 OCFS2 1.6?

 Thank you,

 Nancy

 circle http://www.suny.edu/**

   

 *Nancy Maki*
 /Manager of Database Services/

 Office of Information  Technology
 The State University of New York
 State University Plaza - Albany, New York 12246
 Tel: 518.320.1213 Fax: 518.320.1550

 eMail: nancy.m...@suny.edu
 */Be a part of Generation SUNY: /**/Facebook/*
 http://www.facebook.com/generationsuny*/- /**/Twitter/*
 http://www.twitter.com/generationsuny*/- /**/YouTube/*
 http://www.youtube.com/generationsuny



 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Ocfs2-users Digest, Vol 98, Issue 9

2012-03-02 Thread Sunil Mushran
On 02/29/2012 04:10 PM, David Johle wrote:
 I too have seen some serious performance issues under 1.4, especially
 with writes.  I'll share some info I've gathered on this topic, take
 it however you wish...

 In the past I never really thought about running benchmarks against
 the shared block device as a baseline to compare with the
 filesystem.  So today I did run several dd tests of my own (both read
 and write) against a shared block device (different LUN, but using
 the exact same storage hardware including specific disks as the one
 with OCFS2).

 My tests were not in line with those of Erik Schwartz, as I
 determined the performance degradations to be OCFS2 related.

 I have a a fs shared by 2 nodes, both are dual quad core xeon systems
 with 2 dedicated storage NICs per box.
 Storage is a Dell/EqualLogic iSCSI SAN with 3 gigE NICs, dedicated
 gigE switches, using jumbo frames.
 I'm using dm-multipath as well.

 RHEL5 (2.6.18-194.3.1.el5 kernel)
 ocfs2-2.6.18-194.11.4.el5-1.4.7-1.el5
 ocfs2-tools-1.4.4-1.el5

 Using the individual /dev/sdX vs. the /dev/mapper/mpathX devices
 indicates that multipath is working properly as the numbers are close
 to double what the separates each give.

 Given the hardware, I'd consider 200MB/s a limit for a single box and
 300MB/s the limit for the SAN.

 Block device:
 Sequential reads tend to be in the 180-190MB/s range with just one
 node reading.
 Both nodes simultaneously reading gives about 260-270MB/s total throughput.
 Sequential writes tend to be in the 115-140MB/s range with just one
 node writing.
 Both nodes simultaneously writing gives about 200-230MB/s total throughput.

 OCFS2:
 Sequential reads tend to be in the 80-95MB/s range with just one node reading.
 Both nodes simultaneously reading gives about 125-135MB/s total throughput.
 Sequential writes tend to be in the 5-20MB/s range with just one node writing.
 Both nodes simultaneously writing (different files) gives unbearably
 slow performance of less than 1MB/s total throughput.

 Now one thing I will say is that I was testing on a mature
 filesystem that has been in use for quite some time.  Tons of file
 directory creation, reading, updating, deleting, over the course of a
 couple years.

 So to see how that might affect things, I then created a new
 filesystem on that same block device I used above (with same options
 as the mature one) and ran the set of dd-based fs tests on that.

 Create params: -b 4K -C 4K
 --fs-features=backup-super,sparse,unwritten,inline-data
Mount params: -o noatime,data=writeback

 Fresh OCFS2:
 Sequential reads tend to be in the 100-125MB/s range with just one
 node reading.
 Both nodes simultaneously reading gives about 165-180MB/s total throughput.
 Sequential writes tend to be in the 120-140MB/s range with just one
 node writing.
 Both nodes simultaneously writing (different files) gives reasonable
 performance of around 100MB/s total throughput.


 Wow, what a difference!  I will say that, for the mature filesystem
 above that is performing poorly, it has definitely gotten worse over
 time.  It seems to me that the filesystem itself has some time or
 usage based performance degradation issues.

 I'm actually thinking it would be to the benefit of my cluster to
 create a new volume, shut down all applications, copy the contents
 over, shuffle mount points, and start it all back up.  The only
 problem is that this will make for some highly unappreciated
 downtime!  Also, I'm concerned that all that copying and loading it
 up with contents may just result in the same performance losses,
 making the whole process just wasted effort.


We have worked on reducing fragmentation in later releases. One specific
feature added was allocation reservation (in 2.6.35). It is available
in prod releases starting 1.6.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Concurrent write performance issues with OCFS2

2012-02-28 Thread Sunil Mushran
In 1.4, the local allocator window is small. 8MB. Meaning the node
has to hit the global bitmap after every 8MB. In later releases, the
window is much larger.

Second, a single node is not a good baseline. A better baseline is
multiple nodes writing concurrently to the block device. Not fs.
Use dd. Set different write offsets. This should help figure out how
the shared device works with multiple nodes.

On 2/28/2012 9:24 AM, Erik Schwartz wrote:
 I have a two-node RHEL5 cluster that runs the following Linux kernel and
 accompanying OCFS2 module packages:

* kernel-2.6.18-274.17.1.el5
* ocfs2-2.6.18-274.17.1.el5-1.4.7-1.el5

 A 2.5TB LUN is presented to both nodes via DM-Multipath. I have carved
 out a single partition (using the entire LUN), and formatted it with OCFS2:

# mkfs.ocfs2 -N 2 -L 'foofs' -T datafiles /dev/mapper/bams01p1

 Finally, the filesystem is mounted to both nodes with the following options:

# mount | grep bams01
 /dev/mapper/bams01p1 on /foofs type ocfs2
 (rw,_netdev,noatime,data=writeback,heartbeat=local)

 --

 When a single node is writing arbitrary data (i.e. dd(1) with /dev/zero
 as input) to a large (say, 10 GB) file in /foofs, I see the expected
 performance of ~850 MB/sec.

 If both nodes are concurrently writing large files full of zeros to
 /foofs, performance drops way down to ~45 MB/s. I experimented with each
 node writing to /foofs/test01/ and /foofs/test02/ subdirectories,
 respectively, and found that performance increased slightly to a - still
 poor - 65 MB/s.

 --

 I understand from searching past mailing list threads that the culprit
 is likely related to the negotiation of file locks, and waiting for data
 to be flushed to journal / disk.

 My two questions are:

 1. Does this dramatic write performance slowdown sound reasonable and
 expected?

 2. Are there any OCFS2-level steps I can take to improve this situation?


 Thanks -



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
On 02/01/2012 07:02 AM, Mark wrote:
 One more thing.  When I straced one of the application processes (these are 
 the
 processes that create the files) I saw this:

 % time seconds  usecs/callcalls errors syscall
 --- --  --  -- ---
68.94   3.002017 11127154open
18.93   0.929679   2   418108read
12.40   0.543714   2   257548write

 So it seams that inode creation is the biggest time consumer by far.

Yes. open() triggers cluster lock creation which cannot be skipped. 
Reads and writes could skip cluster activity if the node already has the 
appropriate lock level.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Extend space on ocfs mount point

2012-02-01 Thread Sunil Mushran
I am not aware of any downsizes in resizing.

On 02/01/2012 09:57 AM, Kalra, Pratima wrote:
 We have a ucm installation on ocfs mount point and we need to increase
 the space on that mount point from 20gb to 30 gb. Is this possible
 without resulting in any after effects?
 Pratima.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
debugfs.ocfs2 -R stats /dev/mapper/...
I want to see the features enabled.

The main issue with large metdata is the fsck timing. The recently 
tagged 1.8 release of the tools has much better fsck performance.

On 02/01/2012 05:25 AM, Mark Hampton wrote:
 We have an application that has many processing threads writing more
 than a billion files ranging from 2KB – 50KB, with 50% under
 8KB (currently there are 700 million files).  The files are never
 deleted or modified – they are written once, and read infrequently.  The
 files are hashed so that they are evenly distributed across ~1,000,000
 subdirectories up to 3 levels deep, with up to 1000 files per
 directory.  The directories are structured like this:

 0/00/00

 0/00/01

 …

 F/FF/FE

 F/FF/FF

 The files need to be readable and writable across a number of
 servers. The NetApp filer we purchased for this project has both NFS and
 iSCSI capabilities.

 We first tried doing this via NFS.  After writing 700 million files (12
 TB) into a single NetApp volume, file-write performance became abysmally
 slow.  We can't create more than 200 files per second on the NetApp
 volume, which is about 20% of our required performance target of 1000
 files per second.  It appears that most of the file-write time is going
 towards stat and inode-create operations.

 So I now I’m trying the same thing with OCFS2 over iSCSI.  I created 16
 luns on the NetApp.  The 16 luns became 16 OCFS2 filesystems with 16
 different mount points on our servers.

 With this configuration I was initially able to write ~1800 files per
 second.  Now that I have completed 100 million files, performance has
 dropped to ~1500 files per second.

 I’m using OEL 6.1 (2.6.32-100 kernel) with OCFS2 version 1.6.  The
 application servers have 128GB of memory.  I created my OCFS2
 filesystems as follows:

 mkfs.ocfs2 –T mail –b 4k –C 4k –L my label --fs-features=indexed-dirs
 –fs-feature-level=max-features /dev/mapper/my device

 And I mount them with these options:

 _netdev,commit=30,noatime,localflocks,localalloc=32

 So my questions are these:


 1) Given a billion files sized 2KB – 50KB, with 50% under 8KB, do I have
 the optimal OCFS2 filesystem and mount-point configurations?


 2) Should I split the files across even more filesystems?  Currently I
 have them split across 16 OCFS2 filesystems.

 Thanks a billion!

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?

2012-02-01 Thread Sunil Mushran
On 02/01/2012 10:24 AM, Mark Hampton wrote:
 Here's what I got from debugfs.ocfs2 -R stats.  I have to type it out
 manually, so I'm only including the features lines:

 Feature Compat: 3 backup-super strict-journal-super
 Feature Incompat: 16208 sparse extended-slotmap inline-data metaecc
 xattr indexed-dirs refcount discontig-bg
 Feature RO compat: 7 unwritten usrquota grpquota


 Some other info that may be interesting:

 Links: 0   Clusters: 52428544


I would disable quotas. That line suggests the vol is 200G is size.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Bad magic number in inode

2012-02-01 Thread Sunil Mushran
inode#11 is in the system directory. fsck cannot fix this automatically.
If the corruption is limited, there is a chance the inodes could be
recreated manually. But do look at backups to restore.

On 02/01/2012 10:20 AM, Werner Flamme wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 when I try to mount an OCFS2 volume, I get

 - ---snip---
 [12212.195823] OCFS2: ERROR (device sde1): ocfs2_validate_inode_block:
 Invalid dinode #11: signature =
 [12212.195825]
 [12212.195827] File system is now read-only due to the potential of
 on-disk corruption. Please run fsck.ocfs2 once the file system is
 unmounted.
 [12212.195832] (mount.ocfs2,9772,0):ocfs2_read_locked_inode:499 ERROR:
 status = -22
 [12212.195842] (mount.ocfs2,9772,0):_ocfs2_get_system_file_inode:158
 ERROR: status = -116
 [12212.195853]
 (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:475 ERROR: status
 = -22
 [12212.195860]
 (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:478 ERROR: Unable
 to load system inode 4, possibly corrupt fs?
 [12212.195862] (mount.ocfs2,9772,0):ocfs2_initialize_super:2379 ERROR:
 status = -22
 [12212.195864] (mount.ocfs2,9772,0):ocfs2_fill_super:1064 ERROR:
 status = -22
 [12212.195869] ocfs2: Unmounting device (8,65) on (node 0)
 - ---pins---

 And doing an fsck, it looks like this:
 - ---snip---
 # fsck.ocfs2 -f  /dev/disk/by-label/ERSATZ
 fsck.ocfs2 1.8.0
 Checking OCFS2 filesystem in /dev/disk/by-label/ERSATZ:
Label:  ERSATZ
UUID:   AEB995484F2D4D19835AA380CAE0683A
Number of blocks:   268434093
Block size: 4096
Number of clusters: 268434093
Cluster size:   4096
Number of slots:40

 /dev/disk/by-label/ERSATZ was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 pass0: Bad magic number in inode reading inode alloc inode 11 for
 verification
 fsck.ocfs2: Bad magic number in inode while performing pass 0
 - ---pins---

 Any chance to access the filesystem other that reformatting it?

 The node ist the only node that can access this volume. I plan to
 share it via iSCSI, but first it must be mountable... There are 3
 other volumes in this cluster, mounted by about a dozen nodes.

 Regards,
 Werner

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware

2012-01-27 Thread Sunil Mushran
Symmetric clustering works best when the nodes are comparable because 
all nodes have to work in sync. NFS may be more suitable for your needs.

On 01/26/2012 05:51 PM, Jorge Adrian Salaices wrote:
 I have been working on trying to convince Mgmt at work that we want to
 go to OCFS2 away from NFS for the sharing of the Application Layer of
 our Oracle EBS (Enterprise Business Suite), and for just general Backup
 Share, but general instability in my setup has dissuaded me to
 recommend it.

 I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK) and
 something as simple as an umount has triggered random Node reboots, even
 on nodes that have Other OCFS2 mounts not shared by the rebooting nodes.
 You see the problem I have is that I have disparate hardware and some of
 these servers are even VM's.

 Several documents state that nodes have to be somewhat equal of power
 and specs and in my case that will never be.
 Unfortunately for me, I have had several other events of random Fencing
 that have been unexplained by common checks.
 i.e. My Network has never been the problem yet one server may see
 another one go away when all of the other services on that node may be
 running perfectly fine. I can only surmise that the reason why that may
 have been is because of an elevated load on the server that starved the
 Heartbeat process preventing it from sending Network packets to other
 nodes.

 My config has about 40 Nodes on it, I have 4 or 5 different shared LUNs
 out of our SAN and not all servers share all Mounts.
 meaning only 10 or 12 share one LUN, 8 or 9 share another and 2 or 3
 share a third, unfortunately the complexity is such that a server may
 intersect with some of the servers but not all.
 perhaps a change in my config to create separate clusters may be the
 solution but only if a node can be part of multiple clusters:

 /node:
 ip_port = 
 ip_address = 172.20.16.151
 number = 1
 name = txri-oprdracdb-1.tomkinsbp.com
 cluster = ocfs2-back

 node:
 ip_port = 
 ip_address = 172.20.16.152
 number = 2
 name = txri-oprdracdb-2.tomkinsbp.com
 cluster = ocfs2-back

 node:
 ip_port = 
 ip_address = 10.30.12.172
 number = 4
 name = txri-util01.tomkinsbp.com
 cluster = ocfs2-util, ocfs2-back
 node:
 ip_port = 
 ip_address = 10.30.12.94
 number = 5
 name = txri-util02.tomkinsbp.com
 cluster = ocfs2-util, ocfs2-back

 cluster:
 node_count = 2
 name = ocfs2-back

 cluster:
 node_count = 2
 name = ocfs2-util
 /
 Is this even Legal, or can it be done some other way ?
 or is this done based on the Different DOMAINS that are created once a
 mount is done .


 How can I make the cluster more stable ? and Why does a node fence
 itself on the cluster even if it does Not have any locks on the shared
 LUN ? It seems to be that the node may be fenceable simply by having
 the OCFS2 services turned ON, without a mount .
 is this correct ?

 Another question I have been having as well is: can the Fencing method
 be other than Panic or restart ? Can a third party or a Userland event
 be triggered to recover from what may be construed by the Heartbeat or
 Network tests as a downed node ?

 Thanks for any of the help you can give me.


 --
 Jorge Adrian Salaices
 Sr. Linux Engineer
 Tomkins Building Products



 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] One node, two clusters?

2011-12-22 Thread Sunil Mushran
You don't need to have two clusters for this. This can be accomplished
with one cluster with the default local heartbeat.

Create one cluster.conf with all the nodes. All nodes, except the one
machine, will mount from just one san. The common node will mount from
both sans.

If you look at the cluster membership, other than the common node,
all nodes will be interacting (network connection, etc.) with nodes that
they can see on the san.

On 12/22/2011 09:40 AM, Werner Flamme wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Kushnir, Michael (NIH/NLM/LHC) [C] [22.12.2011 18:20]:
 Is it possible to have one machine be part of two different ocfs2
 clusters with two different sans? Kind of to serve as a bridge for
 moving data between two clusters but without actually fully
 combining the two clusters?

 Thanks, Michael
 Michael,

 I asked this two years ago and the answer was no.

 When I look at /etc/ocfs2/cluster.conf, I do not see a possibility to
 configure a second cluster. Though the nodes must be assigned to a
 cluster (and exactly one cluster, this is), there ist only one entry
 cluster: in the file, and so there is no way to define a second one.

 We synced via rsync :-(

 HTH
 Werner

 -BEGIN PGP SIGNATURE-
 Version: GnuPG v2.0.18 (GNU/Linux)
 Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

 iEYEARECAAYFAk7za4EACgkQk33Krq8b42MvSwCfQAXzqVQRPyhOdFrKM8PCPqbf
 g0cAn20CV4rjzXNrTa/YGaUeNlO3+rmc
 =CBmQ
 -END PGP SIGNATURE-

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] One node, two clusters?

2011-12-22 Thread Sunil Mushran
On 12/22/2011 10:39 AM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote:
 Is there a separate DLM instance for each ocfs2 volume?

 I have two sub-clusters in the same cluster... A 10 node Hadoop cluster 
 sharing a SATA RAID10 and a Two node web server cluster sharing a SSD RAID0. 
 One server mounts both volumes to move data between as necessary.

 This morning I got the following error (see end of message), and all nodes 
 lost access to all storage. I'm trying to mitigate risk of this happening 
 again.

 My hadoop nodes are used to generate search engine indexes, so they can go 
 down. But my web servers provide the search engine service so I need them to 
 not be tied to my hadoop nodes. I just feel safer that way. At the same time, 
  I need a bridge node to move data between the two. I can do it via NFS or 
 SCP, but I figured it'd be worth while to ask if one node can be in two 
 different clusters.

 Dec 22 09:15:42 lhce-imed-web1 kernel: 
 (updatedb,1832,1):dlm_get_lock_resource:898 
 042F68B6AF134E5C9A9EDF4D7BD7BE99:O0013d2ef94: at least 
 one node (11) to recover before lock mastery can begin


You should add ocfs2 to PRUNEFS in /etc/updatedb.conf. updatedb generates
a lot of io and network traffic. And it will happen around the same time on all 
nodes.

Yes, each volume has a different dlm domain (instance).

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] reflink status

2011-12-17 Thread Sunil Mushran
First we have to get the new syscall added to the kernel.
The first attempt failed because people overloaded the call with
extraneous stuff. Recently there is another attempt to go back
to the original proposal. Hopefully, next kernel release.

The reflink utility should work. So what it is based on an older
coreutils. It is derived from the hard link (ln) utility.

On 12/17/2011 4:15 AM, richard -rw- weinberger wrote:
 Hi!

 What do I need to use reflinks on OCFS2 1.6?

 coreutils 8.4's cp --reflink=always doesn't seem too work.

 I found
 http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog
 and
 http://oss.oracle.com/~smushran/reflink-tools/

 Both contain a patched and outdated coreutils package.

 Are there any plans to merge it upstream?



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] reflink status

2011-12-17 Thread Sunil Mushran
On 12/17/2011 12:05 PM, richard -rw- weinberger wrote:
 The reflink utility should work. So what it is based on an older
 coreutils. It is derived from the hard link (ln) utility.
 So, building it from http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog
 via reflink.spec is the way to go?


For now, yes.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-01 Thread Sunil Mushran
To analyze one needs the logs. And a bugzilla is a good place holder for the 
logs. 

On Dec 1, 2011, at 6:05 PM, Tony Rios t...@tonyrios.com wrote:

 Sunil,
 Is submitting a bug report the only answer?
 I'm happy to send in this information, but can I take the cluster down 
 entirely and sort of reset it so we can get these servers back online and 
 talking again in the meanwhile?
 Tony
 
 On Dec 1, 2011, at 5:05 PM, Sunil Mushran wrote:
 
 Node 3 is joining the domain. It is having problms getting the superblock 
 cluster lock.
 Create a bugzilla on oss.oracle.com and attach the /var/logs/messages from 
 all nodes.
 If you have netconsole setup, attach those logs too.
 
 On 12/01/2011 04:55 PM, Tony Rios wrote:
 I'm having an issue today where I just can't seem to keep all the servers 
 in the cluster online.
 They aren't losing network connectivity and I can ping the iSCSI host just 
 fine and the host is logged in.
 
 These are the errors form the dmesg when I try to mount the filesystem:
 
 root@pedge36:~# dmesg
 [0.00] Initializing cgroup subsys cpuset
 [0.00] Initializing cgroup subsys cpu
 [0.00] Linux version 2.6.38-10-generic (buildd@yellow) (gcc version 
 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4) ) #46-Ubuntu SMP Tue Jun 28 15:07:17 
 UTC 2011 (Ubuntu 2.6.38-10.46-generic 2.6.38.7)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-2.6.38-10-generic 
 root=UUID=3cd859b8-2605-4a38-8767-a6d1f99d53bd ro debug ignore_loglevel
 [0.00] BIOS-provided physical RAM map:
 [0.00]  BIOS-e820:  - 000a (usable)
 [0.00]  BIOS-e820: 0010 - effc (usable)
 [0.00]  BIOS-e820: effc - effcfc00 (ACPI data)
 [0.00]  BIOS-e820: effcfc00 - e000 (reserved)
 [0.00]  BIOS-e820: f000 - f400 (reserved)
 [0.00]  BIOS-e820: fec0 - fed00400 (reserved)
 [0.00]  BIOS-e820: fed13000 - feda (reserved)
 [0.00]  BIOS-e820: fee0 - fee1 (reserved)
 [0.00]  BIOS-e820: ffb0 - 0001 (reserved)
 [0.00]  BIOS-e820: 0001 - 0001e000 (usable)
 [0.00]  BIOS-e820: 0001e000 - 0002 (reserved)
 [0.00]  BIOS-e820: 0002 - 00021000 (usable)
 [0.00] debug: ignoring loglevel setting.
 [0.00] NX (Execute Disable) protection: active
 [0.00] DMI 2.3 present.
 [0.00] DMI: Dell Computer Corporation PowerEdge 850/0Y8628, BIOS 
 A04 08/22/2006
 [0.00] e820 update range:  - 0001 
 (usable) ==  (reserved)
 [0.00] e820 remove range: 000a - 0010 
 (usable)
 [0.00] No AGP bridge found
 [0.00] last_pfn = 0x21 max_arch_pfn = 0x4
 [0.00] MTRR default type: uncachable
 [0.00] MTRR fixed ranges enabled:
 [0.00]   0-9 write-back
 [0.00]   A-B uncachable
 [0.00]   C-CBFFF write-protect
 [0.00]   CC000-EBFFF uncachable
 [0.00]   EC000-F write-protect
 [0.00] MTRR variable ranges enabled:
 [0.00]   0 base 0 mask E write-back
 [0.00]   1 base 2 mask FF000 write-back
 [0.00]   2 base 0F000 mask FF000 uncachable
 [0.00]   3 disabled
 [0.00]   4 disabled
 [0.00]   5 disabled
 [0.00]   6 disabled
 [0.00]   7 disabled
 [0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 
 0x7010600070106
 [0.00] e820 update range: f000 - 0001 
 (usable) ==  (reserved)
 [0.00] last_pfn = 0xeffc0 max_arch_pfn = 0x4
 [0.00] found SMP MP-table at [880fe710] fe710
 [0.00] initial memory mapped : 0 - 2000
 [0.00] init_memory_mapping: -effc
 [0.00]  00 - 00efe0 page 2M
 [0.00]  00efe0 - 00effc page 4k
 [0.00] kernel direct mapping tables up to effc @ 
 1fffa000-2000
 [0.00] init_memory_mapping: 0001-00021000
 [0.00]  01 - 021000 page 2M
 [0.00] kernel direct mapping tables up to 21000 @ 
 effb6000-effc
 [0.00] RAMDISK: 366d - 3736
 [0.00] ACPI: RSDP 000fd160 00014 (v00 DELL  )
 [0.00] ACPI: RSDT 000fd174 00038 (v01 DELL   PE850
 0001 MSFT 010A)
 [0.00] ACPI: FACP 000fd1b8 00074 (v01 DELL   PE850
 0001 MSFT 010A)
 [0.00] ACPI: DSDT effc 01C19 (v01 DELL   PE830
 0001 MSFT 010E)
 [0.00] ACPI: FACS effcfc00 00040
 [0.00] ACPI: APIC 000fd22c 00074 (v01 DELL   PE850
 0001 MSFT 010A)
 [0.00] ACPI: SPCR

Re: [Ocfs2-users] Monitoring progress of fsck.ocfs2

2011-11-18 Thread Sunil Mushran
Do:
cat /proc/PID/stack

It is probably stuck in the block layer.

On 11/18/2011 08:33 AM, Nick Khamis wrote:
 Hello Everyone,

 I just ran fsck.ocfs2 on /dev/drbd0 which is a one gig partition on a
 vm with limited resource (100meg of ram).
 I am worried that the process crashed because it has not responded in
 the past hour or so?

 fsck.ocfs2 /dev/drbd0
 fsck.ocfs2 1.6.4
 [RECOVER_CLUSTER_INFO] The running cluster is using the cman stack
 with the cluster name ASTCluster, but the filesystem is configured for
 the classic o2cb stack.  Thus, fsck.ocfs2 cannot determine whether the
 filesystem is in use.  fsck.ocfs2 can reconfigure the filesystem to
 use the currently running cluster configuration.  DANGER: YOU MUST BE
 ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE
 MODIFYING ITS CLUSTER CONFIGURATION.  Recover cluster configuration
 information the running cluster?n  y


 ps -uroot
 8040 pts/000:00:00 fsck.ocfs2


 I want to mention that I did issue a ctrl+c and ctrl+x when I paniced.
 But I do not think anything happened.

 Nick

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Number of Nodes defined

2011-11-17 Thread Sunil Mushran
It must be the same fragmentation issue that we've addressed in 1.6 and later.
Is this 1.4?

On 11/17/2011 08:45 AM, David wrote:
 Sunil, et al,

 The reason I needed to make this changed was because the ocfs2 partition, 
 which is 101G in size with 41G currently in use ran out of disk space even 
 though the OS was reporting 60G available.

 I had this issue once before and found that the node slot of that cluster was 
 set to 4 even though there were only 2 nodes in the cluster.  When i reduced 
 the node slots to 2 disk space was freed up.

 I made these changes to this cluster; reduced the node slots to 2 and 
 everything worked until this morning when the same error returned No space 
 left on device.

 The OS is still showing available disk space but as the error suggests i 
 can't write to the partition.

 Any idea what could be happening?

 On 11/16/2011 05:45 PM, Sunil Mushran wrote:
 Reducing node-slots frees up the journal and distributes the metadata
 that that slot was tracking to the remaining slots. I am not aware of
 any reason why there should be an impact.

 On 11/16/2011 03:07 PM, David wrote:
 I did read the man page for tunefs.ocfs2 but I didn't see anything 
 indicating what the impact to the fs would be when making a change to an 
 existing fs such as reducing the node slots.

 Anyway, thank you for the feedback, I was able to make the changes with no 
 impact to the fs.

 David

 On 11/16/2011 12:12 PM, Sunil Mushran wrote:
 man tunefs.ocfs2

 It cannot be done in an active cluster. But it can be done without having 
 to
 reformat the volume.

 On 11/16/2011 10:08 AM, David wrote:
 I wasn't able to find any documentation that answers whether or not the
 number of nodes defined for a cluster,  can be reduced on an active
 cluster as seen via:

 tunefs.ocfs2 -Q %B %T %N\n

 Does anyone know if this can be done, or do I have to copy the data off
 of the fs, make the changes, reformat the fs and copy the data back?

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs

2011-11-16 Thread Sunil Mushran
fstype is a handy way to format the volume with parameters that are thought
to be useful for that use-case. The result of this is printed during format by
way of the parameters selected. man mkfs.ocfs2 has a blurb about the features
it enabled by default.

On 11/16/2011 08:45 AM, Artur Baruchi wrote:
 Hi.

 I tried to find some information about the option vmstore when
 formating a device, but didnt found anything about it (no
 documentation, I did some greps inside the source code, but nothing
 returned). My doubts about this:

 - What kind of optimization this option creates in my file system to
 store vm images? I mean.. what does exactly this option do?
 - Where, in source code, I can find the part that makes this optimization?

 Thanks in advance.

 Att.
 Artur Baruchi



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs

2011-11-16 Thread Sunil Mushran
Yes. But this is just the features. It also selects the appropriate cluster 
size, block size,
journal size, etc. All the params selected are printed by mkfs. You also have 
the option of
running with the --dry-option to see the params.

On 11/16/2011 09:41 AM, Artur Baruchi wrote:
 I just found this:

 + {OCFS2_FEATURE_COMPAT_BACKUP_SB | OCFS2_FEATURE_COMPAT_JBD2_SB,
 +  OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC |
 +  OCFS2_FEATURE_INCOMPAT_INLINE_DATA |
 +  OCFS2_FEATURE_INCOMPAT_XATTR |
 +  OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE,
 +  OCFS2_FEATURE_RO_COMPAT_UNWRITTEN},  /* FS_VMSTORE */

 These options are the ones that, when choosing for vmstore, are
 enabled by default. Is this correct?

 Thanks.

 Att.
 Artur Baruchi



 On Wed, Nov 16, 2011 at 3:26 PM, Sunil Mushransunil.mush...@oracle.com  
 wrote:
 fstype is a handy way to format the volume with parameters that are thought
 to be useful for that use-case. The result of this is printed during format
 by
 way of the parameters selected. man mkfs.ocfs2 has a blurb about the
 features
 it enabled by default.

 On 11/16/2011 08:45 AM, Artur Baruchi wrote:
 Hi.

 I tried to find some information about the option vmstore when
 formating a device, but didnt found anything about it (no
 documentation, I did some greps inside the source code, but nothing
 returned). My doubts about this:

 - What kind of optimization this option creates in my file system to
 store vm images? I mean.. what does exactly this option do?
 - Where, in source code, I can find the part that makes this optimization?

 Thanks in advance.

 Att.
 Artur Baruchi




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Number of Nodes defined

2011-11-16 Thread Sunil Mushran
man tunefs.ocfs2

It cannot be done in an active cluster. But it can be done without having to
reformat the volume.

On 11/16/2011 10:08 AM, David wrote:
 I wasn't able to find any documentation that answers whether or not the
 number of nodes defined for a cluster,  can be reduced on an active
 cluster as seen via:

 tunefs.ocfs2 -Q %B %T %N\n

 Does anyone know if this can be done, or do I have to copy the data off
 of the fs, make the changes, reformat the fs and copy the data back?

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Number of Nodes defined

2011-11-16 Thread Sunil Mushran
Reducing node-slots frees up the journal and distributes the metadata
that that slot was tracking to the remaining slots. I am not aware of
any reason why there should be an impact.

On 11/16/2011 03:07 PM, David wrote:
 I did read the man page for tunefs.ocfs2 but I didn't see anything indicating 
 what the impact to the fs would be when making a change to an existing fs 
 such as reducing the node slots.

 Anyway, thank you for the feedback, I was able to make the changes with no 
 impact to the fs.

 David

 On 11/16/2011 12:12 PM, Sunil Mushran wrote:
 man tunefs.ocfs2

 It cannot be done in an active cluster. But it can be done without having to
 reformat the volume.

 On 11/16/2011 10:08 AM, David wrote:
 I wasn't able to find any documentation that answers whether or not the
 number of nodes defined for a cluster,  can be reduced on an active
 cluster as seen via:

 tunefs.ocfs2 -Q %B %T %N\n

 Does anyone know if this can be done, or do I have to copy the data off
 of the fs, make the changes, reformat the fs and copy the data back?

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm locking

2011-11-14 Thread Sunil Mushran
o2image is only useful for debugging. It allows us to get a copy of the file 
system
on which we can test fsck inhouse. The files in lost+found have to be resolved
manually. If they are junk, delete them. If useful, move it to another 
directory.

On 11/11/2011 05:36 PM, Nick Khamis wrote:
 All Fixed!

 Just a few questions. Is there any documentation on howto diagnose on
 ocfs2 filesystem:
 * How to transfer an image file for testing onto a different machine.
 As you did with o2image.out
 * Does fsck.ocfs2 -fy /dev/loop0 pretty much fix all the common problems
 * What can I do with the files in lost+found

 Thanks Again,

 Nick.

 On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushransunil.mush...@oracle.com  
 wrote:
 So it detected one cluster that was doubly allocated. It fixed it.
 Details below. The other fixes could be because the o2image was
 taken on a live volume.

 As to how this could happen... I would look at the storage.


 # fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Duplicate clusters detected.  Pass 1b will be run
 Running additional passes to resolve clusters claimed by more than one
 inode...
 Pass 1b: Determining ownership of multiply-claimed clusters
 Pass 1c: Determining the names of inodes owning multiply-claimed clusters
 Pass 1d: Reconciling multiply-claimed clusters
 Cluster 161335 is claimed by the following inodes:
   /asterisk/extensions.conf
   /moh/macroform-cold_day.wav
 [DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /asterisk/extensions.conf to break claims on clusters it shares with other
 inodes? y
 [DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or
 deleted to break the claim it has on its clusters. Clone inode
 /moh/macroform-cold_day.wav to break claims on clusters it shares with
 other inodes? y
 Pass 2: Checking directory entries.
 [DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode
 number 35348 which isn't allocated, clear the entry? y
 Pass 3: Checking directory connectivity.
 [LOSTFOUND_MISSING] /lost+found does not exist.  Create it so that we can
 possibly fill it with orphaned inodes? y
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 [INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry
 references come to 2. Update the count on disk to match? y
 [INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries.
   Move it to lost+found? y
 [INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries.
   Move it to lost+found? y
 All passes succeeded.
 Slot 0's journal dirty flag removed
 Slot 1's journal dirty flag removed


 [root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0
 fsck.ocfs2 1.6.3
 Checking OCFS2 filesystem in /dev/loop0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/loop0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Pass 2: Checking directory entries.
 Pass 3: Checking directory connectivity.
 Pass 4a: checking for orphaned inodes
 Pass 4b: Checking inodes link counts.
 All passes succeeded.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 and db_block_size

2011-11-14 Thread Sunil Mushran
We talk about this in the user's guide.
1. Always use 4K blocksize.
2. Never set the cluster size less than the database block size.

Having a smaller cluster size could mean that a db block may not be contiguous.
And you don't want that for performance and other reasons. Having a still larger
cluster size is an easy way to ensure the files are contiguous. Contiguity can 
only
help perf.

On 11/14/2011 03:35 PM, Pravin K Patil wrote:
 Hi All,
 Is there a benchmark study done different block sizes of ocfs2 and 
 corrosponding db_block_size and its impact on read / write?
 Similar way s there any study done for cluster size of ocfs2 and 
 corrosponding db_block_size and its impact on read / write?
 For example if the db_block_size is 8K and if we have ocfs2 cluster size as 
 4K will it have any performance impact or in other words, if we make cluster 
 size of file systems on which data files are located as 8K will it improve 
 performance? if so is it for read or write?
 Looking for actual expereince on the settings of ocfs2 block size, cluster 
 size and db_block_size corelation.

 Regards,
 Pravin



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
Do:

fsck.ocfs2 -f /dev/...

Without -f, it only replays the journal.

On 11/09/2011 05:49 PM, Nick Khamis wrote:
 Hello Sunil,

 This is only on the protoype so it's not crucial however, it would be
 nice to figure out why for
 future reference:

 fsck.ocfs2 /dev/drbd0
 fsck.ocfs2 1.6.4
 Checking OCFS2 filesystem in /dev/drbd0:
   Label:  AsteriskServer
   UUID:   3A791AB36DED41008E58CEF52EBEEFD3
   Number of blocks:   592384
   Block size: 4096
   Number of clusters: 592384
   Cluster size:   4096
   Number of slots:2

 /dev/drbd0 is clean.  It will be checked after 20 additional mounts.

 I can mount it and write to it just fine (read and write). It's just
 when I start the application that reads from the filesystem
 (I don't think there is any writing going on), that it goes into read
 mode... It use to work, other than the update to 1.6.4 I am not sure
 what I have changed..

 Not quite sure what kind of information you would need to help figure
 out the problem?

 Cheers,

 Nick.

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm locking

2011-11-10 Thread Sunil Mushran
The ro issue was different. It appears the volume has more problems.
If you want to me to look at the issue, I'll need the image of the volume.
# o2image /dev/device  /tmp/o2image.out

On 11/10/2011 01:55 PM, Nick Khamis wrote:
 Hello Sunil,

 Thank you so much for your time, and I do not want to take any more
 of it. I ran fsck with -f and have the following:

 fsck.ocfs2 -f /dev/drbd0
 fsck.ocfs2 1.6.4
 Checking OCFS2 filesystem in /dev/drbd0:
Label:  ASTServer
UUID:   3A791AB36DED41008E58CEF52EBEEFD3
Number of blocks:   592384
Block size: 4096
Number of clusters: 592384
Cluster size:   4096
Number of slots:2

 /dev/drbd0 was run with -f, check forced.
 Pass 0a: Checking cluster allocation chains
 Pass 0b: Checking inode allocation chains
 Pass 0c: Checking extent block allocation chains
 Pass 1: Checking inodes and blocks.
 Duplicate clusters detected.  Pass 1b will be run
 Running additional passes to resolve clusters claimed by more than one 
 inode...
 Pass 1b: Determining ownership of multiply-claimed clusters
 pass1b: Inode type does not contain extents while processing inode 5
 fsck.ocfs2: Inode type does not contain extents while performing pass 1

 Not sure if the read-only is due to the detected duplicate?

 Thanks in Advance,

 Nick.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] mixing ocfs2 versions in a cluster

2011-11-09 Thread Sunil Mushran
I would recommend upgrading all the nodes to 1.2.9 as it contains fixes
to known bugs in the versions you are running. Mixing versions is never
recommended mainly because it is hard to test all possible combinations.
It is alright to do so on an interim basis. But never recommended as a
stable setup.

On 11/09/2011 10:53 AM, Shashank wrote:
 Can you mix ocfs2 versions in a cluster?

 Eg. I have 4 nodes in a cluster. two nodes with version 1.2.7.-1el4
 and the other two with 1.2.5-6.

 Thanks,
 Vik



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm locking

2011-11-09 Thread Sunil Mushran
This has nothing to do with the dlm. The error states that the fs encountered
a bad inode on disk. Possible disk corruption. On encountering the fs goes 
readonly
and asks the user to run fsck.

On 11/09/2011 11:51 AM, Nick Khamis wrote:
 Hello Everyone,

 For the first time I eoerienced a dlm lock:

 [ 9721.831813] OCFS2 DLM 1.5.0
 [ 9721.917032] ocfs2: Registered cluster interface o2cb
 [ 9722.170848] OCFS2 DLMFS 1.5.0
 [ 9722.179018] OCFS2 User DLM kernel interface loaded
 [ 9755.743195] ocfs2_dlm: Nodes in domain
 (3A791AB36DED41008E58CEF52EBEEFD3): 1
 [ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with
 ordered data mode.
 [ 9783.240424] block drbd0: Handshake successful: Agreed network
 protocol version 91
 [ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
 [ 9783.243074] block drbd0: conn( WFConnection -  WFReportParams )
 [ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver 
 [4390])
 [ 9783.271014] block drbd0: data-integrity-alg:not-used
 [ 9783.271298] block drbd0: drbd_sync_handshake:
 [ 9783.271318] block drbd0: self
 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705
 bits:3 flags:0
 [ 9783.271342] block drbd0: peer
 B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705
 bits:0 flags:0
 [ 9783.271364] block drbd0: uuid_compare()=100 by rule 90
 [ 9783.271380] block drbd0: Split-Brain detected, 1 primaries,
 automatically solved. Sync from this node
 [ 9783.271417] block drbd0: peer( Unknown -  Secondary ) conn(
 WFReportParams -  WFBitMapS )
 [ 9783.399967] block drbd0: peer( Secondary -  Primary )
 [ 9783.515979] block drbd0: conn( WFBitMapS -  SyncSource ) pdsk(
 Outdated -  Inconsistent )
 [ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12
 KB [3 bits set]).
 [ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
 [ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec)
 [ 9783.799956] block drbd0: conn( SyncSource -  Connected ) pdsk(
 Inconsistent -  UpToDate )
 [ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2)
 at 192.168.2.111:
 [ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3
 [ 9800.231668] ocfs2_dlm: Nodes in domain
 (3A791AB36DED41008E58CEF52EBEEFD3): 1 2
 [ 9861.922744] OCFS2: ERROR (device drbd0):
 ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not
 set
 [ 9861.922767]
 [ 9861.927278] File system is now read-only due to the potential of
 on-disk corruption. Please run fsck.ocfs2 once the file system is
 unmounted.
 [ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22

 Not sure where to start, but with your appreciated help I am sure we
 can get it resolved.

 Thanks in Advance,

 Nick.

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-28 Thread Sunil Mushran
On 10/27/2011 07:10 PM, Tim Serong wrote:
 Damn.  It was in Pacemaker's include/crm/ais.h, back before June 27 last 
 year(!), when it was moved to Pacemaker's configure.ac:

 https://github.com/ClusterLabs/pacemaker/commit/8e939b0ad779c65d445e2fa150df1cc046428a93#include/crm/ais.h

 This means it probably no longer appears in any of Pacemaker's public (devel 
 package) header files, which explains the compile error.

 I did some more digging, and we (SUSE) presumably never had this problem 
 because we've been carrying the attached patch for rather a long time. It 
 replaces CRM_SERVICE (a relatively uninteresting number) with a somewhat more 
 useful string literal...


 I thought the O2CB OCF RA was always provided by either pacemaker (or,
 on SUSE at least, in ocfs2-tools), but was never included in the
 upstream ocfs2-tools source tree?


 I thought we had checked-in all the pacemaker related patches. Are we
 missing something?

 The O2CB OCF RA is this thing:

 https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/o2cb

 It's the (better/stronger/faster :)) equivalent of the o2cb init script, 
 which you use when OCFS2 is under Pacemaker's control.

 There's (IMO) a good argument for having OCF RAs included with the project 
 they're intended for use with (all code pertaining to the operation of some 
 program lives in one place).

 OTOH, there's another argument for having them included in the generic 
 resource-agents or pacemaker package (Pacemaker and RHCS probably being the 
 only things that actually use OCF RAs).

 I suspect the RA was either never submitted to ocfs2-tools, or was never 
 accepted (don't know which, I wasn't involved when it was originally written).

So I am checking in the patch with your sign-off. I hope that is ok with you.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
ocfs2-tools-1.4.4 is too old. Build 1.6.4. The source tarball is on 
oss.oracle.com.

On 10/27/2011 12:45 PM, Nick Khamis wrote:
 Hello Everyone,

 I am building ocfs2-tools from source. Modified
 /ocfs2_controld/Makefile to point to the correct pacemaker 1.1.6
 headers:

 PCMK_INCLUDES = -I/usr/include/pacemaker -I/usr/include/heartbeat
 -I/usr/include/libxml2 $(GLIB_CFLAGS)

 However, for some reason I am getting:

 setup_stack:
 pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this 
 function)
 pacemaker.c:158: error: (Each undeclared identifier is reported only once
 pacemaker.c:158: error: for each function it appears in.)
 make[1]: *** [pacemaker.o] Error 1
 make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.4.4/ocfs2_controld'

 The config I am using:

 ./configure --sbindir=/sbin --bin=/bin --libdir=/usr/lib
 --sysconfdir=/etc --datadir=/etc/ocfs2 --sharedstatedir=/var/ocfs2
 --libexecdir=/usr/libexec --localstatedir=/var --mandir=/usr/man
 --enable-dynamic-fsck --enable-dynamic-ctl


 Thanks in Advance,

 Nick.

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
I don't remember that resource. If it did exist, it would have
existed in pacemaker. ocfs2-tools does not carry any pacemaker
bits. It carries bits that allows it to work with pacemaker  cman.

On 10/27/2011 02:27 PM, Nick Khamis wrote:
 Hello Sunil,

 Thank you so much for your response. I just downloaded 1.6. And had to
 add the following to pacemaker.c:

 #define PCMK_SERVICE_ID 9
 line 158: log_error(Connection to our AIS plugin (%d) failed,
 PCMK_SERVICE_ID);

 to avoid.

 pacemaker.c: In function setup_stack:
 pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this 
 function)
 pacemaker.c:158: error: (Each undeclared identifier is reported only once
 pacemaker.c:158: error: for each function it appears in.)
 make[1]: *** [pacemaker.o] Error 1
 make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.6.4/ocfs2_controld'
 make: *** [ocfs2_controld] Error 2

 Not sure if that was the right thing to do?

 On a slightly unreallated. There use to be pacemaker ocf resource
 agent script included for o2cb o2cb.ocf.
 I take it this is now only provided by pacemaker?

 Cheers,

 Nick.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Error building ocfs2-tools

2011-10-27 Thread Sunil Mushran
On 10/27/2011 05:26 PM, Tim Serong wrote:
 That ought to work...  But where did PCMK_SERVICE_ID come from in that 
 context?  AFAICT it's always been CRM_SERVICE there.  See current head:

 http://oss.oracle.com/git/?p=ocfs2-tools.git;a=blob;f=ocfs2_controld/pacemaker.c;hb=HEAD#l158

 CRM_SERVICE is then mapped back to PCMK_SERVICE_ID in pacemaker's 
 include/crm/ais.h:

 https://github.com/ClusterLabs/pacemaker/blob/master/include/crm/ais.h#L54


Where is PCMK_SERVICE_ID defined? This qs has come up more than once.


 I thought the O2CB OCF RA was always provided by either pacemaker (or, on 
 SUSE at least, in ocfs2-tools), but was never included in the upstream 
 ocfs2-tools source tree?


I thought we had checked-in all the pacemaker related patches. Are we missing 
something?

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-23 Thread Sunil Mushran

I think it stops by uuid. So try doing this the next time.
You are encountering some issue that we have not seen before.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2

On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:

Hi Sunil,
Sorry for my late reply, i just had time today to start from scratch 
and test.
I rebuilt my environment(2 nodes connected to a SAN via 
iSCSI+multipath). I still have the issue that the heartbeat is active 
after I umount my ocfs2 volume.

/etc/init.d/o2cb stop
Stopping O2CB cluster CLUST: Failed
Unable to stop cluster as heartbeat region still active

ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs

After i manually kill the ref (ocfs2_hb_ctl -K -d 
/dev/mapper/volgr1-lvol0 ocfs2 ) i can stop successfully o2cb. I can 
live with that but why doesn't it stop automatically? As i understand, 
hearbeat should be started and stopped once the volume gets 
mounted/umounted.


br,
Laurentiu.

On 10/19/2011 02:28, Sunil Mushran wrote:

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
OK, i rebooted one of the nodes(both had similar issues); . But 
something is still fishy.

- i mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- i unmount it: umount /mnt/tmp/
- tried to stop o2cb:  /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root0 Oct 19 01:50 
0C4AB55FE9314FA5A9F81652FDB9B22D

-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- i cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/


PS: i'm going to sleep now, i have to be up in a few hours. We can 
continue tomorrow if it's ok with you.

Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fail
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
Yes, i did reformat it(even more than once i think, last week). 
This is a pre-production system and i'm trying various options 
before moving into real life.



On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf 
/sys/kernel/config/cluster/.../heartbeat/*0C4AB55FE9314FA5A9F81652FDB9B22D 
*


On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping 
heartbeat


No improvment :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
DeviceFS Stack  
UUID  Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   
0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2


mounted.ocfs2 -f
DeviceFS Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/md-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2

Re: [Ocfs2-users] OCFS2 slow with multiple writes

2011-10-21 Thread Sunil Mushran
Because in this case the cluster lock may be waiting for the journal
commit to complete. It depends on where the file is being created,
what internal metadata blocks need to be locked, etc. Your dd is not
a simple write. It is a create + allocation + write. If the file already
exists, then the data extents will first be truncated too.

On 10/21/2011 03:27 AM, Prakash Velayutham wrote:
 Hi Sunil,

 Thanks for the response. Do you mean OCFS2 is blocking writes from multiple 
 clients? Is that how OCFS2 works? I can understand that writing the (2) 20G 
 files might take longer with ordered option as data needs to be flushed to 
 the FS before journal commit, but why is that blocking a new separate file 
 from being written to the file system?

 Regards,
 Prakash

 On Oct 20, 2011, at 6:25 PM, Sunil Mushran wrote:

 Use writeback. Ordered data requires the data to be flushed
 before journal commit. And flushing 40G takes time.

 mount -t data=writeback DEVICE PATH

 On 10/20/2011 03:05 PM, Prakash Velayutham wrote:
 Hi,

 OS - SLES 11.1 with HAE
 OCFS2 - 1.4.3-0.16.7
 Cluster stack - Pacemaker

 I have Heartbeat Filesystem monitor that monitors the OCFS2 file system for 
 availability. This monitor kicks in every minute and tries to write a file 
 using dd as below.

 dd of=/var/lib/mysql/data1/.Filesystem_status/default_bmimysqlp3 
 oflag=direct,sync bs=512 conv=fsync,sync

 If the OCFS2 file system is busy, like when I try to create 2 large files 
 (20GB each) in the OCFS2 directory, I see that the above monitor process 
 hangs until the 2 files are created. But this causes Pacemaker to fence the 
 node as the RA is configured for a timeout of 45secs and the 2 file 
 creations do take more than that. The OCFS2 file system is mounted as below.

 /dev/mapper/bmimysqlp3_p4_vol1 on /var/lib/mysql/data1 type ocfs2 
 (rw,_netdev,nointr,data=ordered,cluster_stack=pcmk)

 Is there something wrong with the file system itself that a small file 
 creation hangs like that? Please let me know if you need any more 
 information.

 Thanks,
 Prakash


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 slow with multiple writes

2011-10-20 Thread Sunil Mushran
Use writeback. Ordered data requires the data to be flushed
before journal commit. And flushing 40G takes time.

mount -t data=writeback DEVICE PATH

On 10/20/2011 03:05 PM, Prakash Velayutham wrote:
 Hi,

 OS - SLES 11.1 with HAE
 OCFS2 - 1.4.3-0.16.7
 Cluster stack - Pacemaker

 I have Heartbeat Filesystem monitor that monitors the OCFS2 file system for 
 availability. This monitor kicks in every minute and tries to write a file 
 using dd as below.

 dd of=/var/lib/mysql/data1/.Filesystem_status/default_bmimysqlp3 
 oflag=direct,sync bs=512 conv=fsync,sync

 If the OCFS2 file system is busy, like when I try to create 2 large files 
 (20GB each) in the OCFS2 directory, I see that the above monitor process 
 hangs until the 2 files are created. But this causes Pacemaker to fence the 
 node as the RA is configured for a timeout of 45secs and the 2 file creations 
 do take more than that. The OCFS2 file system is mounted as below.

 /dev/mapper/bmimysqlp3_p4_vol1 on /var/lib/mysql/data1 type ocfs2 
 (rw,_netdev,nointr,data=ordered,cluster_stack=pcmk)

 Is there something wrong with the file system itself that a small file 
 creation hangs like that? Please let me know if you need any more information.

 Thanks,
 Prakash
 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
ls -lR /sys/kernel/config/cluster

What does this return?

On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
What does this return?
cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm

On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
 Again   the outputs:
  cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
 dm-2
 ---here should be volgr1-lvol0 i guess?

 ls -lR /sys/kernel/debug/ocfs2
 ls: /sys/kernel/debug/ocfs2: No such file or directory

 ls -lR /sys/kernel/debug/o2dlm
 ls: /sys/kernel/debug/o2dlm: No such file or directory

 I think i have to enable debug first somehow..?

 Laurentiu.

 On 10/19/2011 00:17, Sunil Mushran wrote:
 What does this return?
 cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

 Also, do:
 ls -lR /sys/kernel/debug/ocfs2
 ls -lR /sys/kernel/debug/o2dlm

 On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users






___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
 ls -lR /sys/kernel/debug/ocfs2
 /sys/kernel/debug/ocfs2:
 total 0

 ls -lR /sys/kernel/debug/o2dlm
 /sys/kernel/debug/o2dlm:
 total 0

 ocfs2_hb_ctl -I -d /dev/dm-2
 ocfs2_hb_ctl: Device name specified was not found while reading uuid

 There is no /dev/dm-2 mounted.


 On 10/19/2011 00:27, Sunil Mushran wrote:
 mount -t debugfs debugfs /sys/kernel/debug

 Then list that dir.

 Also, do:
 ocfs2_hb_ctl -l -d /dev/dm-2

 Be careful before killing. We want to be sure that dev is not mounted.

 On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
 Again   the outputs:
  cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
 dm-2
 ---here should be volgr1-lvol0 i guess?

 ls -lR /sys/kernel/debug/ocfs2
 ls: /sys/kernel/debug/ocfs2: No such file or directory

 ls -lR /sys/kernel/debug/o2dlm
 ls: /sys/kernel/debug/o2dlm: No such file or directory

 I think i have to enable debug first somehow..?

 Laurentiu.

 On 10/19/2011 00:17, Sunil Mushran wrote:
 What does this return?
 cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

 Also, do:
 ls -lR /sys/kernel/debug/ocfs2
 ls -lR /sys/kernel/debug/o2dlm

 On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users








___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
 mounted.ocfs2 -d
 DeviceFS Stack  UUID  Label
 /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  
 ocfs2

 mounted.ocfs2 -f
 DeviceFS Nodes
 /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

 ro02xsrv001 = the other node in the cluster.

 By the way, there is no /dev/md-2
  ls /dev/dm-*
 /dev/dm-0  /dev/dm-1


 On 10/19/2011 00:37, Sunil Mushran wrote:
 So it is not mounted. But we still have a hb thread because
 hb could not be stopped during umount. The reason for that
 could be the same that causes ocfs2_hb_ctl to fail.

 Do:
 mounted.ocfs2 -d

 On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
 ls -lR /sys/kernel/debug/ocfs2
 /sys/kernel/debug/ocfs2:
 total 0

 ls -lR /sys/kernel/debug/o2dlm
 /sys/kernel/debug/o2dlm:
 total 0

 ocfs2_hb_ctl -I -d /dev/dm-2
 ocfs2_hb_ctl: Device name specified was not found while reading uuid

 There is no /dev/dm-2 mounted.


 On 10/19/2011 00:27, Sunil Mushran wrote:
 mount -t debugfs debugfs /sys/kernel/debug

 Then list that dir.

 Also, do:
 ocfs2_hb_ctl -l -d /dev/dm-2

 Be careful before killing. We want to be sure that dev is not mounted.

 On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
 Again   the outputs:
  cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
 dm-2
 ---here should be volgr1-lvol0 i guess?

 ls -lR /sys/kernel/debug/ocfs2
 ls: /sys/kernel/debug/ocfs2: No such file or directory

 ls -lR /sys/kernel/debug/o2dlm
 ls: /sys/kernel/debug/o2dlm: No such file or directory

 I think i have to enable debug first somehow..?

 Laurentiu.

 On 10/19/2011 00:17, Sunil Mushran wrote:
 What does this return?
 cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

 Also, do:
 ls -lR /sys/kernel/debug/ocfs2
 ls -lR /sys/kernel/debug/o2dlm

 On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 
 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb 
 stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the 
 heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after 
 finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users










___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
 ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


 On 10/19/2011 00:43, Sunil Mushran wrote:
 ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

 On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
 mounted.ocfs2 -d
 DeviceFS Stack  UUID  Label
 /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  
 ocfs2

 mounted.ocfs2 -f
 DeviceFS Nodes
 /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

 ro02xsrv001 = the other node in the cluster.

 By the way, there is no /dev/md-2
  ls /dev/dm-*
 /dev/dm-0  /dev/dm-1


 On 10/19/2011 00:37, Sunil Mushran wrote:
 So it is not mounted. But we still have a hb thread because
 hb could not be stopped during umount. The reason for that
 could be the same that causes ocfs2_hb_ctl to fail.

 Do:
 mounted.ocfs2 -d

 On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
 ls -lR /sys/kernel/debug/ocfs2
 /sys/kernel/debug/ocfs2:
 total 0

 ls -lR /sys/kernel/debug/o2dlm
 /sys/kernel/debug/o2dlm:
 total 0

 ocfs2_hb_ctl -I -d /dev/dm-2
 ocfs2_hb_ctl: Device name specified was not found while reading uuid

 There is no /dev/dm-2 mounted.


 On 10/19/2011 00:27, Sunil Mushran wrote:
 mount -t debugfs debugfs /sys/kernel/debug

 Then list that dir.

 Also, do:
 ocfs2_hb_ctl -l -d /dev/dm-2

 Be careful before killing. We want to be sure that dev is not mounted.

 On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
 Again   the outputs:
  cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
 dm-2
 ---here should be volgr1-lvol0 i guess?

 ls -lR /sys/kernel/debug/ocfs2
 ls: /sys/kernel/debug/ocfs2: No such file or directory

 ls -lR /sys/kernel/debug/o2dlm
 ls: /sys/kernel/debug/o2dlm: No such file or directory

 I think i have to enable debug first somehow..?

 Laurentiu.

 On 10/19/2011 00:17, Sunil Mushran wrote:
 What does this return?
 cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

 Also, do:
 ls -lR /sys/kernel/debug/ocfs2
 ls -lR /sys/kernel/debug/o2dlm

 On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 
 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run /etc/init.d/o2cb 
 stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the 
 heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after 
 finding
 the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 
 ).
 But even if refs number is set to zero the heartbeat region still
 active occurs.
 How can i fix this?

 Thank you in advance.
 Laurentiu.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran
Let's do it by hand.
rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
 ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

 No improvment :(


 On 10/19/2011 00:50, Sunil Mushran wrote:
 See if this cleans it up.
 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

 On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
 ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


 On 10/19/2011 00:43, Sunil Mushran wrote:
 ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

 On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
 mounted.ocfs2 -d
 DeviceFS Stack  UUID  
 Label
 /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  
 ocfs2

 mounted.ocfs2 -f
 DeviceFS Nodes
 /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

 ro02xsrv001 = the other node in the cluster.

 By the way, there is no /dev/md-2
  ls /dev/dm-*
 /dev/dm-0  /dev/dm-1


 On 10/19/2011 00:37, Sunil Mushran wrote:
 So it is not mounted. But we still have a hb thread because
 hb could not be stopped during umount. The reason for that
 could be the same that causes ocfs2_hb_ctl to fail.

 Do:
 mounted.ocfs2 -d

 On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
 ls -lR /sys/kernel/debug/ocfs2
 /sys/kernel/debug/ocfs2:
 total 0

 ls -lR /sys/kernel/debug/o2dlm
 /sys/kernel/debug/o2dlm:
 total 0

 ocfs2_hb_ctl -I -d /dev/dm-2
 ocfs2_hb_ctl: Device name specified was not found while reading uuid

 There is no /dev/dm-2 mounted.


 On 10/19/2011 00:27, Sunil Mushran wrote:
 mount -t debugfs debugfs /sys/kernel/debug

 Then list that dir.

 Also, do:
 ocfs2_hb_ctl -l -d /dev/dm-2

 Be careful before killing. We want to be sure that dev is not mounted.

 On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
 Again   the outputs:
  cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
 dm-2
 ---here should be volgr1-lvol0 i guess?

 ls -lR /sys/kernel/debug/ocfs2
 ls: /sys/kernel/debug/ocfs2: No such file or directory

 ls -lR /sys/kernel/debug/o2dlm
 ls: /sys/kernel/debug/o2dlm: No such file or directory

 I think i have to enable debug first somehow..?

 Laurentiu.

 On 10/19/2011 00:17, Sunil Mushran wrote:
 What does this return?
 cat 
 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

 Also, do:
 ls -lR /sys/kernel/debug/ocfs2
 ls -lR /sys/kernel/debug/o2dlm

 On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
 Here is the output:

 ls -lR /sys/kernel/config/cluster
 /sys/kernel/config/cluster:
 total 0
 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

 /sys/kernel/config/cluster/CLUSTER:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
 drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
 drwxr-xr-x 4 root root0 Oct 11 20:23 node
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

 /sys/kernel/config/cluster/CLUSTER/heartbeat:
 total 0
 drwxr-xr-x 2 root root0 Oct 19 00:12 
 918673F06F8F4ED188DDCE14F39945F6
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

 /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
 -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

 /sys/kernel/config/cluster/CLUSTER/node:
 total 0
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num

 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
 total 0
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
 -rw-r--r-- 1 root root 4096 Oct 19 00:12 num




 On 10/19/2011 00:12, Sunil Mushran wrote:
 ls -lR /sys/kernel/config/cluster

 What does this return?

 On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
 Hi,
 I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
 ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
 My problem is that all the time when i try to run 
 /etc/init.d/o2cb stop
 it fails with this error:
   Stopping O2CB cluster CLUSTER: Failed
   Unable to stop cluster as heartbeat region still active
 There is no active mount point. I tried to manually stop the 
 heartdbeat
 with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after 
 finding

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf 
/sys/kernel/config/cluster/.../heartbeat/*0C4AB55FE9314FA5A9F81652FDB9B22D *

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvment :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
DeviceFS Stack  UUID  Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
DeviceFS Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/md-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again   the outputs:
 cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
---here should be volgr1-lvol0 i guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think i have to enable debug first somehow..?

Laurentiu.

On 10/19/2011 00:17, Sunil Mushran wrote:

What does this return?
cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm

On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:

Here is the output:

ls -lR /sys/kernel/config/cluster
/sys/kernel/config/cluster:
total 0
drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

/sys/kernel/config/cluster/CLUSTER:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
-rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
-rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
drwxr-xr-x 4 root root0 Oct 11 20:23 node
-rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

/sys/kernel/config/cluster/CLUSTER/heartbeat:
total 0
drwxr-xr-x 2 root root0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/*918673F06F8F4ED188DDCE14F39945F6*:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
-r--r--r-- 1 root root 4096 Oct 19 00:12 pid
-rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

/sys/kernel/config/cluster/CLUSTER/node:
total 0
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num




On 10/19/2011 00:12, Sunil Mushran wrote:

ls -lR /sys/kernel/config/cluster

What does this return?

On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:

Hi,
I have a 2 nodes ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
My problem

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fail
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:

Yes, i did reformat it(even more than once i think, last week). This is a 
pre-production system and i'm trying various options before moving into real 
life.


On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf 
/sys/kernel/config/cluster/.../heartbeat/*0C4AB55FE9314FA5A9F81652FDB9B22D *

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvment :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
DeviceFS Stack  UUID  Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
DeviceFS Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/md-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again   the outputs:
 cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
---here should be volgr1-lvol0 i guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think i have to enable debug first somehow..?

Laurentiu.

On 10/19/2011 00:17, Sunil Mushran wrote:

What does this return?
cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev

Also, do:
ls -lR /sys/kernel/debug/ocfs2
ls -lR /sys/kernel/debug/o2dlm

On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:

Here is the output:

ls -lR /sys/kernel/config/cluster
/sys/kernel/config/cluster:
total 0
drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER

/sys/kernel/config/cluster/CLUSTER:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
drwxr-xr-x 3 root root0 Oct 19 00:12 heartbeat
-rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
-rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
drwxr-xr-x 4 root root0 Oct 11 20:23 node
-rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms

/sys/kernel/config/cluster/CLUSTER/heartbeat:
total 0
drwxr-xr-x 2 root root0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/*918673F06F8F4ED188DDCE14F39945F6*:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
-rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
-r--r--r-- 1 root root 4096 Oct 19 00:12 pid
-rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block

/sys/kernel/config/cluster/CLUSTER/node:
total 0
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002

/sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
-rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
-rw-r--r-- 1 root root 4096 Oct 19 00:12 local
-rw-r--r-- 1 root root 4096 Oct 19 00:12 num

/sys/kernel

Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active

2011-10-18 Thread Sunil Mushran

Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:

OK, i rebooted one of the nodes(both had similar issues); . But something is 
still fishy.
- i mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- i unmount it: umount /mnt/tmp/
- tried to stop o2cb:  /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- i cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/

PS: i'm going to sleep now, i have to be up in a few hours. We can continue 
tomorrow if it's ok with you.
Thank you for your help.

Laurentiu.

On 10/19/2011 01:33, Sunil Mushran wrote:

One way this can happen is if one starts the hb manually and then force
formats on that volume. The format will generate a new uuid. Once that
happens, the hb tool cannot map the region to the device and thus fail
to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:

Yes, i did reformat it(even more than once i think, last week). This is a 
pre-production system and i'm trying various options before moving into real 
life.


On 10/19/2011 01:19, Sunil Mushran wrote:

Did you reformat the volume recently? or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:

well..this is weird
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
*918673F06F8F4ED188DDCE14F39945F6*  dead_threshold

looks like we have different UUIDs. Where is this coming from??

ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs


On 10/19/2011 01:04, Sunil Mushran wrote:

Let's do it by hand.
rm -rf 
/sys/kernel/config/cluster/.../heartbeat/*0C4AB55FE9314FA5A9F81652FDB9B22D *

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:

 ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

No improvment :(


On 10/19/2011 00:50, Sunil Mushran wrote:

See if this cleans it up.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:

ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs


On 10/19/2011 00:43, Sunil Mushran wrote:

ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:

mounted.ocfs2 -d
DeviceFS Stack  UUID  Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2

mounted.ocfs2 -f
DeviceFS Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001

ro02xsrv001 = the other node in the cluster.

By the way, there is no /dev/md-2
 ls /dev/dm-*
/dev/dm-0  /dev/dm-1


On 10/19/2011 00:37, Sunil Mushran wrote:

So it is not mounted. But we still have a hb thread because
hb could not be stopped during umount. The reason for that
could be the same that causes ocfs2_hb_ctl to fail.

Do:
mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:

ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0

ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0

ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid

There is no /dev/dm-2 mounted.


On 10/19/2011 00:27, Sunil Mushran wrote:

mount -t debugfs debugfs /sys/kernel/debug

Then list that dir.

Also, do:
ocfs2_hb_ctl -l -d /dev/dm-2

Be careful before killing. We want to be sure that dev is not mounted.

On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:

Again   the outputs:
 cat 
/sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
dm-2
---here should be volgr1-lvol0 i guess?

ls -lR /sys/kernel/debug/ocfs2
ls: /sys/kernel/debug/ocfs2: No such file or directory

ls -lR /sys/kernel/debug/o2dlm
ls: /sys/kernel/debug/o2dlm: No such file or directory

I think i have to enable debug first somehow..?

Laurentiu.

On 10/19/2011 00:17, Sunil Mushran wrote:

What does

Re: [Ocfs2-users] Partition table crash, where can I find debug message?

2011-10-12 Thread Sunil Mushran

Not sure what you mean by a partition table crash. Is it that someone
overwrote the partition table on the iscsi server? That's what it looks
like. If mount cannot detect the fs type, then it means atleast superblock
corruption. And such corruptions typically caused by external entities.
Stray dd perhaps.

Did you try recovering the superblock using one of the the backups?
fsck.ocfs2 -r [1-6] /dev/sdX ?

On 10/11/2011 07:04 PM, Frank Zhang wrote:


Hi Experts, recently I observed a partition table crash that made me really 
scared.

I have two OVM servers sharing OCFS2 over iscsi, after running  a bunch of VMs 
for a while,  all VMs were gone and I saw the mount points of OCFS2 gone on 
both hosts.

Then I tried to mount it again, the iscsi device crashed by saying please specify 
filesystem type. I checked dmesg but there is nothing useful except

SCSI device sdc: drive cache: write back

sdc: unknown partition table

sd 2:0:0:1: Attached scsi disk sdc

sd 2:0:0:1: Attached scsi generic sg3 type 0

OCFS2 Node Manager 1.4.4

OCFS2 DLM 1.4.4

OCFS2 DLMFS 1.4.4

OCFS2 User DLM kernel interface loaded

connection1:0: detected conn error (1011)

basically after logging into ISCSI device on both hosts, I created soft links 
of /dev/ovm_iscsi1 pointing to device node under 
/dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 to OCFS2 
and mounted them to somewhere(of course I configured /etc/ocfs2/cluster.conf 
and made o2cb correctly start).

Could somebody tell me where to get more debug info to trace the problem? This 
is really scared considering I may lose all my VMs because of the silent crash.

And is there any way to recover the partition table? Thanks


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Partition table crash, where can I find debug message?

2011-10-12 Thread Sunil Mushran

Hard to say. You'll need to investigate the extent of the crash.

On 10/12/2011 10:49 AM, Frank Zhang wrote:


Sorry, it's not power outage, it's just a normal reboot.

Is this serious to corrupt the super block?

*From:*Frank Zhang
*Sent:* Wednesday, October 12, 2011 10:37 AM
*To:* 'Sunil Mushran'
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* RE: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Thanks Suni. Yes the terminology should be super block corruption.

I checked with my colleague they said  the ISCSI server suffered a power outage 
yesterday so they rebooted it.

Given it was under heavy usage because of many VM running on, I guess this may 
be the cause. now I am trying to recover it

*From:*Sunil Mushran [mailto:sunil.mush...@oracle.com] 
mailto:[mailto:sunil.mush...@oracle.com]
*Sent:* Wednesday, October 12, 2011 10:08 AM
*To:* Frank Zhang
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* Re: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Not sure what you mean by a partition table crash. Is it that someone
overwrote the partition table on the iscsi server? That's what it looks
like. If mount cannot detect the fs type, then it means atleast superblock
corruption. And such corruptions typically caused by external entities.
Stray dd perhaps.

Did you try recovering the superblock using one of the the backups?
fsck.ocfs2 -r [1-6] /dev/sdX ?

On 10/11/2011 07:04 PM, Frank Zhang wrote:

Hi Experts, recently I observed a partition table crash that made me really 
scared.

I have two OVM servers sharing OCFS2 over iscsi, after running  a bunch of VMs 
for a while,  all VMs were gone and I saw the mount points of OCFS2 gone on 
both hosts.

Then I tried to mount it again, the iscsi device crashed by saying please specify 
filesystem type. I checked dmesg but there is nothing useful except

SCSI device sdc: drive cache: write back

sdc: unknown partition table

sd 2:0:0:1: Attached scsi disk sdc

sd 2:0:0:1: Attached scsi generic sg3 type 0

OCFS2 Node Manager 1.4.4

OCFS2 DLM 1.4.4

OCFS2 DLMFS 1.4.4

OCFS2 User DLM kernel interface loaded

connection1:0: detected conn error (1011)

basically after logging into ISCSI device on both hosts, I created soft links 
of /dev/ovm_iscsi1 pointing to device node under 
/dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 to OCFS2 
and mounted them to somewhere(of course I configured /etc/ocfs2/cluster.conf 
and made o2cb correctly start).

Could somebody tell me where to get more debug info to trace the problem? This 
is really scared considering I may lose all my VMs because of the silent crash.

And is there any way to recover the partition table? Thanks

  
  
___

Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com  mailto:Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] Partition table crash, where can I find debug message?

2011-10-12 Thread Sunil Mushran

extent of the corruption... (not crash)

On 10/12/2011 10:51 AM, Sunil Mushran wrote:

Hard to say. You'll need to investigate the extent of the crash.

On 10/12/2011 10:49 AM, Frank Zhang wrote:


Sorry, it's not power outage, it's just a normal reboot.

Is this serious to corrupt the super block?

*From:*Frank Zhang
*Sent:* Wednesday, October 12, 2011 10:37 AM
*To:* 'Sunil Mushran'
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* RE: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Thanks Suni. Yes the terminology should be super block corruption.

I checked with my colleague they said  the ISCSI server suffered a power outage 
yesterday so they rebooted it.

Given it was under heavy usage because of many VM running on, I guess this may 
be the cause. now I am trying to recover it

*From:*Sunil Mushran [mailto:sunil.mush...@oracle.com] 
mailto:[mailto:sunil.mush...@oracle.com]
*Sent:* Wednesday, October 12, 2011 10:08 AM
*To:* Frank Zhang
*Cc:* 'ocfs2-users@oss.oracle.com'
*Subject:* Re: [Ocfs2-users] Partition table crash, where can I find debug 
message?

Not sure what you mean by a partition table crash. Is it that someone
overwrote the partition table on the iscsi server? That's what it looks
like. If mount cannot detect the fs type, then it means atleast superblock
corruption. And such corruptions typically caused by external entities.
Stray dd perhaps.

Did you try recovering the superblock using one of the the backups?
fsck.ocfs2 -r [1-6] /dev/sdX ?

On 10/11/2011 07:04 PM, Frank Zhang wrote:

Hi Experts, recently I observed a partition table crash that made me really 
scared.

I have two OVM servers sharing OCFS2 over iscsi, after running  a bunch of VMs 
for a while,  all VMs were gone and I saw the mount points of OCFS2 gone on 
both hosts.

Then I tried to mount it again, the iscsi device crashed by saying please specify 
filesystem type. I checked dmesg but there is nothing useful except

SCSI device sdc: drive cache: write back

sdc: unknown partition table

sd 2:0:0:1: Attached scsi disk sdc

sd 2:0:0:1: Attached scsi generic sg3 type 0

OCFS2 Node Manager 1.4.4

OCFS2 DLM 1.4.4

OCFS2 DLMFS 1.4.4

OCFS2 User DLM kernel interface loaded

connection1:0: detected conn error (1011)

basically after logging into ISCSI device on both hosts, I created soft links 
of /dev/ovm_iscsi1 pointing to device node under 
/dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 to OCFS2 
and mounted them to somewhere(of course I configured /etc/ocfs2/cluster.conf 
and made o2cb correctly start).

Could somebody tell me where to get more debug info to trace the problem? This 
is really scared considering I may lose all my VMs because of the silent crash.

And is there any way to recover the partition table? Thanks

  
  
___

Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com  mailto:Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users




___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] one node kernel panic

2011-10-07 Thread Sunil Mushran
uek is a different kernel entirely. It is hard to say whether you
will or will not hit it with uek mainly because the underlying code
is different.

On 10/06/2011 10:33 PM, Hideyasu Kojima wrote:
 Thank you for responding.

 I think UEK5 is based on RHEL5 kernel.
 Does the problem same as UEK5 arise?

 (2011/10/05 1:45), Sunil Mushran wrote:
 int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
 {
 int error;

 spin_lock_irq(current-sighand-siglock);  CRASH
 if (oldset)
 *oldset = current-blocked;
 ...
 }

 current-sighand is NULL. So definitely a race. Generic kernel issue.
 Ping your kernel vendor.

 On 10/03/2011 07:49 PM, Hideyasu Kojima wrote:
 Hi,

 I run ocfs2/drbd active-active 2node cluster.

 ocfs2 version is 1.4.7-1
 ocfs2-tool version is 1.4.4
 Linux version is RHEL 5.4 (2.6.18-164.el5 x86_64)

 1 node crash with kernel panic once.

 What is the cause?

 The bottom is the analysis of vmcore.

 

 Unable to handle kernel NULL pointer dereference at 0808 RIP:
 [80064ae6] _spin_lock_irq+0x1/0xb
 PGD 187e15067 PUD 187e16067 PMD 0
 Oops: 0002 [1] SMP
 last sysfs file:
 /devices/pci:00/:00:09.0/:06:00.0/:07:00.0/irq
 CPU 1
 Modules linked in: mptctl mptbase softdog autofs4 ipmi_devintf ipmi_si
 ipmi_msghandler ocfs2(U) ocfs2_dlmfs(U) ocfs2_dlm(U)
 ocfs2_nodemanager(U) configfs drbd(U) bonding ipv6 xfrm_nalgo crypto_api
 bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U)
 dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core
 button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev
 sr_mod cdrom sg pcspkr serio_raw hpilo bnx2(U) dm_raid45 dm_message
 dm_region_hash dm_log dm_mod dm_mem_cache hpahcisr(PU) ata_piix libata
 shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
 Pid: 21924, comm: res Tainted: P 2.6.18-164.el5 #1
 RIP: 0010:[80064ae6] [80064ae6]
 _spin_lock_irq+0x1/0xb
 RSP: 0018:81008b1cfae0 EFLAGS: 00010002
 RAX: 810187af4040 RBX:  RCX: 8101342b7b80
 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808
 RBP: 81008b1cfb98 R08:  R09: 
 R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8
 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8
 FS: () GS:810105d51840()
 knlGS:
 CS: 0010 DS:  ES:  CR0: 8005003b
 CR2: 0808 CR3: 000187e14000 CR4: 06e0
 Process res (pid: 21924, threadinfo 81008b1ce000, task
 810187af4040)
 Stack: 8001db30 81007f070520 885961f3
 810105d39400
 88596323 06ff813231393234 810075463018 810075463018
 0297 81007f070520 810075463028 0246
 Call Trace:
 [8001db30] sigprocmask+0x28/0xdb
 [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
 [88596323] :ocfs2:ocfs2_delete_inode+0x130/0x1691
 [88581f16] :ocfs2:ocfs2_drop_lock+0x67a/0x77b
 [8858026a] :ocfs2:ocfs2_remove_lockres_tracking+0x10/0x45
 [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
 [8002f49e] generic_delete_inode+0xc6/0x143
 [88595c85] :ocfs2:ocfs2_drop_inode+0xf0/0x161
 [8000d46e] dput+0xf6/0x114
 [800e9c44] prune_one_dentry+0x66/0x76
 [8002e958] prune_dcache+0x10f/0x149
 [8004d66e] shrink_dcache_parent+0x1c/0xe1
 [80104f8b] proc_flush_task+0x17c/0x1f6
 [8008fa2c] sched_exit+0x27/0xb5
 [80018024] release_task+0x387/0x3cb
 [80015c50] do_exit+0x865/0x911
 [80049281] cpuset_exit+0x0/0x88
 [8002b080] get_signal_to_deliver+0x42c/0x45a
 [8005ae7b] do_notify_resume+0x9c/0x7af
 [8008b6a2] deactivate_task+0x28/0x5f
 [80021f3f] __up_read+0x19/0x7f
 [80066b58] do_page_fault+0x4fe/0x830
 [800b65b2] audit_syscall_exit+0x336/0x362
 [8005d32e] int_signal+0x12/0x17


 Code: f0 ff 0f 0f 88 f3 00 00 00 c3 53 48 89 fb e8 33 f5 02 00 f0
 RIP [80064ae6] _spin_lock_irq+0x1/0xb
 RSP81008b1cfae0
 crash bt
 PID: 21924 TASK: 810187af4040 CPU: 1 COMMAND: res
 #0 [81008b1cf840] crash_kexec at 800ac5b9
 #1 [81008b1cf900] __die at 80065127
 #2 [81008b1cf940] do_page_fault at 80066da7
 #3 [81008b1cfa30] error_exit at 8005dde9
 [exception RIP: _spin_lock_irq+1]
 RIP: 80064ae6 RSP: 81008b1cfae0 RFLAGS: 00010002
 RAX: 810187af4040 RBX:  RCX: 8101342b7b80
 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808
 RBP: 81008b1cfb98 R8:  R9: 
 R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8
 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8
 ORIG_RAX:  CS: 0010 SS: 0018
 #4 [81008b1cfae0] sigprocmask at 8001db30
 #5

Re: [Ocfs2-users] Kernel Panic / Fencing

2011-10-06 Thread Sunil Mushran
I am unclear. What happens when a server is rebooted (or crashes).
Crash the network? Can you expand on this?

On 10/06/2011 05:52 PM, Tony Rios wrote:
 Hey all,

 I'm running a current version of Ubuntu and we are using OCFS2 across
 a cluster of 9 web servers.
 Everything works perfectly, so long as none of the servers need to be
 rebooted (or crash).

 I've done several web searches and one of the items that I've found to
 be suggested was to double the Heartbeat threshold.
 I increased ours from 31 to 61 and it doesn't appear to have helped at all.

 I can't imagine that if a server becomes unreachable that by design it
 is intended to crash the entire network.

 I'm hoping that someone will have some feedback here because I'm at a loss.

 Thanks so much,
 Tony

 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Fwd: OCFS drives not syncing

2011-10-05 Thread Sunil Mushran
On 10/05/2011 08:46 AM, Bradlee Landis wrote:
 Sorry Sunil, my email replied to you instead of the list.

 On Wed, Oct 5, 2011 at 10:09 AM, Sunil Mushransunil.mush...@oracle.com  
 wrote:
 ocfs2 is a shared disk cluster file system. It requires a shared disk.

 However, if you are only going to use 2 nodes, you could use drbd,
 a replicating block device. To ocfs2, it appears as a shared disk.
 Google drbd and ocfs2 for more.

 So I've been confused about this the whole time I guess. So how is the
 OCFS drive shared? Is it done through OCFS, or does it require NFS?

 How do I access the filesystem from the other node?

The drives need to be physically shared. As in, all nodes need
to be able to concurrently read and write directly to the disk.

Two popular solutions are fiber channel and iscsi.

A fiber channel solution could be EMC disk array + FC switch +
hbas on all nodes hooked up to the switch.

An iscsi solution could be a iscsi target running on one server
with the disks. The nodes would use an iscsi initiator to access
the target. The devices will show up as regular devices (/dev/sdX)
on all nodes.

The cheapest solution would be to use drbd.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] one node kernel panic

2011-10-04 Thread Sunil Mushran
int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
{
 int error;

 spin_lock_irq(current-sighand-siglock);  CRASH
 if (oldset)
 *oldset = current-blocked;
...
}

current-sighand is NULL. So definitely a race. Generic kernel issue.
Ping your kernel vendor.

On 10/03/2011 07:49 PM, Hideyasu Kojima wrote:
 Hi,

 I run ocfs2/drbd active-active 2node cluster.

 ocfs2 version is 1.4.7-1
 ocfs2-tool version is 1.4.4
 Linux version is RHEL 5.4 (2.6.18-164.el5 x86_64)

 1 node crash with kernel panic once.

 What is the cause?

 The bottom is the analysis of vmcore.

 

 Unable to handle kernel NULL pointer dereference at 0808 RIP:
   [80064ae6] _spin_lock_irq+0x1/0xb
 PGD 187e15067 PUD 187e16067 PMD 0
 Oops: 0002 [1] SMP
 last sysfs file:
 /devices/pci:00/:00:09.0/:06:00.0/:07:00.0/irq
 CPU 1
 Modules linked in: mptctl mptbase softdog autofs4 ipmi_devintf ipmi_si
 ipmi_msghandler ocfs2(U) ocfs2_dlmfs(U) ocfs2_dlm(U)
 ocfs2_nodemanager(U) configfs drbd(U) bonding ipv6 xfrm_nalgo crypto_api
 bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U)
 dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core
 button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev
 sr_mod cdrom sg pcspkr serio_raw hpilo bnx2(U) dm_raid45 dm_message
 dm_region_hash dm_log dm_mod dm_mem_cache hpahcisr(PU) ata_piix libata
 shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
 Pid: 21924, comm: res Tainted: P  2.6.18-164.el5 #1
 RIP: 0010:[80064ae6]  [80064ae6] _spin_lock_irq+0x1/0xb
 RSP: 0018:81008b1cfae0  EFLAGS: 00010002
 RAX: 810187af4040 RBX:  RCX: 8101342b7b80
 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808
 RBP: 81008b1cfb98 R08:  R09: 
 R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8
 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8
 FS:  () GS:810105d51840() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0808 CR3: 000187e14000 CR4: 06e0
 Process res (pid: 21924, threadinfo 81008b1ce000, task 810187af4040)
 Stack:  8001db30 81007f070520 885961f3 810105d39400
   88596323 06ff813231393234 810075463018 810075463018
   0297 81007f070520 810075463028 0246
 Call Trace:
   [8001db30] sigprocmask+0x28/0xdb
   [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
   [88596323] :ocfs2:ocfs2_delete_inode+0x130/0x1691
   [88581f16] :ocfs2:ocfs2_drop_lock+0x67a/0x77b
   [8858026a] :ocfs2:ocfs2_remove_lockres_tracking+0x10/0x45
   [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691
   [8002f49e] generic_delete_inode+0xc6/0x143
   [88595c85] :ocfs2:ocfs2_drop_inode+0xf0/0x161
   [8000d46e] dput+0xf6/0x114
   [800e9c44] prune_one_dentry+0x66/0x76
   [8002e958] prune_dcache+0x10f/0x149
   [8004d66e] shrink_dcache_parent+0x1c/0xe1
   [80104f8b] proc_flush_task+0x17c/0x1f6
   [8008fa2c] sched_exit+0x27/0xb5
   [80018024] release_task+0x387/0x3cb
   [80015c50] do_exit+0x865/0x911
   [80049281] cpuset_exit+0x0/0x88
   [8002b080] get_signal_to_deliver+0x42c/0x45a
   [8005ae7b] do_notify_resume+0x9c/0x7af
   [8008b6a2] deactivate_task+0x28/0x5f
   [80021f3f] __up_read+0x19/0x7f
   [80066b58] do_page_fault+0x4fe/0x830
   [800b65b2] audit_syscall_exit+0x336/0x362
   [8005d32e] int_signal+0x12/0x17


 Code: f0 ff 0f 0f 88 f3 00 00 00 c3 53 48 89 fb e8 33 f5 02 00 f0
 RIP  [80064ae6] _spin_lock_irq+0x1/0xb
   RSP81008b1cfae0
 crash  bt
 PID: 21924  TASK: 810187af4040  CPU: 1   COMMAND: res
   #0 [81008b1cf840] crash_kexec at 800ac5b9
   #1 [81008b1cf900] __die at 80065127
   #2 [81008b1cf940] do_page_fault at 80066da7
   #3 [81008b1cfa30] error_exit at 8005dde9
  [exception RIP: _spin_lock_irq+1]
  RIP: 80064ae6  RSP: 81008b1cfae0  RFLAGS: 00010002
  RAX: 810187af4040  RBX:   RCX: 8101342b7b80
  RDX: 81008b1cfb98  RSI: 81008b1cfba8  RDI: 0808
  RBP: 81008b1cfb98   R8:    R9: 
  R10: 810075463090  R11: 88595b95  R12: 81008b1cfba8
  R13: 81007f070520  R14: 0001  R15: 81008b1cfce8
  ORIG_RAX:   CS: 0010  SS: 0018
   #4 [81008b1cfae0] sigprocmask at 8001db30
   #5 [81008b1cfb00] ocfs2_delete_inode at 88596323
   #6 [81008b1cfbf0] generic_delete_inode at 8002f49e
   #7 [81008b1cfc10] ocfs2_drop_inode at 

Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list

2011-09-30 Thread Sunil Mushran
On 09/30/2011 06:49 AM, Herman L wrote:
 On Thursday, September 29, 2011 2:04 PM Sunil Mushran wrote:
 On 09/29/2011 08:56 AM, Herman L wrote:
 On Wednesday, September 21, 2011 4:00 PM, Sunil Mushran wrote:
 On 09/21/2011 12:37 PM, Herman L wrote:
 On 09/19/2011 08:35 AM, Herman L wrote:
 Hi all,

 Got a couple of these messages recently, but I don't know what they 
 mean.  Can anyone let me know if I need to panic?  I'm using OCFS2 
 compiled from the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

 Sep 19 08:07:15 server-1 kernel: [3892420.40] 
 (10387,12):dlm_lockres_release:507 ERROR: Resource 
 W0001b027d69b591f15 not on the Tracking list
 Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: 
 W0001b027d69b591f1, owner=1, state=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398195]  last used: 
 8197071325, refcnt: 0, on purge list: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398197]  on dirty list: no, 
 on reco list: no, migrating pending: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398198]  inflight locks: 0, 
 asts reserved: 0
 Sep 19 08:07:15 server-1 kernel: [3892420.398199]  refmap nodes: [ ], 
 inflight=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]  granted queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]  converting queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398201]  blocked queue:

 Thanks!
 Herman
 From: Sunil Mushran
 To: Herman L
 Sent: Monday, September 19, 2011 12:57 PM
 Subject: Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource 
 W0001b027d69b591f15 not on the Tracking list

I've no idea of the state of the source that you are using. The 
 message
is a warning indicating a race. While it probably did not affect 
 the functioning,
there is no guarantee that that would be the case the next time 
 around.

The closest relevant patch is over 2 years old.
 http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710
 Thanks Sunil for responding.  I know you can't easily support my setup, 
 but anyways I checked the sources.

 Looks like the patch you mention is in the sources I compiled from ( 
 RHEL6.0 kernel-2.6.32-71.24.1.el6.src.rpm ), so I guess the source of the 
 problem is elsewhere.

 The fs/ocfs2 directory from the RHEL6 sources I compiled from is almost 
 exactly the same as the mainline 2.6.32 kernel, except
 1) It looks like they implemented the changes in aops.c from the cleanup 
 blockdev_direct_IO locking patch that's in 2.6.33.
 2) In journal.c, they rename ocfs2_commit_trigger to 
 ocfs2_frozen_trigger, which seems to be from 2.6.35.
 3) In cluster/masklog.c they add a const to the mlog_attr_ops 
 declaration
 4) And in quota.h, they are missing #define QFMT_OCFS2 3

 Not sure if that helps any, but thanks in any case!
 All those changes are ok. And unrelated. This is a new one.
 Sorry, I think I accidentally wrote a message with only the quoted block... 
 oops.  Sorry.


 Sunil, are you able to and interested in looking at this issue?  If so, is 
 there any information that I can provide that might help?  Fortunately, 
 after those few initial days of daily errors, it seems to have stopped for 
 now.  But of course, I'm still worried about this.

 http://oss.oracle.com/~smushran/0001-ocfs2-dlm-Use-dlm-track_lock-when-adding-resource-to.patch

 This should fix it. But do note that the patch is untested.
 Thanks for the quick reply and patch!  I'll try to test it out when I get a 
 chance.  Also, is there any way to force this error so that I can know if 
 that patch is working?  Also, now that you have a fix for this, can you make 
 any kind of guess as to how likely or what circumstances that the unpatched 
 OCFS2 will cause  dangerous problems?

Well, the first goal is always to see nothing else is breaking. That's the most
important bit. As far as fixing the issue goes, only time will tell. There is no
way I can think of that will definitely prove that the issue is resolved. Also, 
even
if it does reproduce, it does not mean that this patch is bad. It could be there
is another race that we have to plug.

Depends on the definition of dangerous. If it means cluster-wide corruption, or
cluster-wide outage, then no. But if it means a node crashing, then yes. Though
the chance of that is fairly low.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

2011-09-27 Thread Sunil Mushran
On 09/27/2011 09:12 AM, Ulf Zimmermann wrote:
 - -Original Message-
 From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
 Sent: Monday, September 26, 2011 10:09 AM
 To: Ulf Zimmermann
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 
 on
 EL5

 I'll look at the tunefs issue. But the other one does not make sense.
 strict_jbd is a compat flag. Mount should work. What is the mount
 error? As in, in dmesg.
 I don't see any dmesg or /var/log/messages, but the error I saw was from 
 tunefs:

 demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 
 /dev/mapper/u07
 tunefs.ocfs2 1.2.7
 tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device 
 /dev/mapper/u07


So that is correct. In short that flag was added to allow us to use the
jbd(2) features. We use this to create volumes  16TB.

I guess if you want to use with 1.2, format it with 1.2 tools.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

2011-09-26 Thread Sunil Mushran
I'll look at the tunefs issue. But the other one does not make sense.
strict_jbd is a compat flag. Mount should work. What is the mount
error? As in, in dmesg.

On 09/25/2011 04:43 AM, Ulf Zimmermann wrote:
 As tunefs.ocfs2 wasn't working for us, I tried to mkfs.ocfs2 the volumes 
 again with --fs-feature-level=max-compat. This still turns on 
 strict-journal-super and there seems no way around this? This makes the 
 volume not compatible with OCFS 1.2.9

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-
 boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
 Sent: Sunday, September 25, 2011 1:43 AM
 To: ocfs2-users@oss.oracle.com
 Subject: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on 
 EL5

 We are running into a problem which looks like the same we had with
 fsck.ocfs2 a while back. This is with ocfs2-tools 1.4.4. I am trying to use
 tunefs.ocfs2 to turn off some features. The program starts up but then starts
 eating all available memory and more and the system starts to swap like crazy
 in and out. This is exactly the same behavior as the fsck.ocfs2 for which we
 were given a patched binary.

 I tried to compile the tunefs.ocfs2 from 1.6.x but the same problem with that
 binary.

 Ulf.


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users
 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list

2011-09-21 Thread Sunil Mushran
On 09/21/2011 12:37 PM, Herman L wrote:
 On 09/19/2011 08:35 AM, Herman L wrote:
 Hi all,

 Got a couple of these messages recently, but I don't know what they mean.  
 Can anyone let me know if I need to panic?  I'm using OCFS2 compiled from 
 the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

 Sep 19 08:07:15 server-1 kernel: [3892420.40] 
 (10387,12):dlm_lockres_release:507 ERROR: Resource 
 W0001b027d69b591f15 not on the Tracking list
 Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: 
 W0001b027d69b591f1, owner=1, state=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398195]   last used: 8197071325, 
 refcnt: 0, on purge list: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398197]   on dirty list: no, on 
 reco list: no, migrating pending: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398198]   inflight locks: 0, asts 
 reserved: 0
 Sep 19 08:07:15 server-1 kernel: [3892420.398199]   refmap nodes: [ ], 
 inflight=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]   granted queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]   converting queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398201]   blocked queue:

 Thanks!
 Herman
 From: Sunil Mushran
 To: Herman L
 Sent: Monday, September 19, 2011 12:57 PM
 Subject: Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource 
 W0001b027d69b591f15 not on the Tracking list

I've no idea of the state of the source that you are using. The message
is a warning indicating a race. While it probably did not affect the 
 functioning,
there is no guarantee that that would be the case the next time around.

The closest relevant patch is over 2 years old.
 http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710
 Thanks Sunil for responding.  I know you can't easily support my setup, but 
 anyways I checked the sources.

 Looks like the patch you mention is in the sources I compiled from ( RHEL6.0 
 kernel-2.6.32-71.24.1.el6.src.rpm ), so I guess the source of the problem is 
 elsewhere.

 The fs/ocfs2 directory from the RHEL6 sources I compiled from is almost 
 exactly the same as the mainline 2.6.32 kernel, except
 1) It looks like they implemented the changes in aops.c from the cleanup 
 blockdev_direct_IO locking patch that's in 2.6.33.
 2) In journal.c, they rename ocfs2_commit_trigger to ocfs2_frozen_trigger, 
 which seems to be from 2.6.35.
 3) In cluster/masklog.c they add a const to the mlog_attr_ops declaration
 4) And in quota.h, they are missing #define QFMT_OCFS2 3

 Not sure if that helps any, but thanks in any case!

All those changes are ok. And unrelated. This is a new one.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list

2011-09-19 Thread Sunil Mushran

I've no idea of the state of the source that you are using. The message
is a warning indicating a race. While it probably did not affect the 
functioning,
there is no guarantee that that would be the case the next time around.

The closest relevant patch is over 2 years old.
http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710

On 09/19/2011 08:35 AM, Herman L wrote:

Hi all,

Got a couple of these messages recently, but I don't know what they mean.  Can 
anyone let me know if I need to panic?  I'm using OCFS2 compiled from the 
kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

Sep 19 08:07:15 server-1 kernel: [3892420.40] 
(10387,12):dlm_lockres_release:507 ERROR: Resource 
W0001b027d69b591f15 not on the Tracking list
Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: 
W0001b027d69b591f1, owner=1, state=0
Sep 19 08:07:15 server-1 kernel: [3892420.398195]   last used: 8197071325, 
refcnt: 0, on purge list: no
Sep 19 08:07:15 server-1 kernel: [3892420.398197]   on dirty list: no, on reco 
list: no, migrating pending: no
Sep 19 08:07:15 server-1 kernel: [3892420.398198]   inflight locks: 0, asts 
reserved: 0
Sep 19 08:07:15 server-1 kernel: [3892420.398199]   refmap nodes: [ ], 
inflight=0
Sep 19 08:07:15 server-1 kernel: [3892420.398200]   granted queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398200]   converting queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398201]   blocked queue:

Thanks!
Herman


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] 11gr1 RAC + ocfs2 node2 is down and not able to mount the ocfs2 FS on node1

2011-09-19 Thread Sunil Mushran

The connect is failing. One of the main reason is a firewall.
See if iptables is running. Check on both nodes. If so, shutdown
it down or add a rule to allow traffic on the o2cb port.

On 09/18/2011 08:57 PM, veeraa bose wrote:

Hi All,


we are having two node 11gr1 RAC (we have used ocfs2 for CRS and ASM for DB 
data), now the node2 is down and node1 got rebooted and after node1 is the 
ocfs2 Fs used for CRS is not getting mounted and the error is.

 #/etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) mount.ocfs2: Transport endpoint is 
not connected while mounting /dev/mapper/vg_oracle_shared-
RAC--DG--CLUS--01 on /u02/ocfs2/RAC-DG-CLUS-01. Check 'dmesg' for more 
information on this error.
mount.ocfs2: Transport endpoint is not connected while mounting 
/dev/mapper/vg_oracle_shared-RAC--DG--CLUS--02 on /u02/ocfs2/RAC-DG-CLUS-02. 
Check 'dmesg' for more information on this error.
mount.ocfs2: Transport endpoint is not connected while mounting 
/dev/mapper/vg_oracle_shared-global_backup on /global/backup. Check 'dmesg' for 
more information on this error.
   [FAILED]

And below is the log from Dmesg.

(o2net,6121,4):o2net_connect_expired:1664 ERROR: no connection established with 
node 2 after 60.0 seconds, giving up and returning errors.
(mount.ocfs2,7327,12):dlm_request_join:1036 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_try_to_join_domain:1210 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_join_domain:1488 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_register_domain:1754 ERROR: status = -107
(mount.ocfs2,7327,12):ocfs2_dlm_init:2808 ERROR: status = -107
(mount.ocfs2,7327,12):ocfs2_mount_volume:1447 ERROR: status = -107
ocfs2: Unmounting device (253,19) on (node 1)

please guideme,  how to mount the ocfs2 FS on the node1 and bring the cluster 
up.

Thanks
Veera.


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

Re: [Ocfs2-users] fsck doesn't fix bad chain

2011-09-17 Thread Sunil Mushran
Can you save the o2image of the volume when it is in that state.
We'll need that for analysis.

On 09/16/2011 05:41 AM, Andre Nathan wrote:
 Hello

 For a while I had seen errors like this in the kernel logs:

OCFS2: ERROR (device drbd5): ocfs2_validate_gd_parent: Group
descriptor #69084874 has bad chain 126
File system is now read-only due to the potential of on-disk
corruption. Please run fsck.ocfs2 once the file system is unmounted.

 This always happened in the same device, and whenever it happened I ran
 fsck.ocfs2 -fy /dev/drbd5, which showed messages like these:

[GROUP_FREE_BITS] Group descriptor at block 201309696 claims to have
9893 free bits which is more than 9886 bits indicated by the bitmap.
Drop its free bit count down to the total? y
[CHAIN_BITS] Chain 166 in allocator inode 11 has 1264713 bits
marked free out of 1516032 total bits but the block groups in the
chain have 1264706 free out of 1516032 total.  Fix this by updating
the chain record? y
[CHAIN_GROUP_BITS] Allocator inode 11 has 79407510 bits marked used
out of 365955414 total bits but the chains have 79407911 used out of
365955414 total.  Fix this by updating the inode counts? y
[INODE_COUNT] Inode 69085510 has a link count of 0 on disk but
directory entry references come to 1. Update the count on disk to
match? y

 As time passed, the frequency of these issues started to increase, and
 the last time it happened, I decided to run fsck twice in a row, and was
 surprised to see it showed the same messages in both runs. It seems it
 was unable to fix the problem.

 I identified the files corresponding to the inodes using debugfs.ocfs2
 and copied them to a new place, and then moved the copy over the
 original file, in order to recreate the inodes. Whenever I did that for
 one inode, the error above happened and the filesystem became read-only,
 so I had to umount/mount the volume again in order to be able to write
 to it again.

 After doing this, I ran fsck.ocfs2 -fy again twice, and no errors were
 reported. Since then I haven't seen this problem again.

 I'm running kernel 2.6.35 and ocfs2-tools 1.6.4.

 Has anyone else seen an issue like that?

 Thanks
 Andre


 ___
 Ocfs2-users mailing list
 Ocfs2-users@oss.oracle.com
 http://oss.oracle.com/mailman/listinfo/ocfs2-users


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Linux kernel crash due to ocfs2

2011-09-16 Thread Sunil Mushran
I got it. But I still don't see the symbols. Maybe we are corrupting the stack.
Maybe this is ppc specific. Do you have a x86/x86_64 box that can access
the same volume? If so I could give you a drop of the same for that arch.

Also, have to run fsck on this volume before? One reason o2image could
fail is if there is a bad block pointer. While it is supposed to handle all such
cases, it is known to miss some cases.

On 09/16/2011 12:06 AM, Betzos Giorgos wrote:
 Please try http://portal-md.glk.gr/ocfs2/core.32578.bz2

 Please let me know, in case you have any problem downloading it.

 Thanks,

 George

 On Thu, 2011-09-15 at 09:45 -0700, Sunil Mushran wrote:
 I was hoping to get a readable stack. Please could you provide a link to
 the coredump.

 On 09/15/2011 02:51 AM, Betzos Giorgos wrote:
 Hello,

 I am sorry for the delay in responding. Unfortunately, if faulted again.

 Here is the log. Although my email client folds the Memory Map lines.
 The core file is available.

 Thanks,

 George

 # ./o2image.ppc.dbg /dev/mapper/mpath0 /files_shared/u02.o2image
 *** glibc detected *** ./o2image.ppc.dbg: corrupted double-linked list:
 0x10075000 ***
 === Backtrace: =
 /lib/libc.so.6[0xfeb1ab4]
 /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
 ./o2image.ppc.dbg[0x1000d098]
 ./o2image.ppc.dbg[0x1000297c]
 ./o2image.ppc.dbg[0x10001eb8]
 ./o2image.ppc.dbg[0x1000228c]
 ./o2image.ppc.dbg[0x10002804]
 ./o2image.ppc.dbg[0x10001eb8]
 ./o2image.ppc.dbg[0x1000228c]
 ./o2image.ppc.dbg[0x10002804]
 ./o2image.ppc.dbg[0x10003bbc]
 ./o2image.ppc.dbg[0x10004480]
 /lib/libc.so.6[0xfe4dc60]
 /lib/libc.so.6[0xfe4dea0]
 === Memory map: 
 0010-0012 r-xp 0010 00:00 0
 [vdso]
 0f43-0f44 r-xp  08:13
 180307 /lib/libcom_err.so.2.1
 0f44-0f45 rw-p  08:13
 180307 /lib/libcom_err.so.2.1
 0f90-0f9c r-xp  08:13
 180293 /lib/libglib-2.0.so.0.1200.3
 0f9c-0f9d rw-p 000b 08:13
 180293 /lib/libglib-2.0.so.0.1200.3
 0fa4-0fa5 r-xp  08:13
 180292 /lib/librt-2.5.so
 0fa5-0fa6 r--p  08:13
 180292 /lib/librt-2.5.so
 0fa6-0fa7 rw-p 0001 08:13
 180292 /lib/librt-2.5.so
 0fce-0fd0 r-xp  08:13
 180291 /lib/libpthread-2.5.so
 0fd0-0fd1 r--p 0001 08:13
 180291 /lib/libpthread-2.5.so
 0fd1-0fd2 rw-p 0002 08:13
 180291 /lib/libpthread-2.5.so
 0fe3-0ffa r-xp  08:13
 180288 /lib/libc-2.5.so
 0ffa-0ffb r--p 0016 08:13
 180288 /lib/libc-2.5.so
 0ffb-0ffc rw-p 0017 08:13
 180288 /lib/libc-2.5.so
 0ffc-0ffe r-xp  08:13
 180287 /lib/ld-2.5.so
 0ffe-0fff r--p 0001 08:13
 180287 /lib/ld-2.5.so
 0fff-1000 rw-p 0002 08:13
 180287 /lib/ld-2.5.so
 1000-1005 r-xp  08:13
 7487795/root/o2image.ppc.dbg
 1005-1006 rw-p 0004 08:13
 7487795/root/o2image.ppc.dbg
 1006-1009 rwxp 1006 00:00 0
 [heap]
 f768-f7ff rw-p f768 00:00 0
 ff9a-ffaf rw-p ff9a 00:00 0
 [stack]
 Aborted (core dumped)


 On Thu, 2011-09-08 at 12:10 -0700, Sunil Mushran wrote:
 http://oss.oracle.com/~smushran/o2image.ppc.dbg

 Use the above executable. Hoping it won't fault. But if it does
 email me the backtrace. That trace will be readable as the exec
 has debugging symbols enabled.

 On 09/07/2011 11:24 PM, Betzos Giorgos wrote:
 # rpm -q ocfs2-tools
 ocfs2-tools-1.4.4-1.el5.ppc

 On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:
 version of ocfs2-tools?

 On 09/07/2011 09:10 AM, Betzos Giorgos wrote:
 Hello,

 I tried what you suggested but here is what I got:

 # o2image /dev/mapper/mpath0 /files_shared/u02.o2image
 *** glibc detected *** o2image: corrupted double-linked list: 
 0x10045000 ***
 === Backtrace: =
 /lib/libc.so.6[0xfeb1ab4]
 /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
 o2image[0x10007bb0]
 o2image[0x10002748]
 o2image[0x10001f50]
 o2image[0x10002334]
 o2image[0x100026a0]
 o2image[0x10001f50]
 o2image[0x10002334]
 o2image[0x100026a0]
 o2image[0x1000358c]
 o2image[0x10003e28]
 /lib/libc.so.6[0xfe4dc60]
 /lib/libc.so.6[0xfe4dea0]
 === Memory map: 
 0010-0012 r-xp 0010 00:00 0 
  [vdso]
 0f55-0f56 r-xp  08:13 2881590   
  /lib/libcom_err.so.2.1
 0f56-0f57 rw-p  08:13 2881590   
  /lib/libcom_err.so.2.1
 0f90-0f9c r-xp  08:13 2881576

Re: [Ocfs2-users] Syslog reports (ocfs2_wq, 15527, 2):ocfs2_orphan_del:1841 ERROR: status = -2

2011-09-15 Thread Sunil Mushran
  drwxr-xr-x   2 0 04096
 21-Jun-2008 16:42 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 12:01 ..
 debugfs: ls -l //orphan_dir:0002
  14  drwxr-xr-x   2 0 04096
 22-May-2008 12:01 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 12:01 ..
 debugfs: ls -l //orphan_dir:0003
  15  drwxr-xr-x   2 0 04096
 22-May-2008 12:01 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 12:01 ..

 Working on /dev/mapper/mpath20p1
 debugfs.ocfs2 1.4.4
 debugfs: ls -l //orphan_dir:
  12  drwxr-xr-x   2 0 04096
 3-Jun-2008 16:59 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:58 ..
 debugfs: ls -l //orphan_dir:0001
  13  drwxr-xr-x   2 0 04096
 21-Jun-2008 17:39 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:58 ..
 debugfs: ls -l //orphan_dir:0002
  14  drwxr-xr-x   2 0 04096
 22-May-2008 11:58 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:58 ..
 debugfs: ls -l //orphan_dir:0003
  15  drwxr-xr-x   2 0 04096
 22-May-2008 11:58 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:58 ..

 Working on /dev/mapper/mpath18p1
 debugfs.ocfs2 1.4.4
 debugfs: ls -l //orphan_dir:
  12  drwxr-xr-x   2 0 04096
 9-Jun-2008 13:54 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:56 ..
 debugfs: ls -l //orphan_dir:0001
  13  drwxr-xr-x   2 0 04096
 22-May-2008 11:56 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:56 ..
 debugfs: ls -l //orphan_dir:0002
  14  drwxr-xr-x   2 0 04096
 22-May-2008 11:56 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:56 ..
 debugfs: ls -l //orphan_dir:0003
  15  drwxr-xr-x   2 0 04096
 22-May-2008 11:56 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:56 ..

 Working on /dev/mapper/mpath19p1
 debugfs.ocfs2 1.4.4
 debugfs: ls -l //orphan_dir:
  12  drwxr-xr-x   2 0 04096
 3-Jun-2008 17:47 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:57 ..
 debugfs: ls -l //orphan_dir:0001
  13  drwxr-xr-x   2 0 04096
 30-Aug-2009 14:55 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:57 ..
 debugfs: ls -l //orphan_dir:0002
  14  drwxr-xr-x   2 0 04096
 22-May-2008 11:57 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:57 ..
 debugfs: ls -l //orphan_dir:0003
  15  drwxr-xr-x   2 0 04096
 22-May-2008 11:57 .
  6   drwxr-xr-x   6 0 04096
 22-May-2008 11:57 ..

 Working on /dev/mapper/mpath33p1
 debugfs.ocfs2 1.4.4
 debugfs: ls -l //orphan_dir:
  12  drwxr-xr-x   2 0 04096
 12-Dec-2008 13:41 .
  6   drwxr-xr-x   6 0 04096
 21-Nov-2008 10:54 ..
 debugfs: ls -l //orphan_dir:0001
  13  drwxr-xr-x   2 0 04096
 21-Nov-2008 10:54 .
  6   drwxr-xr-x   6 0 04096
 21-Nov-2008 10:54 ..
 debugfs: ls -l //orphan_dir:0002
  14  drwxr-xr-x   2 0 04096
 21-Nov-2008 10:54 .
  6   drwxr-xr-x   6 0 04096
 21-Nov-2008 10:54 ..
 debugfs: ls -l //orphan_dir:0003
  15  drwxr-xr-x   2 0 04096
 21-Nov-2008 10:54 .
  6   drwxr-xr-x   6 0 04096
 21-Nov-2008 10:54 ..

 [root@ausracdbd01 tmp]#


 

 From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
 Sent: Thursday, September 15, 2011 10:04 AM
 To: Daniel Keisling
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] Syslog reports (ocfs2_wq, 15527,
 2):ocfs2_orphan_del:1841 ERROR: status = -2


 The issue that caused it has been fixed. The fix is here.
 http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=b6f3de3fd54026df74
 8bfd1449bbe31b9803f8f7

 The actual problem could have happened much earlier.
 1.4.4 is showing the messages as it is more aggressive (than 1.4.1)
 in cleaning up the orphans. By default, the fs scans for orphans
 once every 10 mins on a node in the cluster.

 fsck should fix it. I would have

  1   2   3   4   5   6   7   8   9   10   >