Re: [Ocfs2-users] [Ocfs2-devel] size increase
This is because you are specifying a 128k cluster size. Refer to man mkfs.ocfs2 for more.

On Mar 17, 2015 8:04 PM, Umarzuki Mochlis umarz...@gmail.com wrote:

Hi, what I meant by total size is the output of 'du -hs'. The fdisk output on mpath1 of the OCFS2 LUN looks similar to that of the logical volume holding the ext4 partition (255 heads, 63 sectors). It is a two-node OCFS2 cluster.

2015-03-18 10:50 GMT+08:00 Xue jiufei xuejiu...@huawei.com:

Hi Umarzuki, what do you mean by total size: file size or disk usage? If you mean disk usage, I think the difference in cluster size (the minimum allocation unit) may be the cause. Have you checked the cluster size and block size of your ocfs2 and ext4 filesystems? Thanks, Xuejiufei

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
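The arithmetic behind the reply above can be sketched in a few lines: every file is charged a whole number of clusters, so small files on a 128k-cluster ocfs2 volume show far more `du` usage than on a 4k-block ext4 volume. The file size below is a made-up example.

```shell
# Space charged for a file is rounded up to a multiple of the cluster size.
cluster=$((128 * 1024))   # 128k ocfs2 cluster, as in the reply above
filesize=1024             # hypothetical 1 KB file
used=$(( (filesize + cluster - 1) / cluster * cluster ))
echo "charged: $used bytes"   # a full 128 KB cluster for a 1 KB file
```

Multiply that rounding loss across many small files and the `du -hs` gap between the two filesystems follows.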
Re: [Ocfs2-users] OCFS2 “Heartbeat generation mismatch on device” error when mounting iscsi target
If ps aux | grep o2hb does not return anything, it means you are using local heartbeat. That suggests you have a mismatching ocfs2 cluster config file, and I suspect the node where this is failing is the one with the bad copy. Compare the config files from all the nodes and ensure they are the same, or simply replace the one on the failing node with a copy from another node. The file should be identical everywhere. Remember to restart the cluster on that node.

On Mon, Feb 9, 2015 at 2:27 PM, Danijel Krmar danijel.kr...@activecollab.com wrote:

No, nothing there:

$ ps aux | grep o2hb
root 5724 0.0 0.0 8320 888 pts/0 S+ 22:30 0:00 grep --color o2hb

Still the same error if I try to mount the iSCSI disk:

o2hb_check_own_slot:590 ERROR: Heartbeat generation mismatch on device (sdb): expected(2:0x2f32486d4c54730a, 0x54d926d7), ondisk(2:0xb016e6a72676a791, 0x54d926d7)

As said, there are no such problems on other machines, just this one. I can't get my head around this "Heartbeat generation mismatch" error message.

-- Danijel Krmar A51 D.O.O. Novi Sad https://www.activecollab.com/

On February 9, 2015 at 8:09:06 PM, Sunil Mushran (sunil.mush...@gmail.com) wrote:

On node 2, do: ps aux | grep o2hb

I suspect you have multiple o2hb threads running. If so, restart the o2cb cluster on that node.

On Mon, Feb 9, 2015 at 10:08 AM, Danijel Krmar danijel.kr...@activecollab.com wrote:

As said in the title, when I want to mount an iSCSI target on one machine I get the following error:

(o2hb-3F92114867,7826,3):o2hb_check_own_slot:590 ERROR: Heartbeat generation mismatch on device (sdb): expected(2:0xa0cf28215b4b1ed3, 0x54d8a036), ondisk(2:0xb016e6a72676a791, 0x54d8a037)

The same iSCSI target is working on other machines. Any idea what this error means? -- Danijel Krmar A51 D.O.O.
Novi Sad https://www.activecollab.com/

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
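The "compare the config files from all the nodes" step above can be sketched as a checksum loop; the hostnames are placeholders for your cluster nodes, and any node whose checksum differs is the one with the bad copy.

```shell
# Sketch: compare the cluster config on every node (hostnames assumed).
# A differing md5sum identifies the node with the stale/bad config.
for h in node1 node2 node3; do
  ssh -o BatchMode=yes -o ConnectTimeout=2 "$h" \
      md5sum /etc/ocfs2/cluster.conf 2>/dev/null \
    || echo "$h: unreachable"
done
```

After fixing the odd one out, restart the cluster stack on that node as the reply says.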
Re: [Ocfs2-users] How to unlock a blocked resource? Thanks
What is the output of the commands? The protocol is supposed to do the unlocking on its own. See what it is blocked on. It could be that the node holding the lock cannot release it because it cannot flush the journal to disk.

On Tue, Sep 9, 2014 at 7:55 PM, Guozhonghua guozhong...@h3c.com wrote:

Hi all, we are testing with two nodes in one OCFS2 cluster. The cluster hangs, possibly because of a deadlock. Using the debugfs.ocfs2 tool we found that one resource has been held by one node for a long time while the other node is still waiting for it, so the cluster hangs.

debugfs.ocfs2 -R fs_locks -B /dev/dm-0
debugfs.ocfs2 -R dlm_locks LOCKID_XXX /dev/dm-0

How can we unlock the lock held by the node? Is there a command to release the resource? Thanks.

This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
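The two inspection commands from the report can be wrapped into a small sketch; the device path and lock id are placeholders from the original message. One detail worth noting: `-R` takes the whole debugfs command as a single argument, so quoting it is the safer form.

```shell
# Sketch only: device and LOCKID_XXX are placeholders from the report.
# "fs_locks -B" lists only busy lock resources; "dlm_locks <id>" dumps the
# DLM state (holder, level, waiters) for one of them.
DEV=/dev/dm-0
debugfs.ocfs2 -R "fs_locks -B" "$DEV" 2>/dev/null \
  || echo "fs_locks: no ocfs2 device available on this host"
debugfs.ocfs2 -R "dlm_locks LOCKID_XXX" "$DEV" 2>/dev/null \
  || echo "dlm_locks: no ocfs2 device available on this host"
```

The dlm_locks output is what shows which node actually holds the lock, which is the question the reply asks back.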
Re: [Ocfs2-users] FSCK may be failing and corrupting my disk???
fsck cannot determine which of the two inodes is incorrect. In such cases, fsck makes a copy of one of the inodes (with its data) and asks the user to delete the bad file after mounting.

On Sun, Mar 23, 2014 at 7:18 AM, Eric Raskin eras...@paslists.com wrote:

I did some more research by running fsck -fn. Basically it is one inode that is wrong and needs to be cleared. Is there a way to do that via debugfs? If I can delete that one inode, then the doubly-linked clusters will no longer be doubly linked and all of the errors will go away. Isn't that quicker than cloning a bad inode?

On 03/22/2014 09:40 PM, Sunil Mushran wrote:

Cloning the inode means inode + data. Let it finish.

On Sat, Mar 22, 2014 at 3:44 PM, Eric Raskin eras...@paslists.com wrote:

Hi, I am running a two-node Oracle VM Server 2.2.2 installation. We were having some strange problems creating new virtual machines, so I shut down the systems and unmounted the OVS repository (an ocfs2 file system on EqualLogic equipment). I ran fsck -y first, which replayed the logs and said all was clean. But I am pretty sure there are other issues, so I started an fsck -fy. One of the messages I got was:

Cluster 161213953 is claimed by the following inodes: 76289548 /running_pool/450_gebidb/System.img [DUP_CLUSTERS_CLONE] Inode (null) may be cloned or deleted to break the claim it has on its clusters. Clone inode (null) to break claims on clusters it shares with other inodes? y

I then watched the fsck process with strace -p to see what was happening, since it was taking a long time with no messages.
I see:

pwrite64(3, INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096, 90112) = 4096
pwrite64(3, EXBLK01\0\0\0\0\0\0\0\0\0\0\0+\3H\26O}\306\374\0\0\0\0\0..., 4096, 10465599488) = 4096
pwrite64(3, GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096, 10462699520) = 4096
pwrite64(3, INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096, 90112) = 4096
pwrite64(3, EXBLK01\0\0\0\0\0\0\0\0\0\0\0/\3H\26O}\302\374\0\0\0\0\0..., 4096, 10465583104) = 4096
pwrite64(3, GROUP01\0\300\17\0\4Q\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096, 10462699520) = 4096
pwrite64(3, INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096, 90112) = 4096
pwrite64(3, EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096, 10465558528) = 4096
pwrite64(3, INODE01\0H\26O}\0\0L\0\0\0\0\0\24\346\17\0\0\0\0\0\0\0\0\0..., 4096, 2686701568) = 4096
pwrite64(3, GROUP01\0\300\17\0~\3\0#\0H\26O}\0\0\0\0\0n\0\1\0\0\0\0..., 4096, 100940120064) = 4096
pwrite64(3, INODE01\0H\26O}\377\377\7\0\0\0\0\0\0\6\0\30\0\0\0\0\0\0\0\0..., 4096, 45056) = 4096
pwrite64(3, GROUP01\0\300\17\0\4P\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096, 10462699520) = 4096
pwrite64(3, INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096, 90112) = 4096
pwrite64(3, EXBLK01\0\0\0\0\0\0\0\0\0\0\0\272\2H\26O}\274\374\0\0\0\0\0..., 4096, 10465558528) = 4096
pwrite64(3, EXBLK01\0\0\0\0\0\0\0\0\0\0\0003\3H\26O}\274\374\0\0\0\0\0..., 4096, 10465558528) = 4096
pwrite64(3, GROUP01\0\300\17\0\4O\0\0\0H\26O}\0\0\0\0\0\0\0\0\0\0\0\0..., 4096, 10462699520) = 4096
pwrite64(3, INODE01\0H\26O}\377\377\22\0\0\0\0\0\0$\0\0\0\0\0\0\0\0\0\0..., 4096, 90112) = 4096

This goes on and on. It looks like it is writing lots of entries to fix one duplicate inode??? At this point, I have aborted the fsck, as I am worried that it is completely trashing our OVS repository disk. Can anybody shed some light on this before I restart the fsck? We need to be back up and running ASAP!
Thanks in advance!

--
Eric H. Raskin
Professional Advertising Systems Inc.
200 Business Park Dr Suite 304, Armonk, NY 10504
914-765-0500 x120, fax 914-765-0503
eras...@paslists.com
http://www.paslists.com

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
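A cautious workflow for this kind of situation is to run the forced check read-only first, so the scope of the damage is known before any repair is committed. The device path below is a placeholder.

```shell
# Sketch (device is a placeholder). -fn forces a full check but answers
# "no" to every repair prompt, so nothing is written; review that output
# before running the real repair with -fy.
DEV=/dev/mapper/ovsrepo
fsck.ocfs2 -fn "$DEV" 2>/dev/null \
  || echo "read-only check not possible on this host"
# Only after reviewing the -fn report:
# fsck.ocfs2 -fy "$DEV"
```

After a DUP_CLUSTERS_CLONE repair finishes, mount the volume and delete whichever of the two claiming files is the bad one, as the reply at the top describes.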
Re: [Ocfs2-users] How to break out of the endless loop in the recovery thread? Thanks a lot.
It is encountering SCSI errors reading the device. Fixing that will fix the issue. If you want to stop the logging, I don't believe there is a method right now, but it could be trivially added: allow the user to disable mlog(ML_ERROR) logging.

On Thu, Oct 31, 2013 at 7:38 PM, Guozhonghua guozhong...@h3c.com wrote:

Hi everyone, I have an OCFS2 issue. The OS is Ubuntu, using Linux kernel 3.2.50. There are three nodes in the OCFS2 cluster, and all of them use an HP 4330 iSCSI SAN as storage. When the storage restarted, two nodes were fenced and restarted because their heartbeat writes to the storage failed. But the last one did not restart, and it keeps writing error messages to syslog like these:

Oct 30 02:01:01 server177 kernel: [25786.227598] (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5
Oct 30 02:01:01 server177 kernel: [25786.227615] (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5
Oct 30 02:01:01 server177 kernel: [25786.227631] (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5
Oct 30 02:01:01 server177 kernel: [25786.227648] (ocfs2rec,14787,13):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering node 2 on device (8,32)!
Oct 30 02:01:01 server177 kernel: [25786.227670] (ocfs2rec,14787,13):__ocfs2_recovery_thread:1359 ERROR: Volume requires unmount.
Oct 30 02:01:01 server177 kernel: [25786.227696] sd 4:0:0:0: [sdc] Unhandled error code
Oct 30 02:01:01 server177 kernel: [25786.227707] sd 4:0:0:0: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Oct 30 02:01:01 server177 kernel: [25786.227726] sd 4:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 13 40 00 00 08 00
Oct 30 02:01:01 server177 kernel: [25786.227792] end_request: recoverable transport error, dev sdc, sector 4928
Oct 30 02:01:01 server177 kernel: [25786.227812] (ocfs2rec,14787,13):ocfs2_read_journal_inode:1463 ERROR: status = -5
Oct 30 02:01:01 server177 kernel: [25786.227830] (ocfs2rec,14787,13):ocfs2_replay_journal:1496 ERROR: status = -5
Oct 30 02:01:01 server177 kernel: [25786.227848] (ocfs2rec,14787,13):ocfs2_recover_node:1652 ERROR: status = -5
...
Oct 30 06:48:41 server177 kernel: [43009.457816] sd 4:0:0:0: [sdc] Unhandled error code
Oct 30 06:48:41 server177 kernel: [43009.457826] sd 4:0:0:0: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Oct 30 06:48:41 server177 kernel: [43009.457843] sd 4:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 13 40 00 00 08 00
Oct 30 06:48:41 server177 kernel: [43009.457911] end_request: recoverable transport error, dev sdc, sector 4928
Oct 30 06:48:41 server177 kernel: [43009.457930] (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5
Oct 30 06:48:41 server177 kernel: [43009.457946] (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5
Oct 30 06:48:41 server177 kernel: [43009.457960] (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5
Oct 30 06:48:41 server177 kernel: [43009.457975] (ocfs2rec,14787,9):__ocfs2_recovery_thread:1358 ERROR: Error -5 recovering node 2 on device (8,32)!
Oct 30 06:48:41 server177 kernel: [43009.457996] (ocfs2rec,14787,9):__ocfs2_recovery_thread:1359 ERROR: Volume requires unmount.
Oct 30 06:48:41 server177 kernel: [43009.458021] sd 4:0:0:0: [sdc] Unhandled error code
Oct 30 06:48:41 server177 kernel: [43009.458031] sd 4:0:0:0: [sdc] Result: hostbyte=DID_TRANSPORT_FAILFAST driverbyte=DRIVER_OK
Oct 30 06:48:41 server177 kernel: [43009.458049] sd 4:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 13 40 00 00 08 00
Oct 30 06:48:41 server177 kernel: [43009.458117] end_request: recoverable transport error, dev sdc, sector 4928
Oct 30 06:48:41 server177 kernel: [43009.458137] (ocfs2rec,14787,9):ocfs2_read_journal_inode:1463 ERROR: status = -5
Oct 30 06:48:41 server177 kernel: [43009.458153] (ocfs2rec,14787,9):ocfs2_replay_journal:1496 ERROR: status = -5
Oct 30 06:48:41 server177 kernel: [43009.458168] (ocfs2rec,14787,9):ocfs2_recover_node:1652 ERROR: status = -5

The same log messages repeat, and the syslog grows very quickly; it can occupy all the remaining capacity of the system partition (/), at which point the host blocks and gives no response. Judging from the log, there may be an endless loop in the function __ocfs2_recovery_thread that produces the huge syslog:

__ocfs2_recovery_thread {
    while (rm->rm_used) {
        ...
        status = ocfs2_recover_node(osb, node_num, slot_num);
skip_recovery:
        if (!status) {
Re: [Ocfs2-users] How do I check fragmentation amount?
debugfs.ocfs2 -R "frag filespec" device (quoted, so that -R receives the whole command as one argument) will show you the fragmentation level on a per-inode basis. You could run that for all inodes and work out the value for the entire volume.

On Fri, Nov 1, 2013 at 3:00 PM, Andy ary...@allantgroup.com wrote:

How can I check the amount of fragmentation on an OCFS2 volume? Thanks, Andy

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
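The "run that for all inodes" suggestion can be sketched as a loop over every regular file on the volume; the device and mount point below are placeholders. A high extent count relative to file size indicates a fragmented file.

```shell
# Sketch (device and mount point are placeholders): run "frag" for every
# regular file on an ocfs2 volume. frag takes a path relative to the
# filesystem root, so the mount-point prefix is stripped.
DEV=/dev/sdb1
MNT=/mnt/ocfs2
if [ -d "$MNT" ]; then
  find "$MNT" -type f | while read -r f; do
    debugfs.ocfs2 -R "frag ${f#$MNT}" "$DEV" 2>/dev/null
  done
else
  echo "no such mount point on this host"
fi
```

Summing or averaging the reported extent counts then gives a rough volume-wide figure, as the reply suggests.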
Re: [Ocfs2-users] OCFS2 tuning, fragmentation and localalloc option. Cluster hanging during mixed read+write workloads
If the storage connectivity is not stable, then dlm issues are to be expected. In this case, the processes are all trying to take the read lock. One possible scenario is that the node holding the write lock is not able to relinquish it because it cannot flush the updated inodes to disk. I would suggest you look into load balancing and how it affects the iSCSI connectivity from the hosts.

On Tue, Aug 6, 2013 at 2:51 PM, Gavin Jones gjo...@where2getit.com wrote:

Hello Goldwyn, thanks for taking a look at this. So, then, it does seem to be DLM related. We were running fine for a few weeks and then it came up again this morning and has been going on throughout the day.

Regarding the DLM debugging, I allowed debugging for DLM_GLUE, DLM_THREAD, DLM_MASTER and DLM_RECOVERY. However, I don't see any DLM logging output in dmesg or syslog -- is there perhaps another way to get at the actual DLM log? I've searched around a bit but didn't find anything that made it clear.

As for OCFS2 and iSCSI communications, they use the same physical network interface but different VLANs on that interface. The connectionX:0 errors, then, seem to indicate an issue with the iSCSI connection. The system logs and monitoring software don't show any warnings or errors about the interface going down, so the only thing I can think of is the connection load balancing on the SAN, though that's merely a hunch. Maybe I should mail the list and see if anyone has a similar setup.

If you could please point me in the right direction to make use of the DLM debugging via debugfs.ocfs2, I would appreciate it. Thanks again, Gavin W. Jones, Where 2 Get It, Inc.

On Tue, Aug 6, 2013 at 4:16 PM, Goldwyn Rodrigues rgold...@suse.de wrote:

Hi Gavin,

On 08/06/2013 01:59 PM, Gavin Jones wrote:

Hi Goldwyn, apologies for the delayed reply.
The hung Apache process / OCFS issue cropped up again, so I thought I'd pass along the contents of /proc/pid/stack of a few affected processes:

gjones@slipapp02:~ sudo cat /proc/27521/stack
[811663b4] poll_schedule_timeout+0x44/0x60
[81166d56] do_select+0x5a6/0x670
[81166fbe] core_sys_select+0x19e/0x2d0
[811671a5] sys_select+0xb5/0x110
[815429bd] system_call_fastpath+0x1a/0x1f
[7f394bdd5f23] 0x7f394bdd5f23
[] 0x

gjones@slipapp02:~ sudo cat /proc/27530/stack
[81249721] sys_semtimedop+0x5a1/0x8b0
[815429bd] system_call_fastpath+0x1a/0x1f
[7f394bdddb77] 0x7f394bdddb77
[] 0x

gjones@slipapp02:~ sudo cat /proc/27462/stack
[81249721] sys_semtimedop+0x5a1/0x8b0
[815429bd] system_call_fastpath+0x1a/0x1f
[7f394bdddb77] 0x7f394bdddb77
[] 0x

gjones@slipapp02:~ sudo cat /proc/27526/stack
[81249721] sys_semtimedop+0x5a1/0x8b0
[815429bd] system_call_fastpath+0x1a/0x1f
[7f394bdddb77] 0x7f394bdddb77
[] 0x

Additionally, in dmesg I see, for example:

[774981.361149] (/usr/sbin/httpd,8266,3):ocfs2_unlink:951 ERROR: status = -2
[775896.135467] (/usr/sbin/httpd,8435,3):ocfs2_check_dir_for_entry:2119 ERROR: status = -17
[775896.135474] (/usr/sbin/httpd,8435,3):ocfs2_mknod:459 ERROR: status = -17
[775896.135477] (/usr/sbin/httpd,8435,3):ocfs2_create:629 ERROR: status = -17
[788406.624126] connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991450, last ping 4491992701, now 4491993952
[788406.624138] connection1:0: detected conn error (1011)
[788406.640132] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991451, last ping 4491992702, now 4491993956
[788406.640142] connection2:0: detected conn error (1011)
[788406.928134] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991524, last ping 4491992775, now 4491994028
[788406.928150] connection4:0: detected conn error (1011)
[788406.944147] connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991528, last ping 4491992779, now 4491994032
[788406.944165] connection5:0: detected conn error (1011)
[788408.640123] connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4491991954, last ping 4491993205, now 4491994456
[788408.640134] connection3:0: detected conn error (1011)
[788409.907968] connection1:0: detected conn error (1020)
[788409.908280] connection2:0: detected conn error (1020)
[788409.912683] connection4:0: detected conn error (1020)
[788409.913152] connection5:0: detected conn error (1020)
[788411.491818] connection3:0: detected conn error (1020)

that repeats for a bit and then I see

[1952161.012214] INFO: task /usr/sbin/httpd:27491 blocked for more than 480 seconds.
[1952161.012219] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this
Re: [Ocfs2-users] High inodes usage
How did you figure this out? Also, which version of the kernel are you using?

On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel be.nicolas.mic...@gmail.com wrote:

Hello guys, I'm using OCFS2 for shared storage (on a SAN). I just saw that the inode usage is really high even though these filesystems are used for Oracle data storage, so there are really just a few big files. I don't understand why the inode usage is so high with so few big files (as an example: one of the filesystems has 16 files and directories, but almost all of its ~26 million inodes are used!).

My questions:
- can the inode usage be a problem in such a situation?
- if it is: how can I reduce the number used, or increase the pool of available inodes?
- why are so many inodes used for so few files? I was sure that traditionally one inode is used per file or directory.

-- Nicolas MICHEL

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] High inodes usage
That number is typically calculated, so it could just be bad arithmetic. But that should not affect the other ops.

On Wed, Jul 3, 2013 at 12:40 PM, Nicolas Michel be.nicolas.mic...@gmail.com wrote:

I don't know if it's the root cause of my problems or if it causes any problem at all. But I have some stability issues on the cluster, so I'm investigating anything that could be suspect. My question is: is it normal behavior for df -i to show a high inode usage percentage like 98, 99 or 100%? (A touch on the filesystem with 100% inode usage still creates a file, so I suppose it is not causing any problem, but I found it weird.)

2013/7/3 Sunil Mushran sunil.mush...@gmail.com:

That is old. It could just be a minor bug in that release. Is it causing you any problems?

On Wed, Jul 3, 2013 at 12:31 PM, Nicolas Michel be.nicolas.mic...@gmail.com wrote:

Hello Sunil, I checked the inode usage with df -i. I can't check the kernel version running on the system right now because I'm not at work, but it's a SLES 10 SP2, so a pretty old kernel I suppose. Nicolas

2013/7/3 Sunil Mushran sunil.mush...@gmail.com:

How did you figure this out? Also, which version of the kernel are you using?

On Wed, Jul 3, 2013 at 1:05 AM, Nicolas Michel be.nicolas.mic...@gmail.com wrote:

Hello guys, I'm using OCFS2 for shared storage (on a SAN). I just saw that the inode usage is really high even though these filesystems are used for Oracle data storage, so there are really just a few big files. I don't understand why the inode usage is so high with so few big files (as an example: one of the filesystems has 16 files and directories, but almost all of its ~26 million inodes are used!). My questions: can the inode usage be a problem in such a situation? If it is, how can I reduce the number used, or increase the pool of available inodes? Why are so many inodes used for so few files? I was sure that traditionally one inode is used per file or directory.
-- Nicolas MICHEL

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
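The check the poster describes works the same on any mounted filesystem; the IUse% column of df is the figure in question, and (as the reply notes) on ocfs2 the inode total behind it is a calculated estimate rather than a fixed pool.

```shell
# Inspect the inode-usage percentage for the root filesystem.
# -P (POSIX format) keeps each filesystem on a single line so awk's
# column numbering is stable; column 5 of `df -Pi` is IUse%.
df -Pi / | awk 'NR==2 {print "inode usage on /: " $5}'
```

Substitute the ocfs2 mount point for `/` to reproduce the poster's observation.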
Re: [Ocfs2-users] Problems with volumes coming from RHEL5 going to OEL6
Can you dump the following using the 1.8 binary: debugfs.ocfs2 -R stats /dev/mapper/.

On Fri, Jun 21, 2013 at 6:17 AM, Ulf Zimmermann u...@openlane.com wrote:

We have a production cluster of 6 nodes, currently running RHEL 5.8 with OCFS2 1.4.10. We snapclone these volumes to multiple destinations, one of them a RHEL 4 machine with OCFS2 1.2.9. Because of that, the volumes are set up so that we can read them there.

We are now trying to bring up a new server; this one has OEL 6.3 on it, and it comes with OCFS2 1.8.0 and tools 1.8.0-10. I can use tunefs.ocfs2 --cloned-volume to reset the UUID, but when I try to change the label I get:

[root@co-db03 ulf]# tunefs.ocfs2 -L /export/backuprecovery.AUCP /dev/mapper/aucp_data_bk_2_x
tunefs.ocfs2: Invalid name for a cluster while opening device /dev/mapper/aucp_data_bk_2_x

fsck.ocfs2 core dumps with the following; I also filed a bug on Bugzilla for that:

[root@co-db03 ulf]# fsck.ocfs2 /dev/mapper/aucp_data_bk_2_x
fsck.ocfs2 1.8.0
*** glibc detected *** fsck.ocfs2: double free or corruption (fasttop): 0x0197f320 ***
=== Backtrace: =
/lib64/libc.so.6[0x3656475366]
fsck.ocfs2[0x434c31]
fsck.ocfs2[0x403bc2]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x365641ecdd]
fsck.ocfs2[0x402879]
=== Memory map:
0040-0045 r-xp fc:00 12489 /sbin/fsck.ocfs2
0064f000-00651000 rw-p 0004f000 fc:00 12489 /sbin/fsck.ocfs2
00651000-00652000 rw-p 00:00 0
0085-00851000 rw-p 0005 fc:00 12489 /sbin/fsck.ocfs2
0197e000-0199f000 rw-p 00:00 0 [heap]
3655c0-3655c2 r-xp fc:00 8797 /lib64/ld-2.12.so
3655e1f000-3655e2 r--p 0001f000 fc:00 8797 /lib64/ld-2.12.so
3655e2-3655e21000 rw-p 0002 fc:00 8797 /lib64/ld-2.12.so
3655e21000-3655e22000 rw-p 00:00 0
365640-3656589000 r-xp fc:00 8798 /lib64/libc-2.12.so
3656589000-3656788000 ---p 00189000 fc:00 8798 /lib64/libc-2.12.so
3656788000-365678c000 r--p 00188000 fc:00 8798 /lib64/libc-2.12.so
365678c000-365678d000 rw-p 0018c000 fc:00 8798 /lib64/libc-2.12.so
365678d000-3656792000 rw-p 00:00 0
3659c0-3659c16000 r-xp fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659c16000-3659e15000 ---p 00016000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3659e15000-3659e16000 rw-p 00015000 fc:00 8802 /lib64/libgcc_s-4.4.6-20120305.so.1
3d3e80-3d3e817000 r-xp fc:00 12028 /lib64/libpthread-2.12.so
3d3e817000-3d3ea17000 ---p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea17000-3d3ea18000 r--p 00017000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea18000-3d3ea19000 rw-p 00018000 fc:00 12028 /lib64/libpthread-2.12.so
3d3ea19000-3d3ea1d000 rw-p 00:00 0
3e2660-3e26603000 r-xp fc:00 426 /lib64/libcom_err.so.2.1
3e26603000-3e26802000 ---p 3000 fc:00 426 /lib64/libcom_err.so.2.1
3e26802000-3e26803000 r--p 2000 fc:00 426 /lib64/libcom_err.so.2.1
3e26803000-3e26804000 rw-p 3000 fc:00 426 /lib64/libcom_err.so.2.1
7fb063711000-7fb063714000 rw-p 00:00 0
7fb06371d000-7fb06372 rw-p 00:00 0
7fffd5b95000-7fffd5bb6000 rw-p 00:00 0 [stack]
7fffd5bc5000-7fffd5bc6000 r-xp 00:00 0 [vdso]
ff60-ff601000 r-xp 00:00 0 [vsyscall]
Abort (core dumped)

I think one of the main questions is what "Invalid name for a cluster while trying to join the group" or "Invalid name for a cluster while opening device" means. I am pretty sure that /etc/sysconfig/o2cb and /etc/ocfs2/cluster.conf are correct. Ulf.

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Unable to set the o2cb heartbeat to global
Support for global heartbeat was added in ocfs2-tools 1.8.

On Tue, Jun 4, 2013 at 8:31 AM, Vineeth Thampi vineeth.tha...@gmail.com wrote:

Hi, I have set the heartbeat mode to global, but when I do a mkfs and mount and then check the mount, it says I am in local mode. Even /sys/kernel/config/cluster/ocfs2/heartbeat/mode says local. I am running CentOS with a 3.x kernel, with ocfs2-tools-1.6.4-1118.

mkfs -t ocfs2 -b 4K -C 1M -N 16 --cluster-stack=o2cb /dev/sdb
mount -t ocfs2 /dev/sdb /mnt -o noatime,data=writeback,nointr,commit=60,coherency=buffered

==
node:
    ip_port =
    ip_address = 10.81.2.108
    number = 1
    name = cam-st08
    cluster = ocfs2

cluster:
    node_count = 2
    heartbeat_mode = global
    name = ocfs2
==

root@cam-st07 log # mount | grep sdb
/dev/sdb on /mnt type ocfs2 (rw,_netdev,noatime,data=writeback,nointr,commit=60,coherency=buffered,heartbeat=local)

Any help would be much appreciated. Thanks, Vineeth

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
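For readers hitting the same wall: my understanding is that with ocfs2-tools 1.8 a global-heartbeat setup also declares the heartbeat region in /etc/ocfs2/cluster.conf itself, alongside the heartbeat_mode setting the poster already has. A sketch, with the region UUID as a placeholder for the UUID of the dedicated heartbeat device (verify the exact stanza layout against your tools' ocfs2.cluster.conf documentation):

```
cluster:
    heartbeat_mode = global
    node_count = 2
    name = ocfs2

heartbeat:
    region = <UUID-of-heartbeat-device>
    cluster = ocfs2
```

With the 1.6 tools installed here, the heartbeat_mode line is simply not understood, which matches the `heartbeat=local` seen in the mount output.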
Re: [Ocfs2-users] What is the overhead/disk loss of formatting an ocfs2 filesystem?
-N 16 means 16 journals. I think it defaults to 256M journals, so that's 4G. Do you plan to mount it on 16 nodes? If not, reduce that. Another option is a smaller journal, but you have to be careful, as a journal that is too small could limit your write throughput.

On Mon, Apr 15, 2013 at 1:37 PM, Jerry Smith jds...@sandia.gov wrote:

Good afternoon, I have an OEL 6.3 box with a few ocfs2 mounts mounted locally, and was wondering what I should expect to lose to formatting etc. from a disk usage standpoint.

-bash-4.1$ df -h | grep ocfs2
/dev/dm-15   12G  1.3G   11G  11% /ocfs2/redo0
/dev/dm-13  120G  4.2G  116G   4% /ocfs2/software-master
/dev/dm-10   48G  4.1G   44G   9% /ocfs2/arch0
/dev/dm-14  2.5T  6.7G  2.5T   1% /ocfs2/ora01
/dev/dm-11  1.5T  5.7G  1.5T   1% /ocfs2/ora02
/dev/dm-17  100G  4.2G   96G   5% /ocfs2/ora03
/dev/dm-12  200G  4.3G  196G   3% /ocfs2/ora04
/dev/dm-16  3.0T  7.3G  3.0T   1% /ocfs2/orabak01

For example, ora04 is 196GB total, but with zero usage it shows 4.3GB used:

[root@oeldb10 ~]# df -h /ocfs2/ora04
Filesystem   Size  Used Avail Use% Mounted on
/dev/dm-12   200G  4.3G  196G   3% /ocfs2/ora04

[root@oeldb10 ~]# find /ocfs2/ora04/ | wc -l
3
[root@oeldb10 ~]# find /ocfs2/ora04/ -exec du -sh {} \;
0  /ocfs2/ora04/
0  /ocfs2/ora04/lost+found
0  /ocfs2/ora04/db66snlux

Filesystems were formatted via mkfs -t ocfs2 -N 16 --fs-features=xattr,local -L ${device} ${device}. Mount options:

[root@oeldb10 ~]# mount | grep ora04
/dev/dm-12 on /ocfs2/ora04 type ocfs2 (rw,_netdev,nointr,user_xattr,heartbeat=none)

Thanks, --Jerry

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
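The reply's estimate can be checked with trivial arithmetic: one journal is created per node slot, so the overhead scales with -N. The 256 MB default is the assumption stated in the reply above.

```shell
# Rough formatting-overhead estimate: one journal per node slot.
slots=16        # mkfs -N 16, as used by the poster
journal_mb=256  # default journal size assumed in the reply above
echo "journal overhead: $(( slots * journal_mb / 1024 )) GB"
```

That accounts for roughly the 4.2-4.3G "used" seen on the empty volumes. If only a few nodes will ever mount a volume, something like `mkfs.ocfs2 -N 4` (optionally with a smaller `-J size=`) shrinks the reservation; check the flags against your mkfs.ocfs2 man page before relying on them.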
Re: [Ocfs2-users] Significant Slowdown when writing and deleting files at the same time
Are you mounting -o writeback? On Fri, Mar 29, 2013 at 12:28 PM, Andy ary...@allantgroup.com wrote: I have been having performance issues from time to time on our production ocfs2 volumes, so I set up a test system to try to reproduce what I was seeing on the production systems. This is what I found out: I have a 2-node test system sharing a 2TB volume with a journal size of 256MB. I can easily trigger the slowdown by starting two processes that each write a 10GB file, then deleting a different large file (7GB+) while the other processes are writing. The slowdown is significant and very disruptive. Not only did it take over 3 minutes to delete the file, everything else pauses when entering that directory too. A du command will stall, and NFS clients accessing that file system will think the server is not responding. Under heavier amounts of writes, I have had a delete take 13 minutes for an 8GB file, and NFS mounts return I/O errors. We often deal with large files, so this situation is fairly common. I would like any ideas that would provide smoother performance of the OCFS2 volume and somehow eliminate the long pauses during deletes. Thanks, Andy
Re: [Ocfs2-users] [OCFS2] Crash at o2net_shutdown_sc()
[ 1481.620253] o2hb: Unable to stabilize heartbeart on region 1352E2692E704EEB8040E5B8FF560997 (vdb) What this means is that the device is suspect. o2hb writes are not hitting the disk. vdb is accepting and acknowledging the write but spitting out something else during the next read. Heartbeat detects this and aborts, as it should. Then we hit a race during socket close that triggers the oops. Yes, that needs to be fixed. But you also need to fix vdb... what appears to be a virtual device. On Fri, Mar 1, 2013 at 1:25 PM, richard -rw- weinberger richard.weinber...@gmail.com wrote: Hi! Using 3.8.1 OCFS2 crashes while joining nodes to the cluster. The cluster consists of 10 nodes, while node3 joins the kernel on node3 crashes. (Somtimes later...) See dmesg below. Is this a known issue? I didn't test older kernels so far. node1: [ 1471.881922] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997 ( 0 ) 1 nodes [ 1471.919522] JBD2: Ignoring recovery information on journal [ 1471.947027] ocfs2: Mounting device (253,16) on (node 0, slot 0) with ordered data mode. [ 1475.802497] o2net: Accepted connection from node node2 (num 1) at 192.168.66.2: [ 1481.814048] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 8 [ 1481.814955] o2net: No longer connected to node node2 (num 1) at 192.168.66.2: [ 1482.468827] o2net: Accepted connection from node node3 (num 2) at 192.168.66.3: [ 1511.904100] o2net: No connection established with node 1 after 30.0 seconds, giving up. [ 1514.472995] o2net: Connection to node node3 (num 2) at 192.168.66.3: shutdown, state 8 [ 1514.473960] o2net: No longer connected to node node3 (num 2) at 192.168.66.3: [ 1516.076044] o2net: Accepted connection from node node2 (num 1) at 192.168.66.2: [ 1520.181430] o2dlm: Node 1 joins domain 1352E2692E704EEB8040E5B8FF560997 ( 0 1 ) 2 nodes [ 1544.544030] o2net: No connection established with node 2 after 30.0 seconds, giving up. 
[ 1574.624029] o2net: No connection established with node 2 after 30.0 seconds, giving up. node2: [ 1475.613170] o2net: Connected to node node1 (num 0) at 192.168.66.1: [ 1481.620253] o2hb: Unable to stabilize heartbeart on region 1352E2692E704EEB8040E5B8FF560997 (vdb) [ 1481.622489] o2net: No longer connected to node node1 (num 0) at 192.168.66.1: [ 1515.886605] o2net: Connected to node node1 (num 0) at 192.168.66.1: [ 1519.992766] o2dlm: Joining domain 1352E2692E704EEB8040E5B8FF560997 ( 0 1 ) 2 nodes [ 1520.017054] JBD2: Ignoring recovery information on journal [ 1520.07] ocfs2: Mounting device (253,16) on (node 1, slot 1) with ordered data mode. [ 1520.159590] mount.ocfs2 (2186) used greatest stack depth: 2568 bytes left node3: [ 1482.836865] o2net: Connected to node node1 (num 0) at 192.168.66.1: [ 1482.837542] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1484.840952] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1486.844994] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1488.848952] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1490.853052] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1492.857046] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1494.861042] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1496.865024] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1498.869021] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1500.873016] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1502.877056] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1504.881042] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1506.885040] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1508.888991] 
o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1510.893077] o2net: Connection to node node2 (num 1) at 192.168.66.2: shutdown, state 7 [ 1512.843172] (mount.ocfs2,2179,0):dlm_request_join:1477 ERROR: Error -107 when sending message 510 (key 0x666c6172) to node 1 [ 1512.845580] (mount.ocfs2,2179,0):dlm_try_to_join_domain:1653 ERROR: status = -107 [ 1512.847778] (mount.ocfs2,2179,0):dlm_join_domain:1955 ERROR: status = -107 [ 1512.849334] (mount.ocfs2,2179,0):dlm_register_domain:2214 ERROR: status = -107 [ 1512.850921] (mount.ocfs2,2179,0):o2cb_cluster_connect:368 ERROR: status = -107 [ 1512.852511] (mount.ocfs2,2179,0):ocfs2_dlm_init:3004 ERROR: status = -107 [ 1512.854090] (mount.ocfs2,2179,0):ocfs2_mount_volume:1881 ERROR: status = -107 [ 1512.855476] ocfs2: Unmounting device (253,16) on (node 0) [
Re: [Ocfs2-users] OCFS ..Inode contains a hole at offset...
This is probably a directory. debugfs.ocfs2 -R 'stat 52663' /dev/ will dump the inode. Are you sure fsck is fixing it? Does the output show this block getting fixed? If not, you may want to run fsck.ocfs2 v1.8. I think a fix for it was added. On Wed, Feb 20, 2013 at 1:01 AM, Fiorenza Meini fme...@esseweb.eu wrote: Hi there, I have a partition formatted with ocfs2 (1.6.3) on a 2.6.37 Linux kernel system. This partition is managed by a cluster (corosync/pacemaker). The backend of this ocfs2 partition is drbd on LVM. I see this line in the messages log file: ocfs2_read_virt_blocks:871 ERROR: Inode #52663 contains a hole at offset 69632 The error is reported more than once and the offset is always the same. When I do a check on this partition, errors are found and resolved, but in a short time the problem appears again. I can't understand at what level the problem is: * kernel? * hardware? * lvm + drbd? Are there tools that can be used to understand? Any suggestion? Thanks and regards. Fiorenza -- Fiorenza Meini Spazio Web S.r.l. V. Dante Alighieri, 10 - 13900 Biella Tel.: 015.2431982 - 015.9526066 Fax: 015.2522600 Reg. Imprese, CF e P.I.: 02414430021 Iscr. REA: BI - 188936 Iscr. CCIAA: Biella - 188936 Cap. Soc.: 30.000,00 Euro i.v.
Re: [Ocfs2-users] ocfs cluster node keeps rebooting
1.2.5 is 6+ year old release. You may want to use something more current. On Mon, Jan 14, 2013 at 12:06 PM, Bill Zha lfl200...@yahoo.com wrote: Hi Sunil and All, We have a 10 Redhat4.2-node OCFS cluster running on version 1.2.5-6. One of the node started to rebooted almost everyday since last week. The entire cluster had been stable for the past 1 year or so. I captured the following console output, can you or someone had the similar issue let me know what the possible cause of these reboots? (25271,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156758.101016 now 1358156788.97593 dr 1358156758.101008 adv 1358156758.101022:1358156758.101024 func (5d21e188:507) 1357953447.247097:1357953447.247100) (25267,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156758.666788 now 1358156788.663604 dr 1358156760.666794 adv 1358156758.666793:1358156758.666795 func (5d21e188:505) 1357953453.107343:1357953453.107349) (25267,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156758.848933 now 1358156788.953367 dr 1358156760.847939 adv 1358156758.848939:1358156758.848941 func (0e6eb1eb:505) 1357965605.352156:1357965605.352162) (25267,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156759.108373 now 1358156789.243003 dr 1358156761.108392 adv 1358156759.108376:1358156759.108378 func (af22ae1f:502) 1357914301.741127:1357914301.741130) (25275,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156759.626366 now 1358156789.623629 dr 1358156789.622319 adv 1358156759.626369:1358156759.626371 func (abd851aa:505) 1357965605.363679:1357965605.363685) (25275,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156759.656350 now 1358156789.913330 dr 1358156761.656039 adv 1358156759.656354:1358156759.656355 func (0e6eb1eb:502) 
1357907401.318584:1357907401.318587) (25275,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156759.663467 now 1358156790.203323 dr 1358156761.662745 adv 1358156759.663470:1358156759.663472 func (7dcded64:502) 1357875986.764566:1357875986.764568) (25275,4):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1358156759.987324 now 1358156790.493342 dr 1358156761.987117 adv 1358156759.987327:1358156759.987329 func (6bcd2bc6:502) 1357875995.47:1357875995.55) (25,7):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-14 after 18 milliseconds Heartbeat thread (25) printing last 24 blocking operations (cur = 11): Heartbeat thread stuck at msleep, stuffing current time into that blocker (index 11) Index 12: took 0 ms to do allocating bios for read Index 13: took 0 ms to do bio alloc read Index 14: took 0 ms to do bio add page read Index 15: took 0 ms to do bio add page read Index 16: took 0 ms to do submit_bio for read Index 17: took 0 ms to do waiting for read completion Index 18: took 0 ms to do bio alloc write Index 19: took 0 ms to do bio add page write Index 20: took 0 ms to do submit_bio for write Index 21: took 0 ms to do checking slots Index 22: took 0 ms to do waiting for write completion Index 23: took 100897 ms to do msleep Index 0: took 0 ms to do allocating bios for read Index 1: took 0 ms to do bio alloc read Index 2: took 0 ms to do bio add page read Index 3: took 0 ms to do bio add page read Index 4: took 0 ms to do submit_bio for read Index 5: took 0 ms to do waiting for read completion Index 6: took 0 ms to do bio alloc write Index 7: took 0 ms to do bio add page write Index 8: took 0 ms to do submit_bio for write Index 9: took 0 ms to do checking slots Index 10: took 0 ms to do waiting for write completion Index 11: took 313 ms to do msleep *** ocfs2 is very sorry to be fencing this system by restarting *** Thank you so much for your help! 
Bill ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] asynchronous hwclocks
The fs does not care about time. It should have no effect on the cluster. However the apps may care and may behave erratically. On Jan 3, 2013, at 3:13 PM, Medienpark, Jakob Rößler roess...@medienpark.net wrote: Hello list, today I noticed huge differences between the hardware clocks in our cluster. Some details: root@www01:~# hwclock;date Do 03 Jan 2013 09:32:09 CET -0.626096 seconds Do 3. Jan 09:34:54 CET 2013 root@www02:~# hwclock;date Do 03 Jan 2013 09:32:09 CET -0.626091 seconds Do 3. Jan 09:34:54 CET 2013 root@www03:~# hwclock;date Do 03 Jan 2013 09:34:54 CET -0.625820 seconds Do 3. Jan 09:34:54 CET 2013 root@storage:~# hwclock;date Do 03 Jan 2013 08:34:54 CET -0.641532 seconds Do 3. Jan 09:34:54 CET 2013 The server 'storage' is the server which provides the iscsi device to www01-03. Because the cluster was very unstable during load peaks, I want to ask you what kind of effects it will have to ocfs2 if the hwclocks are asynchronous like shown above. Thanks in advance Jakob ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Is this a valid configuration?
This is normal. My only concern is the use of very old kernel/fs versions. On Wed, Dec 5, 2012 at 3:08 AM, Neil campbell.n...@hotmail.com wrote: Anyone? On 2012-11-28 00:47:56 + neil campbell campbell.n...@hotmail.com wrote: Hi list, I am running OCFS2 1.2.9-9.bug13439173 on RHEL 4 Kernel 2.6.9-89 # modinfo ocfs2 filename: /lib/modules/2.6.9-89.0.26.ELsmp/kernel/fs/ocfs2/ocfs2.ko license:GPL author: Oracle version:1.2.9 CF6A7A44EA2581415F3D612 description:OCFS2 1.2.9 Mon Dec 5 14:27:38 EST 2011 (build e5c3135c8cbf75f2620ff4c782d634f1) depends:ocfs2_nodemanager,ocfs2_dlm,jbd,debugfs vermagic: 2.6.9-89.0.26.ELsmp SMP gcc-3.4 # I just have some reservations about whether the following configuration, where I have mount points of different file system types over an initial mount point (/d0) would cause any issues? LUN1LUN2LUN3 LUN4 || | | || | | /d0 (ext3) /d0/app (ext3) /d0/ocfs (ocfs2) /d0/app/html (ocfs2) Many thanks, Neil ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] ls taking ages on a directory containing 900000 files
strace -p PID -ttt -T Attach and get some timings. The simplest guess is that the system lacks memory to cache all the inodes and thus has to hit disk (and more importantly take cluster locks) for the same inode repeatedly. The user guide has a section in NOTES explaining this. On Tue, Dec 4, 2012 at 8:54 AM, Amaury Francois amaury.franc...@digora.comwrote: Hello, ** ** We are running OCFS2 1.8 and on a kernel UEK2. An ls on a directory containing approx. 1 million of files is very long (1H). The features we have activated on the filesystem are the following : ** ** [root@pa-oca-app10 ~]# debugfs.ocfs2 -R stats /dev/sdb1 Revision: 0.90 Mount Count: 0 Max Mount Count: 20 State: 0 Errors: 0 Check Interval: 0 Last Check: Fri Nov 30 19:30:17 2012 Creator OS: 0 Feature Compat: 3 backup-super strict-journal-super Feature Incompat: 32592 sparse extended-slotmap inline-data metaecc xattr indexed-dirs refcount discontig-bg clusterinfo Tunefs Incomplete: 0 Feature RO compat: 1 unwritten Root Blknum: 5 System Dir Blknum: 6 First Cluster Group Blknum: 3 Block Size Bits: 12 Cluster Size Bits: 12 Max Node Slots: 8 Extended Attributes Inline Size: 256 Label: exchange2 UUID: 2375EAF4E4954C4ABB984BDE27AC93D5 Hash: 2880301520 (0xabade9d0) DX Seeds: 1678175851 1096448356 79406012 (0x6406ee6b 0x415a7964 0x04bba3bc) Cluster stack: o2cb Cluster name: appcluster Cluster flags: 1 Globalheartbeat Inode: 2 Mode: 00 Generation: 3567595533 (0xd4a5300d) FS Generation: 3567595533 (0xd4a5300d) CRC32: 0c996202 ECC: 0819 Type: Unknown Attr: 0x0 Flags: Valid System Superblock Dynamic Features: (0x0) User: 0 (root) Group: 0 (root) Size: 0 Links: 0 Clusters: 5242635 ctime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012 atime: 0x0 0x0 -- Thu Jan 1 01:00:00.0 1970 mtime: 0x508eac6b 0x0 -- Mon Oct 29 17:18:51.0 2012 dtime: 0x0 -- Thu Jan 1 01:00:00 1970 Refcount Block: 0 Last Extblk: 0 Orphan Slot: 0 Sub Alloc Slot: Global Sub Alloc Bit: 65535 ** ** ** ** May inline-data or xattr be the source of the 
problem? Thank you. Amaury FRANCOIS • Ingénieur Mobile +33 (0)6 88 12 62 54 amaury.franc...@digora.com Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG Tél : 0 820 200 217 - +33 (0)3 88 10 49 20
Re: [Ocfs2-users] ls taking ages on a directory containing 900000 files
1.5 ms per inode. Times 900K files equals 22 mins. Large dirs are a problem in all file systems; the degree of the problem depends on the overhead. An easy workaround is to shard the files into multilevel dirs, like a 2-level structure of 1000 files in each of 1000 dirs, or a 3-level structure with even fewer files per dir. Or you could use the other approach suggested: avoid stat() by disabling color-ls, or just use plain find. On Tue, Dec 4, 2012 at 3:16 PM, Erik Schwartz schwartz.eri...@gmail.com wrote: Amaury, you can see in the strace output that it's performing a stat on every file. Try simply: $ /bin/ls My guess is you're using a system where ls is aliased to use options that are more expensive. Best regards - Erik On 12/4/12 5:12 PM, Amaury Francois wrote: The strace looks like this (on all files) : 1354662591.755319 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P069_F01589.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001389 1354662591.756775 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P035_F01592.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001532 1354662591.758376 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P085_F01559.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001429 1354662591.759873 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P027_F01569.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001377 1354662591.761317 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P002_F01581.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001420 1354662591.762804 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P050_F01568.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001345 1354662591.764216 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01567.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001541 1354662591.765828 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P010_F01594.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001358 1354662591.767252 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P045_F01569.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001396 1354662591.768715 
lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P036_F01592.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.002072 1354662591.770854 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P089_F01568.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001722 1354662591.772643 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P009_F01600.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001281 1354662591.773992 lstat64(TEW_STRESS_TEST_VM.1K_100P_1F.P022_F01583.txt, {st_mode=S_IFREG|0664, st_size=1000, ...}) = 0 0.001413 We are using a 32-bit architecture; can it be the cause of the kernel not having enough memory? Any possibility to change this behavior? Amaury FRANCOIS • Ingénieur Mobile +33 (0)6 88 12 62 54 amaury.franc...@digora.com Siège Social – 66 rue du Marché Gare – 67200 STRASBOURG Tél : 0 820 200 217 - +33 (0)3 88 10 49 20
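The sharding suggested in this thread — splitting a ~900K-file directory into a multilevel layout so each directory holds at most ~1000 entries — can be sketched as below. This is a hypothetical helper, not part of any OCFS2 tooling; the two-level scheme, the bucket count, and the hash choice are all illustrative:

```python
import hashlib
import os

def shard_path(base, filename, levels=2, buckets=1000):
    """Map a filename to a multilevel subdirectory so no single
    directory accumulates an unbounded number of entries."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    # Derive one bucket index per level from the stable hash,
    # so the same filename always lands in the same subdirectory.
    parts = [
        str(int(digest[i * 8:(i + 1) * 8], 16) % buckets)
        for i in range(levels)
    ]
    return os.path.join(base, *parts, filename)

# Two levels of 1000 buckets cap each leaf directory at roughly
# total_files / 1,000,000 entries, keeping per-directory stat()
# and cluster-lock traffic small.
print(shard_path("/ocfs2/data", "TEW_STRESS_TEST_VM.P069_F01589.txt"))
```

Writers call shard_path() (and os.makedirs() on the parent) instead of dropping everything into one directory; readers recompute the same path from the filename.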
Re: [Ocfs2-users] Huge Problem ocfs2
IO error on channel means the system cannot talk to the block device. The problem is in the block layer. Maybe a loose cable or a setup problem. dmesg should show errors. On Fri, Nov 9, 2012 at 10:46 AM, Laurentiu Gosu l...@easic.ro wrote: Hi, I've been using an ocfs2 cluster in a production environment for almost 1 year. During this time I had to run fsck.ocfs2 a few months ago due to some errors, but they were fixed. Now I have a big problem: I'm not able to mount the volume on any of the nodes. I stopped all nodes except one. Some output below: mount /mnt/ocfs2 mount.ocfs2: I/O error on channel while trying to determine heartbeat information fsck.ocfs2 /dev/mapper/volgr1-lvol0 fsck.ocfs2 1.6.3 fsck.ocfs2: I/O error on channel while initializing the DLM fsck.ocfs2 -n /dev/mapper/volgr1-lvol0 fsck.ocfs2 1.6.3 Checking OCFS2 filesystem in /dev/mapper/volgr1-lvol0: Label: SAN UUID: B4CF8D4667AF43118F3324567B90A987 Number of blocks: 2901788672 Block size: 4096 Number of clusters: 45340448 Cluster size: 262144 Number of slots: 10 journal recovery: I/O error on channel while looking up the journal inode for slot 0 fsck encountered unrecoverable errors while replaying the journals and will not continue Can you give me some hints on how to debug the problem? Thank you, Laurentiu.
Re: [Ocfs2-users] Huge Problem ocfs2
If the global bitmap is gone, then the fs is unusable. But you can extract data using the rdump command in debugfs.ocfs2. Success depends on how much of the device is still usable. On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban mar...@easic.ro wrote: I tried hacking the fsck.ocfs2 source code by not considering the metaecc flag. Then I ran into: journal recovery: Bad magic number in inode while looking up the journal inode for slot 0 fsck encountered unrecoverable errors while replaying the journals and will not continue After bypassing the journal replay function, I got: Pass 0a: Checking cluster allocation chains pass0: Bad magic number in inode while looking up the global bitmap inode fsck.ocfs2: Bad magic number in inode while performing pass 0 Does it mean the filesystem is destroyed completely? On 10.11.2012 02:54, Marian Serban wrote: That's the kernel: Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET 2011 x86_64 x86_64 x86_64 GNU/Linux Anyway, I tried disabling the metaecc feature, no luck.
[root@ro02xsrv003 ~]# tunefs.ocfs2 --fs-features=nometaecc /dev/mapper/volgr1-lvol0 tunefs.ocfs2: I/O error on channel while opening device /dev/mapper/volgr1-lvol0 These are the last lines of strace corresponding to the tunefs.ocfs command: open(/sys/fs/ocfs2/cluster_stack, O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f54aad05000 read(4, o2cb\n, 4096) = 5 close(4)= 0 munmap(0x7f54aad05000, 4096)= 0 open(/sys/fs/o2cb/interface_revision, O_RDONLY) = 4 read(4, 5\n, 15) = 2 read(4, , 13) = 0 close(4)= 0 stat(/sys/kernel/config, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 statfs(/sys/kernel/config, {f_type=0x62656570, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0 open(/dev/mapper/volgr1-lvol0, O_RDONLY) = 4 ioctl(4, BLKSSZGET, 0x7fffce711454) = 0 close(4)= 0 pread(3, \0\0\v\25\37\1\200\200\202@\21\2\30\26\0\0\0,\17\272\241\4\340\210\311\377\17\300\327\332\373\17..., 4096, 532480) = 4096 close(3)= 0 write(2, tunefs.ocfs2, 12tunefs.ocfs2)= 12 write(2, : , 2: ) = 2 write(2, I/O error on channel, 20I/O error on channel)= 20 write(2, , 1 )= 1 write(2, while opening device \/dev/mappe..., 47while opening device /dev/mapper/volgr1-lvol0) = 47 write(2, \r\n, 2 On 10.11.2012 02:06, Sunil Mushran wrote: It's either that or a check sum problem. Disable metaecc. Not sure which kernel you are running. We had fixed few problems few years ago around this. If your kernel is older, then it could be a known issue. On Fri, Nov 9, 2012 at 12:50 PM, Marian Serban mar...@easic.ro wrote: Hi Sunil, Thank you for answering. Unfortunately, it doesn't seem like it's a hardware problem. There's no way a cable can be loose because it's iSCSI over 1G Ethernet (copper wires) environment. Also I performed dd if=/dev/ of=/dev/null and first 16GB or so are fine. Dmesg shows no errors. 
Also tried with debugfs.ocfs2: [root@ro02xsrv003 ~]# debugfs.ocfs2 /dev/mapper/volgr1-lvol0 debugfs.ocfs2 1.6.3 debugfs: ls ls: Bad magic number in inode '.' debugfs: slotmap slotmap: Bad magic number in inode while reading slotmap system file debugfs: stats Revision: 0.90 Mount Count: 0 Max Mount Count: 20 State: 0 Errors: 0 Check Interval: 0 Last Check: Fri Nov 9 14:35:53 2012 Creator OS: 0 Feature Compat: 3 backup-super strict-journal-super Feature Incompat: 16208 sparse extended-slotmap inline-data metaecc xattr indexed-dirs refcount discontig-bg Tunefs Incomplete: 0 Feature RO compat: 7 unwritten usrquota grpquota Root Blknum: 129 System Dir Blknum: 130 First Cluster Group Blknum: 64 Block Size Bits: 12 Cluster Size Bits: 18 Max Node Slots: 10 Extended Attributes Inline Size: 256 Label: SAN UUID: B4CF8D4667AF43118F3324567B90A987 Hash: 3698209293 (0xdc6e320d) DX Seed[0]: 0x9f4a2bb7 DX Seed[1]: 0x501ddac0 DX Seed[2]: 0x6034bfe8 Cluster stack: classic o2cb Inode: 2 Mode: 00 Generation: 1093568923 (0x412e899b) FS Generation: 1093568923 (0x412e899b) CRC32: 46f2d360 ECC: 04d4 Type: Unknown Attr: 0x0 Flags: Valid System Superblock Dynamic Features: (0x0) User: 0 (root) Group: 0 (root) Size: 0 Links: 0 Clusters: 45340448 ctime: 0x4ee67f67 -- Tue Dec 13 00:25:43 2011 atime: 0x0 -- Thu Jan 1 02:00:00 1970 mtime
Re: [Ocfs2-users] Huge Problem ocfs2
Yes that should be enough for that. But that won't help if the real problem is device related. What does debugfs.ocfs2 -R ls -l / return? If that errors, means the root dir is gone. Maybe best to look into your backups. On Fri, Nov 9, 2012 at 6:01 PM, Marian Serban mar...@easic.ro wrote: Nope, rdump doesn't work either. debugfs: rdump -v / /tmp Copying to /tmp/ rdump: Bad magic number in inode while reading inode 129 rdump: Bad magic number in inode while recursively dumping inode 129 Could you please confirm that it's enough to just force the return value of 0 at ocfs2_validate_meta_ecc in order to bypass the ECC checks? On 10.11.2012 03:55, Sunil Mushran wrote: If global bitmap is gone. then the fs is unusable. But you can extract data using the rdump command in debugfs.ocfs. The success depends on how much of the device is still usable. On Fri, Nov 9, 2012 at 5:50 PM, Marian Serban mar...@easic.ro wrote: I tried hacking the fsck.ocfs2 source code by not considering metaecc flag. Then I ran into journal recovery: Bad magic number in inode while looking up the journal inode for slot 0 fsck encountered unrecoverable errors while replaying the journals and will not continue After bypassing journal replay function, I got Pass 0a: Checking cluster allocation chains pass0: Bad magic number in inode while looking up the global bitmap inode fsck.ocfs2: Bad magic number in inode while performing pass 0 Does it mean the filesystem is destroyed completely? On 10.11.2012 02:54, Marian Serban wrote: That's the kernel: Linux ro02xsrv003.bv.easic.ro 2.6.39.4 #6 SMP Mon Dec 12 12:09:49 EET 2011 x86_64 x86_64 x86_64 GNU/Linux Anyway, I tried disabling the metaecc feature, no luck. 
Re: [Ocfs2-users] HA-OCFS2?
A cluster fs != storage. You need to get highly available storage that is concurrently accessible from multiple nodes. ocfs2 will then allow multiple nodes to concurrently access that storage, with POSIX semantics. If a node dies, the remaining nodes pause to recover and then continue functioning. The dead node can then restart and rejoin the cluster.

On Thu, Sep 13, 2012 at 5:02 PM, Eric epretori...@yahoo.com wrote:
Is it possible to create a highly-available OCFS2 cluster (i.e., a storage cluster that mitigates the single point of failure [SPoF] created by storing an OCFS2 volume on a single LUN)? The OCFS2 project page makes this claim: "OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both *high performance* and *high availability*" ...but without backing up the claim of high-availability storage (at either the HDD or the node level). I've found a couple of articles hinting at using Linux multipathing or DRBD, but very little detailed information about either.
TIA,
Eric Pretorious
Truckee, CA

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users
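The DRBD route hinted at above would replicate the backing LUN between two nodes so neither disk is a SPoF, with OCFS2 on top of the replicated device. The sketch below is a rough illustration only, not a tested configuration: the resource name, device paths, and IP addresses are invented, and the exact `allow-two-primaries` syntax varies between DRBD versions (this is the 8.x style).

```shell
# Sketch of a dual-primary DRBD resource that two nodes could both
# mount with ocfs2. All names, devices and IPs below are invented.
cat > /tmp/r0.res <<'EOF'
resource r0 {
    protocol C;                 # synchronous replication, required for dual-primary
    net { allow-two-primaries; }
    on node1 { device /dev/drbd0; disk /dev/sdb1; address 10.0.0.1:7788; meta-disk internal; }
    on node2 { device /dev/drbd0; disk /dev/sdb1; address 10.0.0.2:7788; meta-disk internal; }
}
EOF
# Then, roughly: "drbdadm up r0" on both nodes, "drbdadm primary r0" on
# both, mkfs.ocfs2 on /dev/drbd0 once, and mount /dev/drbd0 on each node.
grep -c 'allow-two-primaries' /tmp/r0.res
```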
Re: [Ocfs2-users] Ocfs2-users Digest, Vol 105, Issue 4
On Wed, Sep 12, 2012 at 9:45 AM, Asanka Gunasekera asanka_gunasek...@yahoo.co.uk wrote:

Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]:
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: Failed
Cluster ocfs2 created
Node ocfsn1 added
o2cb_ctl: Internal logic failure while adding node ocfsn2
Stopping O2CB cluster ocfs2: OK

Something is wrong with your cluster.conf. Overlapping node numbers, maybe.

> and in the messages I get, from time to time, the below, and I saw in a post that I can ignore this:
> modprobe: FATAL: Module ocfs2_stackglue not found.

Yes, this is harmless.
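Since overlapping node numbers are a common cluster.conf mistake, a quick sanity check can flag them before o2cb_ctl fails. This is an illustrative helper, not an official tool; the path below is a temp copy, not /etc/ocfs2/cluster.conf:

```shell
# Flag duplicate "number = N" entries in an o2cb cluster.conf.
check_node_numbers() {
    awk -F= '/^[[:space:]]*number[[:space:]]*=/ { gsub(/ /, "", $2); print $2 }' "$1" \
        | sort | uniq -d
}

# Demo on a deliberately broken config (two nodes claim number 0):
cat > /tmp/cluster.conf <<'EOF'
node:
    number = 0
    name = ocfsn1
node:
    number = 0
    name = ocfsn2
EOF
check_node_numbers /tmp/cluster.conf   # prints each duplicated node number
```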
Re: [Ocfs2-users] test inode bit failed -5
nfsd encountered an error reading the device. So something in the I/O path below the fs encountered a problem. If it just happened once, then you can ignore it.

On Fri, Aug 31, 2012 at 2:23 AM, Hideyasu Kojima hid.koj...@ms.scsk.jp wrote:
Hi,
I am using an ocfs2 cluster as an NFS server. Just once, I got the error below, along with a write error on an NFS client. What happened?

kernel: (nfsd,12870,0):ocfs2_get_suballoc_slot_bit:2096 ERROR: read block 24993224 failed -5
kernel: (nfsd,12870,0):ocfs2_test_inode_bit:2207 ERROR: get alloc slot and bit failed -5
kernel: (nfsd,12870,0):ocfs2_get_dentry:96 ERROR: test inode bit failed -5

I currently use kernel 2.6.18-164.el5, OCFS2 1.4.7, ocfs2-tools 1.4.4.
Thanks.
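The "-5" in these log lines is a negated kernel errno (-EIO), which is why the reply points at the I/O path. A tiny lookup table covering the codes that recur in the ocfs2 logs quoted on this list (illustrative only; the authoritative list is errno.h):

```shell
# Map the negative status codes seen in ocfs2/o2dlm log lines to errno
# names. Covers only the handful of codes quoted in these threads.
ocfs2_errno() {
    case "${1#-}" in
        5)   echo EIO ;;       # I/O error
        17)  echo EEXIST ;;    # file exists
        107) echo ENOTCONN ;;  # transport endpoint is not connected
        *)   echo "unknown ($1)" ;;
    esac
}

ocfs2_errno -5
ocfs2_errno -107
```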
Re: [Ocfs2-users] Issue with files and folder ownership
I would recommend pacemaker if the distribution you are using has all the bits. Manual building gets messy. SUSE-based distros have all the bits required for ocfs2+pacemaker.

On Tue, Aug 28, 2012 at 10:40 PM, Emilien Macchi emilien.mac...@stackops.com wrote:
Hi,
On Wed, Aug 29, 2012 at 7:25 AM, Sunil Mushran sunil.mush...@gmail.com wrote:
> Isn't the mount point local to the machine?
I use iSCSI for the block device and I mount the device (/dev/sdc1) at /var/lib/nova/instances. I've formatted /dev/sdc1 as an OCFS2 FS. Should I use Pacemaker to manage OCFS2?
Thanks,
-Emilien

On Tue, Aug 28, 2012 at 10:14 PM, Emilien Macchi emilien.mac...@stackops.com wrote:
Hi,
On Wed, Aug 29, 2012 at 12:36 AM, Sunil Mushran sunil.mush...@gmail.com wrote:
> Permissions on the mount point should be local to a machine. AFAIK.
That's surprising if you consider that this is a cluster FS which respects POSIX rules.
-Emilien

On Mon, Aug 27, 2012 at 3:08 AM, Emilien Macchi emilien.mac...@stackops.com wrote:
Hi,
I'm working on a two-node cluster with the goal of storing virtual machines managed by OpenStack services and the KVM hypervisor. I also use iSCSI multipathing for the block device. My cluster is running and I can mount the device (/dev/sdd1). I'm having some problems with POSIX rights:
- chmod on a file or folder works.
- chown on a file or folder does not work as I want: I'm trying to change the ownership of /var/lib/nova/instances, which is my mount point, but when I do that, the ownership change is not applied on the second node.
I can't yet use OpenStack + KVM because the mount point should have the nova user as POSIX owner.
Here is my cluster.conf: http://paste.openstack.org/show/oPQR5pjZETz7xSAR04so/
And my mount point:
/dev/sdd1 on /var/lib/nova/instances type ocfs2 (rw,_netdev,heartbeat=local)
In advance, thank you for your help.
Best regards
--
Emilien Macchi
System Engineer
www.stackops.com | emilien.mac...@stackops.com | skype:emilien.macchi
Re: [Ocfs2-users] Issue with OCFS2 mount
Forgot to add that this issue is limited to metaecc. So you could avoid the issue in your same setup by not enabling metaecc on the volume. And last I checked, mkfs did not enable it by default.

On Mon, Aug 27, 2012 at 10:35 AM, Sunil Mushran sunil.mush...@gmail.com wrote:
So you are running into a bug that has been fixed in 2.6.36. Upgrade to that version, if not something more current.

$ git describe --tags 13ceef09
v2.6.35-rc3-14-g13ceef0

commit 13ceef099edd2b70c5a6f3a9ef5d6d97cda2e096
Author: Jan Kara j...@suse.cz
Date: Wed Jul 14 07:56:33 2010 +0200

    jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions

    OCFS2 uses the t_commit trigger to compute and store a checksum of the
    just-committed blocks. When a buffer has b_frozen_data, the checksum is
    computed for it instead of b_data, but this can result in an old checksum
    being written to the filesystem in the following scenario:

    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh) - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh) - This copies off b_frozen_data to make
       it safe for transaction1 to commit. jh->b_next_transaction is set to
       transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed - There was no dirty call for the bh on handle2,
        so it is never queued for any more journal operations
    13) Checkpointing finally happens, and it just spools the bh via normal
        buffer writeback. This will write b_data, which was never triggered
        on and thus contains a wrong (old) checksum.
    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.

    Signed-off-by: Jan Kara j...@suse.cz
    Signed-off-by: Mark Fasheh mfas...@suse.com
    Signed-off-by: Joel Becker joel.bec...@oracle.com

On Mon, Aug 27, 2012 at 5:10 AM, Rory Kilkenny rory.kilke...@ticoon.com wrote:

# uname -a
Linux FILEt1 2.6.34.7-0.7-desktop #1 SMP PREEMPT 2010-12-13 11:13:53 +0100 x86_64 x86_64 x86_64 GNU/Linux

# modinfo ocfs2
filename:    /lib/modules/2.6.34.7-0.7-desktop/kernel/fs/ocfs2/ocfs2.ko
license:     GPL
author:      Oracle
version:     1.5.0
description: OCFS2 1.5.0
srcversion:  B13569B35F99D43FA80D129
depends:     jbd2,ocfs2_stackglue,quota_tree,ocfs2_nodemanager
vermagic:    2.6.34.7-0.7-desktop SMP preempt mod_unload modversions

# mkfs.ocfs2 --version
mkfs.ocfs2 1.4.3

On 12-08-24 5:44 PM, Sunil Mushran sunil.mush...@gmail.com wrote:
What are the versions of the kernel, ocfs2 and the ocfs2 tools?
uname -a
modinfo ocfs2
mkfs.ocfs2 --version

On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny rory.kilke...@ticoon.com wrote:
We have an HP P2000 G3 storage array, fiber connected. The storage array has a RAID5 array broken into 2 physical OCFS2 volumes (A & B). A & B are both mounted and formatted as NTFS. One of the volumes is NFS mounted. Every couple of months or so we start getting tons of errors on the NFS-mounted volume:

Aug 24 09:48:13 FILEt2 kernel: [2234285.848940] (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 0, computed 1467126086. Applying ECC.
Aug 24 09:48:13 FILEt2 kernel: [2234285.849252] (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 0, computed 3828104806
Aug 24 09:48:13 FILEt2 kernel: [2234285.849256] (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 1169089
Aug 24 09:48:13 FILEt2 kernel: [2234285.849261] (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849264] (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849267] (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849270] (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849274] (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849280] (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849284] (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849287] (ocfs2_wq,13844,7):ocfs2_delete_inode:1067 ERROR: status = -5
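The `git describe --tags <commit>` invocation quoted earlier in this thread reports the nearest tag at or before a commit, which is how you can tell which kernel release first carries a fix. A self-contained demo on a throwaway repo (the tag name and commit messages are invented; v1.0 plays the role of a kernel release tag and the second commit the role of the fix):

```shell
# Demo of `git describe --tags`: one commit after tag v1.0 yields
# "v1.0-1-g<abbrev>", analogous to v2.6.35-rc3-14-g13ceef0 above.
cd "$(mktemp -d)"
git init -q .
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "release"
git tag v1.0
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m "fix"
git describe --tags
```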
Re: [Ocfs2-users] Issue with OCFS2 mount
What are the versions of the kernel, ocfs2 and the ocfs2 tools?
uname -a
modinfo ocfs2
mkfs.ocfs2 --version

On Fri, Aug 24, 2012 at 1:09 PM, Rory Kilkenny rory.kilke...@ticoon.com wrote:
We have an HP P2000 G3 storage array, fiber connected. The storage array has a RAID5 array broken into 2 physical OCFS2 volumes (A & B). A & B are both mounted and formatted as NTFS. One of the volumes is NFS mounted. Every couple of months or so we start getting tons of errors on the NFS-mounted volume:

Aug 24 09:48:13 FILEt2 kernel: [2234285.848940] (ocfs2_wq,13844,7):ocfs2_block_check_validate:443 ERROR: CRC32 failed: stored: 0, computed 1467126086. Applying ECC.
Aug 24 09:48:13 FILEt2 kernel: [2234285.849252] (ocfs2_wq,13844,7):ocfs2_block_check_validate:457 ERROR: Fixed CRC32 failed: stored: 0, computed 3828104806
Aug 24 09:48:13 FILEt2 kernel: [2234285.849256] (ocfs2_wq,13844,7):ocfs2_validate_extent_block:903 ERROR: Checksum failed for extent block 1169089
Aug 24 09:48:13 FILEt2 kernel: [2234285.849261] (ocfs2_wq,13844,7):__ocfs2_find_path:1861 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849264] (ocfs2_wq,13844,7):ocfs2_find_leaf:1958 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849267] (ocfs2_wq,13844,7):ocfs2_find_new_last_ext_blk:6655 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849270] (ocfs2_wq,13844,7):ocfs2_do_truncate:6900 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849274] (ocfs2_wq,13844,7):ocfs2_commit_truncate:7556 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849280] (ocfs2_wq,13844,7):ocfs2_truncate_for_delete:593 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849284] (ocfs2_wq,13844,7):ocfs2_wipe_inode:769 ERROR: status = -5
Aug 24 09:48:13 FILEt2 kernel: [2234285.849287] (ocfs2_wq,13844,7):ocfs2_delete_inode:1067 ERROR: status = -5

If we pull all the data off, destroy the volume, rebuild it, and copy our data back, all works fine; for a while.
This issue does not happen on the non-NFS-mounted volume. I am currently assuming the issue is with NFS and how we have it configured (which, to the best of my knowledge, is the default). Has anyone had a similar experience and been able to share some insight or knowledge on any tricks with NFS and OCFS2 volumes? Thanks in advance.
Re: [Ocfs2-users] OCFS2 and util_file
You are probably mounting the volume with the datavolume option. Instead, use the init.ora parameter filesystemio_options to force odirect, and mount the volume without the datavolume option. This is documented in the user's guide.

On Thu, Aug 23, 2012 at 8:14 AM, Maki, Nancy nancy.m...@suny.edu wrote:
We are getting an error ORA-29284 when using utl_file.get_line to read an OCFS2 file larger than 3896 characters. Has anyone encountered this before? We are on OCFS2 2.6 running on OEL 5.6.
Thanks,
Nancy

Nancy Maki
Manager of Database Services
Office of Information Technology
The State University of New York
State University Plaza - Albany, New York 12246
Tel: 518.320.1213  Fax: 518.320.1550
eMail: nancy.m...@suny.edu
Re: [Ocfs2-users] OCFS2 and util_file
On Thu, Aug 23, 2012 at 10:58 AM, Maki, Nancy nancy.m...@suny.edu wrote:
> By default we mount all our OCFS2 volumes with datavolume. To be more specific, the volume that we are having the issue with is not a database volume but a shared drive for developers to read and write other types of files. Would it be appropriate to remove the datavolume mount option from this particular volume only and leave it on our database volumes?

Yes. datavolume was only meant for db volumes. Other volumes have never needed it.
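The advice above splits into two pieces: direct I/O moves into the database configuration, and the shared (non-database) volume loses the datavolume mount option. A minimal sketch of both fragments; the file names, device, and mount point are invented examples, not values from this thread:

```shell
# Sketch: an init.ora fragment forcing O_DIRECT from the database side,
# plus an fstab line for a non-database ocfs2 volume without datavolume.
cat > /tmp/init.ora.fragment <<'EOF'
# force direct I/O from the database instead of the datavolume mount option
filesystemio_options = directio
EOF

echo '/dev/sdc1  /shared  ocfs2  _netdev  0 0' > /tmp/fstab.fragment

grep filesystemio_options /tmp/init.ora.fragment
```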
Re: [Ocfs2-users] null pointer dereference
You may want to run a full fsck on the fs: fsck.ocfs2 -fy /dev/

On Tue, Aug 21, 2012 at 12:49 AM, Pawel pzl...@mp.pl wrote:
Hi,
After upgrading ocfs2, my cluster is unstable. At least once per week I see:

kernel panic: NULL pointer dereference at 00048
o2dlm_blocking_ast_wrapper + 0x8/0x20 [ocfs2_stack_o2cb]
stack:
 dlm_do_local_bast [ocfs2_dlm]
 dlm_lookup_lockers [ocfs2_dlm]
 dlm_proxy_ast_handler
 add_timer
 ..

After that, a deadlock sometimes happens on other nodes. Restarting the entire cluster solves the issue. I see in the log:

(dlm_thread,7227,3):dlm_send_proxy_ast_msg:484 ERROR: ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9, error -107 send AST to node 4
(dlm_thread,7227,3):dlm_flush_asts:605 ERROR: status = -107
o2net: No connection established with node 4 after 10.0 seconds, giving up.
o2net: No connection established with node 4 after 10.0 seconds, giving up.
o2net: No connection established with node 4 after 10.0 seconds, giving up.
(dlm_thread,7227,4):dlm_send_proxy_ast_msg:484 ERROR: ECB9442E19A94EAC896641BFADD55E4B: res M0001f411c9, error -107 send AST to node 4
(dlm_thread,7227,4):dlm_flush_asts:605 ERROR: status = -107
o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
o2cb: o2dlm has evicted node 4 from domain ECB9442E19A94EAC896641BFADD55E4B
o2dlm: Begin recovery on domain ECB9442E19A94EAC896641BFADD55E4B for node 4
o2dlm: Node 5 (he) is the Recovery Master for the dead node 4 in domain ECB9442E19A94EAC896641BFADD55E4B
o2dlm: End recovery on domain ECB9442E19A94EAC896641BFADD55E4B

Additionally, ~4 times per day I see:

ocfs2_check_dir_for_entry:2119 ERROR: status = -17
ocfs2_mknod:459 ERROR: status = -17
ocfs2_create:629 ERROR: status = -17

I currently use kernel 3.4.2. My filesystem was created with:
-N 8 -b 4096 -C 32768 --fs-features backup-super,strict-journal-super,sparse,extended-slotmap,inline-data,metaecc,xattr,indexed-dirs,refcount,discontig-bg,unwritten,usrquota,grpquota

Could you tell me what could make my system unstable? Which feature?
Thanks for any help,
Pawel
Re: [Ocfs2-users] ocfs2 problem journal size
The 4 journal inodes got zeroed out. Do you know how/why? Have you tried running fsck with -fy (to enable writes)? fsck.ocfs2 does have a check for bad journals that it will regenerate:

JOURNAL_FILE_INVALID
OCFS2 uses JBD for journalling and some journal files exist in the system directory. Fsck has found some journal files that are invalid. Answering yes to this question will regenerate the invalid journal files.

But that may still not work, as fsck is currently bailing out during journal recovery, which happens much earlier on. Try with -fy. If that does not work, we'll have to reconstruct empty inodes as placeholders to allow fsck to complete journal recovery, followed by journal recreation.

On Wed, Aug 1, 2012 at 6:41 PM, Christophe BOUDER christophe.bou...@lip6.fr wrote:
Hello,
I use ocfs2 1.6.3, kernel 3.4.4, on Debian testing. I had a problem on my Infortrend device: a media error on a disk. The result is that I can't mount my ocfs2 filesystem, but I can read the files with debugfs.ocfs2. My question is: can I recover or recreate the journals for slots 8, 9, 10 and 11?
Thanks for your help. Here are some logs:

# mount /data
mount.ocfs2: Internal logic failure while trying to join the group

# fsck.ocfs2 -n /dev/sdc1
fsck.ocfs2 1.6.3
Checking OCFS2 filesystem in /dev/sdc1:
  Label:              data
  UUID:               9B655B51E6874480BBC1309DCA048A39
  Number of blocks:   4027690992
  Block size:         4096
  Number of clusters: 251730687
  Cluster size:       65536
  Number of slots:    32
journal recovery: I/O error on channel while reading cached inode 112 for slot 8's journal
fsck encountered unrecoverable errors while replaying the journals and will not continue

# echo "ls -l //" | debugfs.ocfs2 /dev/sdc1 | grep journal
debugfs.ocfs2 1.6.3
55   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0000
56   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0001
57   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0002
58   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:30  journal:0003
59   -rw-r--r--  1  0  0  268435456  23-Jun-2007 21:31  journal:0004
79   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0005
80   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0006
81   -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:45  journal:0007
112  --          0  0  0  0           1-Jan-1970 01:00  journal:0008
113  --          0  0  0  0           1-Jan-1970 01:00  journal:0009
114  --          0  0  0  0           1-Jan-1970 01:00  journal:0010
115  --          0  0  0  0           1-Jan-1970 01:00  journal:0011
116  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:46  journal:0012
117  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0013
118  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0014
119  -rw-r--r--  1  0  0  268435456  31-Aug-2007 00:47  journal:0015
142  -rw-r--r--  1  0  0  268435456  29-May-2009 22:53  journal:0016
143  -rw-r--r--  1  0  0  268435456  29-May-2009 22:54  journal:0017
166  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:36  journal:0018
167  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:36  journal:0019
168  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:37  journal:0020
169  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:37  journal:0021
170  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:38  journal:0022
171  -rw-r--r--  1  0  0  268435456  31-Jan-2010 15:38  journal:0023
208  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:35  journal:0024
209  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:35  journal:0025
210  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0026
211  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0027
212  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0028
213  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:36  journal:0029
214  -rw-r--r--  1  0  0  268435456  21-Nov-2010 19:37  journal:0030
215  -rw-r--r--  1  0  0  268435456
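In a listing like the one above, the wiped journals stand out because their size column is 0. Picking them out mechanically from a saved listing (illustrative helper over an abridged sample, not an official tool):

```shell
# Pick out journal system files whose size is 0 (i.e. zeroed inodes)
# from a saved "ls -l //" debugfs.ocfs2 listing.
cat > /tmp/journals.txt <<'EOF'
81 -rw-r--r-- 1 0 0 268435456 31-Aug-2007 00:45 journal:0007
112 -- 0 0 0 0 1-Jan-1970 01:00 journal:0008
113 -- 0 0 0 0 1-Jan-1970 01:00 journal:0009
116 -rw-r--r-- 1 0 0 268435456 31-Aug-2007 00:46 journal:0012
EOF
# column 6 is the file size; print block number and journal name
awk '/journal:/ && $6 == 0 { print $1, $NF }' /tmp/journals.txt
```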
Re: [Ocfs2-users] ocfs2 problem journal size
oh crap. The dlm lock needs to lock the journals. So you need to recreate the journal inodes with i_size 0. dd a good journal inode and edit it using a binary editor. Change the inode number to the block number, and zero out the i_size and next_free_extent. Repeat for the 4 inodes. Hopefully someone on the list has the time to help you further.

On Thu, Aug 2, 2012 at 10:50 AM, Christophe BOUDER christophe.bou...@lip6.fr wrote:
hello,
> The 4 journal inodes got zeroed out. Do you know how/why?
raid6 with 2 bad disks, and a third that had a problem. I reinserted it in the device and it appeared good, but it then also crashed the device, which was no longer recognized by the system.
> Have you tried running fsck with -fy (enable writes).
Yes, but without success:
# fsck.ocfs2 -fy /dev/sdc1
fsck.ocfs2 1.6.3
fsck.ocfs2: Internal logic failure while initializing the DLM
> Try with -fy. If that does not work, we'll have to reconstruct empty inodes as placeholders to allow fsck to complete journal recovery followed by journal recreation.
OK, how can I do that?
--
Christophe Bouder,
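The dd-a-block-and-patch-it approach can be rehearsed safely on an image file before touching the real device. This is a sketch on a scratch file only: /tmp/disk.img stands in for /dev/sdc1, the block size of 4096 matches the fs above, and the patched byte is a placeholder for the real inode fields (inode number, i_size, next_free_extent) you would edit with a binary editor.

```shell
# Rehearsal: copy one 4096-byte "inode block" out of an image, patch a
# byte, and write it into another block's slot.
dd if=/dev/zero of=/tmp/disk.img bs=4096 count=4 2>/dev/null
printf 'GOODINODE' | dd of=/tmp/disk.img bs=4096 seek=1 conv=notrunc 2>/dev/null

# copy block 1 (the "good" journal inode) out...
dd if=/tmp/disk.img of=/tmp/inode.blk bs=4096 skip=1 count=1 2>/dev/null
# ...patch it (here: overwrite the first byte; on a real inode you would
# edit the inode number, i_size and next_free_extent fields)...
printf 'X' | dd of=/tmp/inode.blk bs=1 conv=notrunc 2>/dev/null
# ...and write it into block 2's slot:
dd if=/tmp/inode.blk of=/tmp/disk.img bs=4096 seek=2 conv=notrunc 2>/dev/null
dd if=/tmp/disk.img bs=4096 skip=2 count=1 2>/dev/null | head -c 9
```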
Re: [Ocfs2-users] ocfs2-tools git: broken after commit deb5ade9145f8809f1fde19cf53bdfdf1fb7963e
On Thu, Jul 26, 2012 at 6:37 AM, Dzianis Kahanovich maha...@bspu.unibel.by wrote:
ocfs2-tools git wrong commit: deb5ade9145f8809f1fde19cf53bdfdf1fb7963e. After cleanup of an unused variable:

-else
-	tmp = g_list_append(elem, cfs);

o2cb_ctl starts to ignore 1 node. Good commit must be:

 else
-	tmp = g_list_append(elem, cfs);
+	g_list_append(elem, cfs);

Attached patch.

Thanks.
Acked-by: Sunil Mushran sunil.mush...@gmail.com
Re: [Ocfs2-users] Removing a node from cluster.conf (on a specific node)
Online add/remove of nodes and of global heartbeat devices has been in mainline for over a year. I think 2.6.38+ and tools 1.8. The ocfs2-tools tree hosted on oss.oracle.com/git has a 1.8.2 tag that can be used safely. It has been fully tested. The user's guide has been moved to man pages bundled with the tools. Do "man ocfs2" after building and installing the tools.

On Apr 29, 2012, at 1:21 PM, Sébastien Riccio s...@swisscenter.com wrote:
Hi dear list,
I think the subject might already have been discussed, but I can only find old threads about removing a node from the cluster. I was hoping that in 2012 it would be possible to dynamically add/remove nodes from a shared filesystem, but this evening I had this problem: I wanted to add a node to our ocfs2 cluster, named xen-blade11, with ip 10.111.10.111. So on every other node I ran this command:

o2cb_ctl -C -i -n xen-blade11 -t node -a number=5 -a ip_address=10.111.10.111 -a ip_port= -a cluster=ocfs2

which successfully added the node on every cluster node, except on xen-blade16. On every node the original cluster.conf was:

node:
    ip_port =
    ip_address = 10.111.10.116
    number = 0
    name = xen-blade16
    cluster = ocfs2

node:
    ip_port =
    ip_address = 10.111.10.115
    number = 1
    name = xen-blade15
    cluster = ocfs2

node:
    ip_port =
    ip_address = 10.111.10.114
    number = 2
    name = xen-blade14
    cluster = ocfs2

node:
    ip_port =
    ip_address = 10.111.10.113
    number = 3
    name = xen-blade13
    cluster = ocfs2

node:
    ip_port =
    ip_address = 10.111.10.112
    number = 4
    name = xen-blade12
    cluster = ocfs2

cluster:
    node_count = 5
    name = ocfs2

After adding the node, in every cluster.conf I can see that this was added:

node:
    ip_port =
    ip_address = 10.111.10.111
    number = 5
    name = xen-blade11
    cluster = ocfs2

cluster:
    node_count = 6
    name = ocfs2

EXCEPT on xen-blade16, where it was added like this:

node:
    ip_port =
    ip_address = 10.111.10.111
    number = 6
    name = xen-blade11
    cluster = ocfs2

cluster:
    node_count = 6
    name = ocfs2

(Notice the "number = 6" instead of "number = 5".) So now when I'm
trying to connect the xen-blade11, every host accepts the connection except xen-blade16, and the cluster join is rejected, as we can see in the kernel messages on xen-blade11:

[ 1852.729539] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1852.729892] o2net: Connected to node xen-blade12 (num 4) at 10.111.10.112:
[ 1852.737122] o2net: Connected to node xen-blade14 (num 2) at 10.111.10.114:
[ 1852.741408] o2net: Connected to node xen-blade15 (num 1) at 10.111.10.115:
[ 1854.733759] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1856.737129] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1856.764520] OCFS2 1.5.0
[ 1858.740877] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1860.744847] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1862.748919] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1864.752929] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1866.756825] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1868.760809] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1870.764937] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1872.768905] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1874.772947] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1876.776928] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1878.780828] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1880.784974] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116: shutdown, state 7
[ 1882.784529] o2net: No connection established with node 0 after 30.0 seconds, giving up.
[ 1912.864531] o2net: No connection established with node 0 after 30.0 seconds, giving up.
[ 1917.028531] o2cb: This node could not connect to nodes: 0.
[ 1917.028684] o2cb: Cluster check failed. Fix errors before retrying.
[ 1917.028758] (mount.ocfs2,4238,4):ocfs2_dlm_init:3001 ERROR: status = -107
[ 1917.028880] (mount.ocfs2,4238,4):ocfs2_mount_volume:1879 ERROR:
Re: [Ocfs2-users] Permission denied on ocfs2 cluster
Could be selinux related. I mean it is a permission issue. So you have to look at all the security regimes. rwx, posix acl, selinux, etc. On Mar 16, 2012, at 8:00 AM, зоррыч zo...@megatrone.ru wrote: Any idea? -Original Message- From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of зоррыч Sent: Thursday, March 15, 2012 11:26 PM To: 'Sunil Mushran' Cc: ocfs2-users@oss.oracle.com Subject: Re: [Ocfs2-users] Permission denied on ocfs2 cluster [root@noc-1-synt /]# ls -lh | grep ocfs drwxr-xr-x. 3 root root 3.9K Mar 15 02:20 ocfs [root@noc-1-synt /]# chmod -R gou+rwx ./ocfs/ [root@noc-1-synt /]# ls -lh | grep ocfs drwxrwxrwx. 3 root root 3.9K Mar 15 02:20 ocfs [root@noc-1-synt /]# cd ./ocfs/ [root@noc-1-synt ocfs]# mkdir 1233 mkdir: cannot create directory `1233': Permission denied [root@noc-1-synt ocfs]# Strace: [root@noc-1-synt ocfs]# strace mkdir 1233 execve(/bin/mkdir, [mkdir, 1233], [/* 28 vars */]) = 0 brk(0) = 0x2132000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67514000 access(/etc/ld.so.preload, R_OK) = -1 ENOENT (No such file or directory) open(/etc/ld.so.cache, O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=45938, ...}) = 0 mmap(NULL, 45938, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd67508000 close(3)= 0 open(/lib64/libselinux.so.1, O_RDONLY) = 3 read(3, \177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0PX\0D2\0\0\0..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=124624, ...}) = 0 mmap(0x324400, 2221912, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x324400 mprotect(0x324401d000, 2093056, PROT_NONE) = 0 mmap(0x324421c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c000) = 0x324421c000 mmap(0x324421e000, 1880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x324421e000 close(3)= 0 open(/lib64/libc.so.6, O_RDONLY) = 3 read(3, \177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\360\355\201B2\0\0\0..., 832) = 832 fstat(3, 
{st_mode=S_IFREG|0755, st_size=1979000, ...}) = 0 mmap(0x324280, 3803304, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x324280 mprotect(0x3242997000, 2097152, PROT_NONE) = 0 mmap(0x3242b97000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x197000) = 0x3242b97000 mmap(0x3242b9c000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3242b9c000 close(3)= 0 open(/lib64/libdl.so.2, O_RDONLY) = 3 read(3, \177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\0\1\0\0\0\340\r\300B2\0\0\0..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=22536, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67507000 mmap(0x3242c0, 2109696, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x3242c0 mprotect(0x3242c02000, 2097152, PROT_NONE) = 0 mmap(0x3242e02000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x3242e02000 close(3)= 0 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67505000 arch_prctl(ARCH_SET_FS, 0x7fbd675057a0) = 0 mprotect(0x324421c000, 4096, PROT_READ) = 0 mprotect(0x3242b97000, 16384, PROT_READ) = 0 mprotect(0x3242e02000, 4096, PROT_READ) = 0 mprotect(0x324261f000, 4096, PROT_READ) = 0 munmap(0x7fbd67508000, 45938) = 0 statfs(/selinux, {f_type=0xf97cff8c, f_bsize=4096, f_blocks=0, f_bfree=0, f_bavail=0, f_files=0, f_ffree=0, f_fsid={0, 0}, f_namelen=255, f_frsize=4096}) = 0 brk(0) = 0x2132000 brk(0x2153000) = 0x2153000 open(/usr/lib/locale/locale-archive, O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=99158704, ...}) = 0 mmap(NULL, 99158704, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fbd61674000 close(3)= 0 mkdir(1233, 0777) = -1 EACCES (Permission denied) open(/usr/share/locale/locale.alias, O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbd67513000 read(3, # Locale name alias data base.\n#..., 4096) = 2512 
read(3, , 4096) = 0 close(3)= 0 munmap(0x7fbd67513000, 4096)= 0 open(/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1 ENOENT (No such file or directory) open(/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1 ENOENT (No such file or directory) open(/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1 ENOENT (No such file or directory) open(/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo, O_RDONLY) = -1 ENOENT (No such file or directory)
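A quick way to walk the three security regimes mentioned in the reply above (mode bits, POSIX ACLs, SELinux). MNT is a placeholder for the mount point; it defaults to the current directory so the commands run anywhere, and the `setenforce` step is left commented because it changes system state:

```shell
MNT=${MNT:-.}

# 1. Classic mode bits and ownership. A trailing '.' in the mode
#    column (as in drwxrwxrwx.) means an SELinux context is attached.
ls -ld "$MNT"

# 2. POSIX ACLs can deny access even when the mode bits allow it.
command -v getfacl >/dev/null 2>&1 && getfacl "$MNT" || true

# 3. SELinux enforcement state and the directory's security context.
command -v getenforce >/dev/null 2>&1 && getenforce || true
ls -Zd "$MNT" 2>/dev/null || true

# If getenforce reports Enforcing, retry the failing mkdir after a
# temporary "setenforce 0" to confirm or rule out SELinux:
# setenforce 0 && mkdir "$MNT/test" && setenforce 1
```

The strace above already rules out the first regime: `mkdir(1233, 0777)` fails with EACCES despite `drwxrwxrwx`, which is exactly the pattern ACLs or SELinux produce.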
Re: [Ocfs2-users] Permission denied on ocfs2 cluster
strace may show more. I would first confirm that my perms are correct. On 03/15/2012 07:58 AM, зоррыч wrote: I am testing the scheme of drbd and ocfs2. If you attempt to write to the cluster, you get an error:

[root@noc-1-m77 share]# mkdir 12
mkdir: cannot create directory `12': Permission denied
[root@noc-1-m77 share]#

Config:

[root@noc-1-m77 /]# cat /etc/ocfs2/cluster.conf
cluster:
        node_count = 2
        name = cluster-ocfs2
node:
        ip_port =
        ip_address = 10.1.20.10
        number = 0
        name = noc-1-synt.rutube.ru
        cluster = cluster-ocfs2
node:
        ip_port =
        ip_address = 10.2.20.9
        number = 1
        name = noc-1-m77.rutube.ru
        cluster = cluster-ocfs2

Logs:

Mar 15 05:42:04 noc-1-synt kernel: OCFS2 1.5.0
Mar 15 05:42:04 noc-1-synt kernel: o2dlm: Nodes in domain 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
Mar 15 05:42:04 noc-1-synt kernel: ocfs2: Mounting device (147,0) on (node 1, slot 0) with ordered data mode.
Mar 15 05:42:07 noc-1-synt kernel: o2net: accepted connection from node noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Node 2 joins domain 5426CCF9AC414CD59E78F3AE48B9DE2C
Mar 15 05:42:11 noc-1-synt kernel: o2dlm: Nodes in domain 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Node 2 leaves domain 5426CCF9AC414CD59E78F3AE48B9DE2C
Mar 15 05:50:54 noc-1-synt kernel: o2dlm: Nodes in domain 5426CCF9AC414CD59E78F3AE48B9DE2C: 1
Mar 15 05:50:56 noc-1-synt kernel: o2net: connection to node noc-1-m77.rutube.ru (num 2) at 10.2.20.9: shutdown, state 8
Mar 15 05:50:56 noc-1-synt kernel: o2net: no longer connected to node noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
Mar 15 05:51:12 noc-1-synt kernel: ocfs2: Unmounting device (147,0) on (node 1)
Mar 15 05:51:45 noc-1-synt kernel: o2net: accepted connection from node noc-1-m77.rutube.ru (num 2) at 10.2.20.9:
Mar 15 05:51:47 noc-1-synt kernel: o2dlm: Nodes in domain 5426CCF9AC414CD59E78F3AE48B9DE2C: 1 2
Mar 15 05:51:47 noc-1-synt kernel: ocfs2: Mounting device (147,0) on (node 1, slot 1) with ordered data mode.
How do I fix this? ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] ocfs2-1.4.7 is not building in scientific linux 6.2
ocfs2 1.4 will not build with 2.6.32. A better solution is to just enable ocfs2 in the 2.6.32 kernel src tree and build. On 03/11/2012 07:37 AM, зоррыч wrote: Hi. I use scientific linux 6.2: [root@noc-1-m77 ocfs2-1.4.7]# cat /etc/redhat-release Scientific Linux release 6.2 (Carbon) [root@noc-1-m77 ocfs2-1.4.7]# uname -r 2.6.32-220.4.1.el6.x86_64 Does not compile: [root@noc-1-m77 ocfs2-1.4.7]# ./configure --with-kernel=/usr/src/kernels/2.6.32-220.7.1.el6.x86_64 checking build system type... x86_64-unknown-linux-gnu checking host system type... x86_64-unknown-linux-gnu checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking how to run the C preprocessor... gcc -E checking for a BSD-compatible install... /usr/bin/install -c checking whether ln -s works... yes checking for egrep... grep -E checking for ANSI C header files... yes checking for an ANSI C-conforming const... yes checking for vendor... not found checking for vendor kernel... not supported checking for debugging... no checking for directory with kernel build tree... /usr/src/kernels/2.6.32-220.7.1.el6.x86_64 checking for kernel version... 2.6.32-220.7.1.el6.x86_64 checking for directory with kernel sources... /usr/src/kernels/2.6.32-220.7.1.el6.x86_64 checking for kernel source version... 2.6.32-220.7.1.el6.x86_64 checking for struct delayed_work in workqueue.h... yes checking for uninitialized_var() in compiler-gcc4.h... yes checking for zero_user_page() in highmem.h... no checking for do_sync_mapping_range() in fs.h... yes checking for fault() in struct vm_operations_struct in mm.h... yes checking for f_path in fs.h... 
yes checking for enum umh_wait in kmod.h... yes checking for inc_nlink() in fs.h... yes checking for drop_nlink() in fs.h... yes checking for kmem_cache_create() with dtor arg in slab.h... no checking for kmem_cache_zalloc in slab.h... yes checking for flag FS_RENAME_DOES_D_MOVE in fs.h... yes checking for enum FS_OCFS2 in sysctl.h... yes checking for configfs_depend_item() in configfs.h... yes checking for register_sysctl() with two args in sysctl.h... no checking for su_mutex in struct configfs_subsystem in configfs.h... yes checking for struct subsystem in kobject.h... no checking for is_owner_or_cap() in fs.h... yes checking for fallocate() in fs.h... yes checking for struct splice_desc in splice.h... yes checking for MNT_RELATIME in mount.h... yes checking for should_remove_suid() in fs.h... no checking for generic_segment_checks() in fs.h... no checking for s_op declared as const in struct super_block in fs.h... yes checking for i_op declared as const in struct inode in fs.h... yes checking for f_op declared as const in struct file in fs.h... yes checking for a_ops declared as const in struct address_space in fs.h... yes checking for aio_read() in struct file_operations using iovec in fs.h... yes checking for __splice_from_pipe() in splice.h... yes checking for old bio_end_io_t in bio.h... no checking for b_size is u32 struct buffer_head in buffer_head.h... no checking for exportfs.h... yes checking for linux/lockdep.h... yes checking for mandatory_lock() in fs.h... yes checking for range prefix in struct writeback_control... yes checking for SYNC_FILE_RANGE flags... yes checking for blkcnt_t in types.h... yes checking for i_private in struct inode... yes checking for page_mkwrite in struct vm_operations_struct... no checking for get_sb_bdev() with 5 arguments in fs.h... no checking for read_mapping_page in pagemap.h... yes checking for ino_t in filldir_t in fs.h... no checking for invalidatepage returning int in fs.h... no checking for get_blocks_t type... 
no checking for linux/uaccess.h... yes checking for system_utsname in utsname.h... no checking for MS_LOOP_NO_AOPS flag defined... no checking for fops->sendfile() in fs.h... no checking for task_pid_nr in sched.h... yes checking for confirm() in struct pipe_buf_operations in pipe_fs_i.h... yes checking for mutex_lock_nested() in mutex.h... yes checking for inode_double_lock() in fs.h... no checking for splice_read() in fs.h... yes checking for sops->statfs takes struct super_block * in fs.h... no checking for le16_add_cpu() in byteorder/generic.h... yes checking for le32_add_cpu() in byteorder/generic.h... yes checking for le64_add_cpu() in byteorder/generic.h... yes checking for be32_add_cpu() in byteorder/generic.h... yes checking for clear_nlink() in fs.h... yes configure: creating ./config.status config.status: creating Config.make
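Following the suggestion at the top of the thread, the in-tree route is to switch the OCFS2 options on (under File systems in `make menuconfig`) and rebuild the modules. A sketch of the relevant .config lines as they appear in 2.6.32-era trees — verify the exact set in your own tree:

```text
CONFIG_CONFIGFS_FS=m
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_USERSPACE_CLUSTER=m
CONFIG_OCFS2_FS_STATS=y
```

followed by `make modules && make modules_install` against the running kernel. This sidesteps the out-of-tree 1.4 build entirely.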
Re: [Ocfs2-users] ocfs2console hangs on startup
ocfs2console has been obsoleted. Just use the utilities directly. To detect ocfs2 volumes, use blkid. You can use it to restrict the lookup paths. Refer to its manpage. On 03/09/2012 06:15 PM, John Major wrote: Hi, Hope this is the right place to ask this. I have set up 2 ubuntu lts machines with an IBM iscsi san. I have set up multipathd and ocfs2 and it seems to be working. The problem is that when I run up ocfs2console it hangs (the console app, not the system). Using strace, I can see that it is running through all the /dev/sdx devices and loops trying to access the first one in 'ghost' state per 'multipath -ll'. Is there a way to restrict which devices the app looks at as it starts, to say /dev/mapper/mpath*, since I don't actually want it to access any of the /dev/sd.. devices directly? ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
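A sketch of the blkid usage pointed at above; the multipath device names are placeholders, so substitute your own /dev/mapper entries:

```text
# Probe only the named multipath devices, never the underlying /dev/sdX paths:
blkid /dev/mapper/mpath0 /dev/mapper/mpath1

# Or filter by filesystem type:
blkid -t TYPE=ocfs2
```

Passing explicit device arguments restricts probing to exactly those paths, which avoids the hang on the 'ghost' multipath member.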
Re: [Ocfs2-users] OCFS2 1.2/1.6
The file system on-disk image has not changed. So the 1.6 file system software can mount the volume created with 1.2 mkfs. What you cannot do is concurrently mount the same volume with nodes running 1.2 and 1.6 versions of the file system software. It is not mixed mode. The 1.6 fs software will read the on-disk features on the 1.2 volume and limit the functioning on that volume to just that. Perfectly normal. Yes, you can add the tablespace on the 1.2 volume. For the 1.2 volume to be able to use 1.6 features, the said features will have to be enabled. Once you do enable those features, the volume will not be mountable on the older RHEL4 boxes unless those features are disabled. There is a whole section in the users' guide that explains this in more detail. On 03/02/2012 08:09 AM, Maki, Nancy wrote: We are in the process of migrating to new database servers. Our current RAC clusters are running OCFS2 1.2.9 on RHEL 4. Our new servers are running OCFS2 1.6 OEL5. If possible, we would like to minimize the amount of data that needs to move as we migrate to the new servers. We have the following questions: 1.Can we mount an existing OCFS2 1.2 volume on a servers running OCFS2 1.6? 2.Are there any negative implications of being in a mixed mode? 3.If we need to add a OCFS2 1.6 volume to increase a tablespace size, can we have one datafile be OCFS2 1.2 and another be OCFS2 1.6 for the same tablespace? 4.Can we use OCFS2 1.6 features against an OCFS2 1.2 volume mounted on OCFS2 1.6? 
Thank you, Nancy

Nancy Maki
Manager of Database Services
Office of Information Technology
The State University of New York
http://www.suny.edu/
State University Plaza - Albany, New York 12246
Tel: 518.320.1213 Fax: 518.320.1550
eMail: nancy.m...@suny.edu
Be a part of Generation SUNY: Facebook http://www.facebook.com/generationsuny - Twitter http://www.twitter.com/generationsuny - YouTube http://www.youtube.com/generationsuny

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
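On question 4 above: enabling newer on-disk features on the old volume is done with tunefs.ocfs2. A hedged sketch — the device path is hypothetical, the volume must be unmounted cluster-wide first, and the exact feature list your tools support is in the tunefs.ocfs2 man page:

```text
# Enable 1.6-era features on the 1.2-formatted volume:
tunefs.ocfs2 --fs-features=sparse,unwritten /dev/mapper/oldvol

# Disable them again if the volume must mount on the RHEL4 nodes:
tunefs.ocfs2 --fs-features=nosparse,nounwritten /dev/mapper/oldvol
```

As the answer notes, once enabled these features make the volume unmountable on the older 1.2 nodes until they are disabled again.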
Re: [Ocfs2-users] Ocfs2-users Digest, Vol 98, Issue 9
On 02/29/2012 04:10 PM, David Johle wrote: I too have seen some serious performance issues under 1.4, especially with writes. I'll share some info I've gathered on this topic, take it however you wish... In the past I never really thought about running benchmarks against the shared block device as a baseline to compare with the filesystem. So today I did run several dd tests of my own (both read and write) against a shared block device (different LUN, but using the exact same storage hardware, including specific disks, as the one with OCFS2). My tests were not in line with those of Erik Schwartz, as I determined the performance degradations to be OCFS2 related. I have a fs shared by 2 nodes; both are dual quad-core xeon systems with 2 dedicated storage NICs per box. Storage is a Dell/EqualLogic iSCSI SAN with 3 gigE NICs, dedicated gigE switches, using jumbo frames. I'm using dm-multipath as well. RHEL5 (2.6.18-194.3.1.el5 kernel) ocfs2-2.6.18-194.11.4.el5-1.4.7-1.el5 ocfs2-tools-1.4.4-1.el5 Using the individual /dev/sdX vs. the /dev/mapper/mpathX devices indicates that multipath is working properly, as the numbers are close to double what the separate paths each give. Given the hardware, I'd consider 200MB/s a limit for a single box and 300MB/s the limit for the SAN.

Block device: Sequential reads tend to be in the 180-190MB/s range with just one node reading. Both nodes simultaneously reading gives about 260-270MB/s total throughput. Sequential writes tend to be in the 115-140MB/s range with just one node writing. Both nodes simultaneously writing gives about 200-230MB/s total throughput.

OCFS2: Sequential reads tend to be in the 80-95MB/s range with just one node reading. Both nodes simultaneously reading gives about 125-135MB/s total throughput. Sequential writes tend to be in the 5-20MB/s range with just one node writing. Both nodes simultaneously writing (different files) gives unbearably slow performance of less than 1MB/s total throughput.
Now one thing I will say is that I was testing on a mature filesystem that has been in use for quite some time. Tons of file directory creation, reading, updating, deleting, over the course of a couple years. So to see how that might affect things, I then created a new filesystem on that same block device I used above (with same options as the mature one) and ran the set of dd-based fs tests on that. Create params: -b 4K -C 4K --fs-features=backup-super,sparse,unwritten,inline-data Mount params: -o noatime,data=writeback Fresh OCFS2: Sequential reads tend to be in the 100-125MB/s range with just one node reading. Both nodes simultaneously reading gives about 165-180MB/s total throughput. Sequential writes tend to be in the 120-140MB/s range with just one node writing. Both nodes simultaneously writing (different files) gives reasonable performance of around 100MB/s total throughput. Wow, what a difference! I will say that, for the mature filesystem above that is performing poorly, it has definitely gotten worse over time. It seems to me that the filesystem itself has some time or usage based performance degradation issues. I'm actually thinking it would be to the benefit of my cluster to create a new volume, shut down all applications, copy the contents over, shuffle mount points, and start it all back up. The only problem is that this will make for some highly unappreciated downtime! Also, I'm concerned that all that copying and loading it up with contents may just result in the same performance losses, making the whole process just wasted effort. We have worked on reducing fragmentation in later releases. One specific feature added was allocation reservation (in 2.6.35). It is available in prod releases starting 1.6. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Concurrent write performance issues with OCFS2
In 1.4, the local allocator window is small. 8MB. Meaning the node has to hit the global bitmap after every 8MB. In later releases, the window is much larger. Second, a single node is not a good baseline. A better baseline is multiple nodes writing concurrently to the block device. Not fs. Use dd. Set different write offsets. This should help figure out how the shared device works with multiple nodes. On 2/28/2012 9:24 AM, Erik Schwartz wrote: I have a two-node RHEL5 cluster that runs the following Linux kernel and accompanying OCFS2 module packages: * kernel-2.6.18-274.17.1.el5 * ocfs2-2.6.18-274.17.1.el5-1.4.7-1.el5 A 2.5TB LUN is presented to both nodes via DM-Multipath. I have carved out a single partition (using the entire LUN), and formatted it with OCFS2: # mkfs.ocfs2 -N 2 -L 'foofs' -T datafiles /dev/mapper/bams01p1 Finally, the filesystem is mounted to both nodes with the following options: # mount | grep bams01 /dev/mapper/bams01p1 on /foofs type ocfs2 (rw,_netdev,noatime,data=writeback,heartbeat=local) -- When a single node is writing arbitrary data (i.e. dd(1) with /dev/zero as input) to a large (say, 10 GB) file in /foofs, I see the expected performance of ~850 MB/sec. If both nodes are concurrently writing large files full of zeros to /foofs, performance drops way down to ~45 MB/s. I experimented with each node writing to /foofs/test01/ and /foofs/test02/ subdirectories, respectively, and found that performance increased slightly to a - still poor - 65 MB/s. -- I understand from searching past mailing list threads that the culprit is likely related to the negotiation of file locks, and waiting for data to be flushed to journal / disk. My two questions are: 1. Does this dramatic write performance slowdown sound reasonable and expected? 2. Are there any OCFS2-level steps I can take to improve this situation? Thanks - ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
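The concurrent raw-device baseline suggested above can be sketched with dd. Every value here is a placeholder: point DEV at the shared LUN (e.g. the /dev/mapper path from the post), raise COUNT_MB, add `oflag=direct` when writing to a real device, and give each node a distinct OFFSET_MB so the writes land in disjoint regions:

```shell
# Concurrent-write baseline against the shared block device, not the fs.
DEV=${DEV:-$(mktemp)}        # substitute the shared LUN, e.g. /dev/mapper/bams01p1
OFFSET_MB=${OFFSET_MB:-0}    # e.g. 0 on node 1, 102400 on node 2
COUNT_MB=${COUNT_MB:-8}      # e.g. 10240 for a 10 GB run

# seek= skips OFFSET_MB megabytes on the output before writing,
# so simultaneous runs from different nodes never overlap.
dd if=/dev/zero of="$DEV" bs=1M count="$COUNT_MB" \
   seek="$OFFSET_MB" conv=notrunc 2>&1 | tail -n 1
```

If both nodes writing raw to the device sustain near the single-node rate, the slowdown is in the filesystem layer (lock negotiation, allocator), not the storage path.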
Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
On 02/01/2012 07:02 AM, Mark wrote: One more thing. When I straced one of the application processes (these are the processes that create the files) I saw this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.94    3.002017         111     27154           open
 18.93    0.929679           2    418108           read
 12.40    0.543714           2    257548           write

So it seems that inode creation is the biggest time consumer by far.

Yes. open() triggers cluster lock creation which cannot be skipped. Reads and writes could skip cluster activity if the node already has the appropriate lock level. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
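For reference, a per-syscall summary like the one above comes from strace's counting mode attached to a running process (the PID is a placeholder):

```text
strace -c -p <pid>    # detach with Ctrl-C to print the per-syscall totals
```

The usecs/call column is what makes the cluster-lock cost visible: open() at ~111 µs per call versus ~2 µs for read() and write() that already hold the lock.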
Re: [Ocfs2-users] Extend space on ocfs mount point
I am not aware of any downsizes in resizing. On 02/01/2012 09:57 AM, Kalra, Pratima wrote: We have a ucm installation on ocfs mount point and we need to increase the space on that mount point from 20gb to 30 gb. Is this possible without resulting in any after effects? Pratima. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
debugfs.ocfs2 -R stats /dev/mapper/... I want to see the features enabled. The main issue with large metadata is the fsck timing. The recently tagged 1.8 release of the tools has much better fsck performance. On 02/01/2012 05:25 AM, Mark Hampton wrote: We have an application that has many processing threads writing more than a billion files ranging from 2KB – 50KB, with 50% under 8KB (currently there are 700 million files). The files are never deleted or modified – they are written once, and read infrequently. The files are hashed so that they are evenly distributed across ~1,000,000 subdirectories up to 3 levels deep, with up to 1000 files per directory. The directories are structured like this: 0/00/00 0/00/01 … F/FF/FE F/FF/FF The files need to be readable and writable across a number of servers. The NetApp filer we purchased for this project has both NFS and iSCSI capabilities. We first tried doing this via NFS. After writing 700 million files (12 TB) into a single NetApp volume, file-write performance became abysmally slow. We can't create more than 200 files per second on the NetApp volume, which is about 20% of our required performance target of 1000 files per second. It appears that most of the file-write time is going towards stat and inode-create operations. So now I’m trying the same thing with OCFS2 over iSCSI. I created 16 luns on the NetApp. The 16 luns became 16 OCFS2 filesystems with 16 different mount points on our servers. With this configuration I was initially able to write ~1800 files per second. Now that I have completed 100 million files, performance has dropped to ~1500 files per second. I’m using OEL 6.1 (2.6.32-100 kernel) with OCFS2 version 1.6. The application servers have 128GB of memory.
I created my OCFS2 filesystems as follows:

mkfs.ocfs2 -T mail -b 4k -C 4k -L "my label" --fs-features=indexed-dirs --fs-feature-level=max-features /dev/mapper/my device

And I mount them with these options:

_netdev,commit=30,noatime,localflocks,localalloc=32

So my questions are these: 1) Given a billion files sized 2KB – 50KB, with 50% under 8KB, do I have the optimal OCFS2 filesystem and mount-point configurations? 2) Should I split the files across even more filesystems? Currently I have them split across 16 OCFS2 filesystems. Thanks a billion! ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
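The 0/00/00-style layout described above amounts to sharding on the leading hex digits of a hash. A minimal sketch — md5 here is an assumption, since the post doesn't say which hash is used; any hash with an even spread behaves the same:

```shell
# Derive the 3-level shard path (1 hex digit / 2 digits / 2 digits)
# from the first five hex characters of the file name's hash.
name="example.dat"
h=$(printf '%s' "$name" | md5sum | cut -c1-5 | tr 'a-f' 'A-F')

c1=$(printf '%s' "$h" | cut -c1)      # top level:    0..F
c2=$(printf '%s' "$h" | cut -c2-3)    # second level: 00..FF
c3=$(printf '%s' "$h" | cut -c4-5)    # third level:  00..FF

path="$c1/$c2/$c3/$name"
echo "$path"
```

With 16 × 256 × 256 ≈ 1,000,000 leaf directories, a billion files averages out to the ~1000 files per directory the post describes.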
Re: [Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
On 02/01/2012 10:24 AM, Mark Hampton wrote: Here's what I got from debugfs.ocfs2 -R stats. I have to type it out manually, so I'm only including the features lines: Feature Compat: 3 backup-super strict-journal-super Feature Incompat: 16208 sparse extended-slotmap inline-data metaecc xattr indexed-dirs refcount discontig-bg Feature RO compat: 7 unwritten usrquota grpquota Some other info that may be interesting: Links: 0 Clusters: 52428544 I would disable quotas. That line suggests the vol is 200G in size. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
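That estimate checks out, assuming the 4 KB cluster size from the mkfs line earlier in the thread:

```shell
# 52428544 clusters x 4096 bytes per cluster, expressed in GiB.
clusters=52428544
bytes=$((clusters * 4096))
gib=$((bytes / 1073741824))   # integer division truncates 199.999... down
echo "${gib} GiB"             # prints 199, i.e. just under 200 GiB
```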
Re: [Ocfs2-users] Bad magic number in inode
inode#11 is in the system directory. fsck cannot fix this automatically. If the corruption is limited, there is a chance the inodes could be recreated manually. But do look at backups to restore. On 02/01/2012 10:20 AM, Werner Flamme wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, when I try to mount an OCFS2 volume, I get - ---snip--- [12212.195823] OCFS2: ERROR (device sde1): ocfs2_validate_inode_block: Invalid dinode #11: signature = [12212.195825] [12212.195827] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted. [12212.195832] (mount.ocfs2,9772,0):ocfs2_read_locked_inode:499 ERROR: status = -22 [12212.195842] (mount.ocfs2,9772,0):_ocfs2_get_system_file_inode:158 ERROR: status = -116 [12212.195853] (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:475 ERROR: status = -22 [12212.195860] (mount.ocfs2,9772,0):ocfs2_init_global_system_inodes:478 ERROR: Unable to load system inode 4, possibly corrupt fs? [12212.195862] (mount.ocfs2,9772,0):ocfs2_initialize_super:2379 ERROR: status = -22 [12212.195864] (mount.ocfs2,9772,0):ocfs2_fill_super:1064 ERROR: status = -22 [12212.195869] ocfs2: Unmounting device (8,65) on (node 0) - ---pins--- And doing an fsck, it looks like this: - ---snip--- # fsck.ocfs2 -f /dev/disk/by-label/ERSATZ fsck.ocfs2 1.8.0 Checking OCFS2 filesystem in /dev/disk/by-label/ERSATZ: Label: ERSATZ UUID: AEB995484F2D4D19835AA380CAE0683A Number of blocks: 268434093 Block size: 4096 Number of clusters: 268434093 Cluster size: 4096 Number of slots:40 /dev/disk/by-label/ERSATZ was run with -f, check forced. Pass 0a: Checking cluster allocation chains pass0: Bad magic number in inode reading inode alloc inode 11 for verification fsck.ocfs2: Bad magic number in inode while performing pass 0 - ---pins--- Any chance to access the filesystem other that reformatting it? The node ist the only node that can access this volume. 
I plan to share it via iSCSI, but first it must be mountable... There are 3 other volumes in this cluster, mounted by about a dozen nodes. Regards, Werner ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
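Before deciding between manual inode repair and a restore, the extent of the damage can be gauged read-only with debugfs.ocfs2 (the double slash names the system directory; the device path is taken from the post above):

```text
debugfs.ocfs2 -R "ls -l //" /dev/disk/by-label/ERSATZ
```

If only inode 11 and its neighbours fail to list while the rest of the system directory is intact, the corruption is limited in the sense the reply describes.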
Re: [Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware
Symmetric clustering works best when the nodes are comparable because all nodes have to work in sync. NFS may be more suitable for your needs. On 01/26/2012 05:51 PM, Jorge Adrian Salaices wrote: I have been working on trying to convince Mgmt at work that we should move from NFS to OCFS2 for the sharing of the Application Layer of our Oracle EBS (Enterprise Business Suite), and for just a general Backup Share, but general instability in my setup has dissuaded me from recommending it. I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and something as simple as an umount has triggered random node reboots, even on nodes that have other OCFS2 mounts not shared by the rebooting nodes. You see, the problem I have is that I have disparate hardware, and some of these servers are even VMs. Several documents state that nodes have to be somewhat equal in power and specs, and in my case that will never be. Unfortunately for me, I have had several other events of random fencing that have been unexplained by common checks; i.e. my network has never been the problem, yet one server may see another one go away when all of the other services on that node may be running perfectly fine. I can only surmise that the reason is an elevated load on the server that starved the heartbeat process, preventing it from sending network packets to other nodes. My config has about 40 nodes on it. I have 4 or 5 different shared LUNs out of our SAN and not all servers share all mounts, meaning only 10 or 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third; unfortunately the complexity is such that a server may intersect with some of the servers but not all.
Perhaps a change in my config to create separate clusters may be the solution, but only if a node can be part of multiple clusters:

node:
        ip_port =
        ip_address = 172.20.16.151
        number = 1
        name = txri-oprdracdb-1.tomkinsbp.com
        cluster = ocfs2-back
node:
        ip_port =
        ip_address = 172.20.16.152
        number = 2
        name = txri-oprdracdb-2.tomkinsbp.com
        cluster = ocfs2-back
node:
        ip_port =
        ip_address = 10.30.12.172
        number = 4
        name = txri-util01.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back
node:
        ip_port =
        ip_address = 10.30.12.94
        number = 5
        name = txri-util02.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back
cluster:
        node_count = 2
        name = ocfs2-back
cluster:
        node_count = 2
        name = ocfs2-util

Is this even legal, or can it be done some other way? Or is this done based on the different domains that are created once a mount is done? How can I make the cluster more stable? And why does a node fence itself on the cluster even if it does not have any locks on the shared LUN? It seems to me that the node may be fenceable simply by having the OCFS2 services turned on, without a mount. Is this correct? Another question I have: can the fencing method be other than panic or restart? Can a third party or a userland event be triggered to recover from what may be construed by the heartbeat or network tests as a downed node? Thanks for any of the help you can give me. -- Jorge Adrian Salaices Sr. Linux Engineer Tomkins Building Products ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] One node, two clusters?
You don't need to have two clusters for this. This can be accomplished with one cluster with the default local heartbeat. Create one cluster.conf with all the nodes. All nodes, except the one machine, will mount from just one san. The common node will mount from both sans. If you look at the cluster membership, other than the common node, all nodes will be interacting (network connection, etc.) with nodes that they can see on the san. On 12/22/2011 09:40 AM, Werner Flamme wrote: Kushnir, Michael (NIH/NLM/LHC) [C] [22.12.2011 18:20]: Is it possible to have one machine be part of two different ocfs2 clusters with two different sans? Kind of to serve as a bridge for moving data between two clusters but without actually fully combining the two clusters? Thanks, Michael Michael, I asked this two years ago and the answer was no. When I look at /etc/ocfs2/cluster.conf, I do not see a possibility to configure a second cluster. Though the nodes must be assigned to a cluster (and exactly one cluster, that is), there is only one cluster: entry in the file, and so there is no way to define a second one. We synced via rsync :-( HTH Werner ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
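A sketch of the single-cluster layout described above, with hypothetical node names and addresses and the default o2cb port; the bridge machine is just one more node entry and simply mounts volumes from both SANs (node names must match each machine's hostname for o2cb to start):

```text
cluster:
        node_count = 3
        name = mycluster

node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 0
        name = san1-node
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 1
        name = san2-node
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 10.0.0.3
        number = 2
        name = bridge-node
        cluster = mycluster
```

Which SAN each node can reach is decided purely by which volumes it mounts, not by the cluster definition.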
Re: [Ocfs2-users] One node, two clusters?
On 12/22/2011 10:39 AM, Kushnir, Michael (NIH/NLM/LHC) [C] wrote: Is there a separate DLM instance for each ocfs2 volume? I have two sub-clusters in the same cluster... A 10 node Hadoop cluster sharing a SATA RAID10 and a Two node web server cluster sharing a SSD RAID0. One server mounts both volumes to move data between as necessary. This morning I got the following error (see end of message), and all nodes lost access to all storage. I'm trying to mitigate risk of this happening again. My hadoop nodes are used to generate search engine indexes, so they can go down. But my web servers provide the search engine service so I need them to not be tied to my hadoop nodes. I just feel safer that way. At the same time, I need a bridge node to move data between the two. I can do it via NFS or SCP, but I figured it'd be worth while to ask if one node can be in two different clusters. Dec 22 09:15:42 lhce-imed-web1 kernel: (updatedb,1832,1):dlm_get_lock_resource:898 042F68B6AF134E5C9A9EDF4D7BD7BE99:O0013d2ef94: at least one node (11) to recover before lock mastery can begin You should add ocfs2 to PRUNEFS in /etc/updatedb.conf. updatedb generates a lot of io and network traffic. And it will happen around the same time on all nodes. Yes, each volume has a different dlm domain (instance). ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
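The suggested change is one line in /etc/updatedb.conf; the other filesystem types shown are illustrative, so keep whatever your distribution already lists and just append ocfs2:

```text
PRUNEFS="NFS nfs nfs4 ocfs2"
```

After that, updatedb skips OCFS2 mounts entirely, so the nightly cron run no longer generates the simultaneous cross-node IO described above.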
Re: [Ocfs2-users] reflink status
First we have to get the new syscall added to the kernel. The first attempt failed because people overloaded the call with extraneous stuff. Recently there has been another attempt to go back to the original proposal. Hopefully, next kernel release. The reflink utility should work, even though it is based on an older coreutils. It is derived from the hard link (ln) utility. On 12/17/2011 4:15 AM, richard -rw- weinberger wrote: Hi! What do I need to use reflinks on OCFS2 1.6? coreutils 8.4's cp --reflink=always doesn't seem to work. I found http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog and http://oss.oracle.com/~smushran/reflink-tools/ Both contain a patched and outdated coreutils package. Are there any plans to merge it upstream? ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] reflink status
On 12/17/2011 12:05 PM, richard -rw- weinberger wrote: The reflink utility should work. So what it is based on an older coreutils. It is derived from the hard link (ln) utility. So, building it from http://oss.oracle.com/git/?p=jlbec/reflink.git;a=shortlog via reflink.spec is the way to go? For now, yes. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up
To analyze, one needs the logs. And a bugzilla is a good placeholder for the logs. On Dec 1, 2011, at 6:05 PM, Tony Rios t...@tonyrios.com wrote: Sunil, Is submitting a bug report the only answer? I'm happy to send in this information, but can I take the cluster down entirely and sort of reset it so we can get these servers back online and talking again in the meanwhile? Tony On Dec 1, 2011, at 5:05 PM, Sunil Mushran wrote: Node 3 is joining the domain. It is having problems getting the superblock cluster lock. Create a bugzilla on oss.oracle.com and attach the /var/log/messages from all nodes. If you have netconsole setup, attach those logs too. On 12/01/2011 04:55 PM, Tony Rios wrote: I'm having an issue today where I just can't seem to keep all the servers in the cluster online. They aren't losing network connectivity, and I can ping the iSCSI host just fine and the host is logged in. These are the errors from dmesg when I try to mount the filesystem: root@pedge36:~# dmesg [lengthy kernel boot log elided: the paste contains only standard e820 memory-map, MTRR/PAT, and ACPI-table messages from a Dell PowerEdge 850 booting Ubuntu kernel 2.6.38-10-generic, and was truncated before any OCFS2 output]
Re: [Ocfs2-users] Monitoring progress of fsck.ocfs2
Do: cat /proc/PID/stack It is probably stuck in the block layer. On 11/18/2011 08:33 AM, Nick Khamis wrote: Hello Everyone, I just ran fsck.ocfs2 on /dev/drbd0, which is a one-gig partition on a VM with limited resources (100 MB of RAM). I am worried that the process crashed, because it has not responded in the past hour or so. fsck.ocfs2 /dev/drbd0 fsck.ocfs2 1.6.4 [RECOVER_CLUSTER_INFO] The running cluster is using the cman stack with the cluster name ASTCluster, but the filesystem is configured for the classic o2cb stack. Thus, fsck.ocfs2 cannot determine whether the filesystem is in use. fsck.ocfs2 can reconfigure the filesystem to use the currently running cluster configuration. DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION. Recover cluster configuration information from the running cluster? <n> y ps -u root ... 8040 pts/0 00:00:00 fsck.ocfs2 I want to mention that I did issue a ctrl+c and ctrl+x when I panicked, but I do not think anything happened. Nick ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
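Sunil's check can be scripted; a minimal sketch, assuming the process name is fsck.ocfs2 and that /proc/PID/stack is readable (it generally requires root):

```shell
# Locate the running fsck and check whether it is blocked in the kernel.
pid=$(pgrep -x fsck.ocfs2 | head -n1)
if [ -n "$pid" ]; then
    awk '/^State:/ {print $2, $3}' /proc/"$pid"/status  # D = uninterruptible I/O wait
    cat /proc/"$pid"/stack 2>/dev/null                  # kernel stack; needs root
fi
```

A state of D combined with block-layer frames in the stack means the process is waiting on I/O, not crashed.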
Re: [Ocfs2-users] Number of Nodes defined
It must be the same fragmentation issue that we've addressed in 1.6 and later. Is this 1.4? On 11/17/2011 08:45 AM, David wrote: Sunil, et al, The reason I needed to make this change was because the ocfs2 partition, which is 101G in size with 41G currently in use, ran out of disk space even though the OS was reporting 60G available. I had this issue once before and found that the node slot count of that cluster was set to 4 even though there were only 2 nodes in the cluster. When I reduced the node slots to 2, disk space was freed up. I made these changes to this cluster; I reduced the node slots to 2 and everything worked until this morning, when the same error returned: "No space left on device". The OS is still showing available disk space but, as the error suggests, I can't write to the partition. Any idea what could be happening? On 11/16/2011 05:45 PM, Sunil Mushran wrote: Reducing node-slots frees up the journal and distributes the metadata that that slot was tracking to the remaining slots. I am not aware of any reason why there should be an impact. On 11/16/2011 03:07 PM, David wrote: I did read the man page for tunefs.ocfs2 but I didn't see anything indicating what the impact to the fs would be when making a change to an existing fs, such as reducing the node slots. Anyway, thank you for the feedback; I was able to make the changes with no impact to the fs. David On 11/16/2011 12:12 PM, Sunil Mushran wrote: man tunefs.ocfs2 It cannot be done in an active cluster. But it can be done without having to reformat the volume. On 11/16/2011 10:08 AM, David wrote: I wasn't able to find any documentation that answers whether or not the number of nodes defined for a cluster can be reduced on an active cluster, as seen via: tunefs.ocfs2 -Q "%B %T %N\n" Does anyone know if this can be done, or do I have to copy the data off of the fs, make the changes, reformat the fs and copy the data back?
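The workflow discussed in this thread can be sketched as follows. /dev/sdX is a placeholder, the sample query output is assumed for illustration, and the volume must be unmounted on every node before the slot count is changed:

```shell
# 1. Query the current geometry; the last field of the -Q "%B %T %N\n"
#    format used in this thread is the node-slot count:
#      tunefs.ocfs2 -Q "%B %T %N\n" /dev/sdX
out="4096 131072 4"        # assumed sample output
slots=${out##* }           # last whitespace-separated field
echo "current node slots: $slots"
# 2. With the filesystem unmounted cluster-wide, shrink to the real node count:
#      tunefs.ocfs2 -N 2 /dev/sdX
```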
Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs
fstype is a handy way to format the volume with parameters that are thought to be useful for that use-case. The result of this is printed during format by way of the parameters selected. man mkfs.ocfs2 has a blurb about the features it enables by default. On 11/16/2011 08:45 AM, Artur Baruchi wrote: Hi. I tried to find some information about the vmstore option used when formatting a device, but didn't find anything about it (no documentation; I did some greps inside the source code, but nothing turned up). My questions: - What kind of optimization does this option apply to my file system to store VM images? I mean, what exactly does this option do? - Where in the source code can I find the part that makes this optimization? Thanks in advance. Att. Artur Baruchi
Re: [Ocfs2-users] [Ocfs2-devel] vmstore option - mkfs
Yes. But those are just the features. It also selects the appropriate cluster size, block size, journal size, etc. All the params selected are printed by mkfs. You also have the option of running with --dry-run to see the params. On 11/16/2011 09:41 AM, Artur Baruchi wrote: I just found this: + {OCFS2_FEATURE_COMPAT_BACKUP_SB | OCFS2_FEATURE_COMPAT_JBD2_SB, + OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC | + OCFS2_FEATURE_INCOMPAT_INLINE_DATA | + OCFS2_FEATURE_INCOMPAT_XATTR | + OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE, + OCFS2_FEATURE_RO_COMPAT_UNWRITTEN}, /* FS_VMSTORE */ These options are the ones that are enabled by default when choosing vmstore. Is this correct? Thanks. Att. Artur Baruchi On Wed, Nov 16, 2011 at 3:26 PM, Sunil Mushran sunil.mush...@oracle.com wrote: fstype is a handy way to format the volume with parameters that are thought to be useful for that use-case. The result of this is printed during format by way of the parameters selected. man mkfs.ocfs2 has a blurb about the features it enables by default. [...]
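A sketch of previewing the vmstore selections without writing anything. The --dry-run invocation and the feature-name mapping below are assumptions based on the flags in the source excerpt, and /dev/sdX is a placeholder:

```shell
# Preview-only format (prints the chosen block/cluster/journal sizes and features):
#   mkfs.ocfs2 -T vmstore --dry-run /dev/sdX
# The non-compat feature flags from the source excerpt, by their assumed
# mkfs.ocfs2 --fs-features names:
features="sparse,unwritten,inline-data,xattr,refcount"
n=$(echo "$features" | tr ',' '\n' | wc -l)
echo "vmstore enables $n extra features"
```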
Re: [Ocfs2-users] Number of Nodes defined
man tunefs.ocfs2 It cannot be done in an active cluster. But it can be done without having to reformat the volume. On 11/16/2011 10:08 AM, David wrote: I wasn't able to find any documentation that answers whether or not the number of nodes defined for a cluster can be reduced on an active cluster, as seen via: tunefs.ocfs2 -Q "%B %T %N\n" Does anyone know if this can be done, or do I have to copy the data off of the fs, make the changes, reformat the fs and copy the data back?
Re: [Ocfs2-users] Number of Nodes defined
Reducing node-slots frees up the journal and distributes the metadata that that slot was tracking to the remaining slots. I am not aware of any reason why there should be an impact. On 11/16/2011 03:07 PM, David wrote: I did read the man page for tunefs.ocfs2 but I didn't see anything indicating what the impact to the fs would be when making a change to an existing fs, such as reducing the node slots. Anyway, thank you for the feedback; I was able to make the changes with no impact to the fs. David On 11/16/2011 12:12 PM, Sunil Mushran wrote: man tunefs.ocfs2 It cannot be done in an active cluster. But it can be done without having to reformat the volume. On 11/16/2011 10:08 AM, David wrote: I wasn't able to find any documentation that answers whether or not the number of nodes defined for a cluster can be reduced on an active cluster, as seen via: tunefs.ocfs2 -Q "%B %T %N\n" Does anyone know if this can be done, or do I have to copy the data off of the fs, make the changes, reformat the fs and copy the data back?
Re: [Ocfs2-users] dlm locking
o2image is only useful for debugging. It allows us to get a copy of the file system on which we can test fsck in-house. The files in lost+found have to be resolved manually. If they are junk, delete them. If useful, move them to another directory. On 11/11/2011 05:36 PM, Nick Khamis wrote: All fixed! Just a few questions. Is there any documentation on how to diagnose an ocfs2 filesystem: * How to transfer an image file for testing onto a different machine, as you did with o2image.out * Does fsck.ocfs2 -fy /dev/loop0 pretty much fix all the common problems? * What can I do with the files in lost+found? Thanks again, Nick. On Fri, Nov 11, 2011 at 8:02 PM, Sunil Mushran sunil.mush...@oracle.com wrote: So it detected one cluster that was doubly allocated. It fixed it. Details below. The other fixes could be because the o2image was taken on a live volume. As to how this could happen... I would look at the storage. # fsck.ocfs2 -fy /dev/loop0 fsck.ocfs2 1.6.3 Checking OCFS2 filesystem in /dev/loop0: Label: AsteriskServer UUID: 3A791AB36DED41008E58CEF52EBEEFD3 Number of blocks: 592384 Block size: 4096 Number of clusters: 592384 Cluster size: 4096 Number of slots: 2 /dev/loop0 was run with -f, check forced. Pass 0a: Checking cluster allocation chains Pass 0b: Checking inode allocation chains Pass 0c: Checking extent block allocation chains Pass 1: Checking inodes and blocks. Duplicate clusters detected. Pass 1b will be run Running additional passes to resolve clusters claimed by more than one inode... Pass 1b: Determining ownership of multiply-claimed clusters Pass 1c: Determining the names of inodes owning multiply-claimed clusters Pass 1d: Reconciling multiply-claimed clusters Cluster 161335 is claimed by the following inodes: /asterisk/extensions.conf /moh/macroform-cold_day.wav [DUP_CLUSTERS_CLONE] Inode /asterisk/extensions.conf may be cloned or deleted to break the claim it has on its clusters.
Clone inode /asterisk/extensions.conf to break claims on clusters it shares with other inodes? y [DUP_CLUSTERS_CLONE] Inode /moh/macroform-cold_day.wav may be cloned or deleted to break the claim it has on its clusters. Clone inode /moh/macroform-cold_day.wav to break claims on clusters it shares with other inodes? y Pass 2: Checking directory entries. [DIRENT_INODE_FREE] Directory entry 'musiconhold.conf' refers to inode number 35348 which isn't allocated, clear the entry? y Pass 3: Checking directory connectivity. [LOSTFOUND_MISSING] /lost+found does not exist. Create it so that we can possibly fill it with orphaned inodes? y Pass 4a: checking for orphaned inodes Pass 4b: Checking inodes link counts. [INODE_COUNT] Inode 96783 has a link count of 1 on disk but directory entry references come to 2. Update the count on disk to match? y [INODE_NOT_CONNECTED] Inode 96784 isn't referenced by any directory entries. Move it to lost+found? y [INODE_NOT_CONNECTED] Inode 96785 isn't referenced by any directory entries. Move it to lost+found? y [INODE_NOT_CONNECTED] Inode 96794 isn't referenced by any directory entries. Move it to lost+found? y [INODE_NOT_CONNECTED] Inode 96796 isn't referenced by any directory entries. Move it to lost+found? y All passes succeeded. Slot 0's journal dirty flag removed Slot 1's journal dirty flag removed [root@ca-test92 ocfs2]# fsck.ocfs2 -fy /dev/loop0 fsck.ocfs2 1.6.3 Checking OCFS2 filesystem in /dev/loop0: Label: AsteriskServer UUID: 3A791AB36DED41008E58CEF52EBEEFD3 Number of blocks: 592384 Block size: 4096 Number of clusters: 592384 Cluster size: 4096 Number of slots:2 /dev/loop0 was run with -f, check forced. Pass 0a: Checking cluster allocation chains Pass 0b: Checking inode allocation chains Pass 0c: Checking extent block allocation chains Pass 1: Checking inodes and blocks. Pass 2: Checking directory entries. Pass 3: Checking directory connectivity. Pass 4a: checking for orphaned inodes Pass 4b: Checking inodes link counts. 
All passes succeeded.
Re: [Ocfs2-users] OCFS2 and db_block_size
We talk about this in the user's guide. 1. Always use a 4K block size. 2. Never set the cluster size less than the database block size. Having a smaller cluster size could mean that a db block may not be contiguous, and you don't want that for performance and other reasons. Having a still larger cluster size is an easy way to ensure the files are contiguous. Contiguity can only help performance. On 11/14/2011 03:35 PM, Pravin K Patil wrote: Hi All, Is there a benchmark study done on different block sizes of ocfs2 and the corresponding db_block_size, and its impact on read/write? Similarly, is there any study done for the cluster size of ocfs2 and the corresponding db_block_size, and its impact on read/write? For example, if the db_block_size is 8K and we have an ocfs2 cluster size of 4K, will it have any performance impact? In other words, if we make the cluster size of the file systems on which the data files are located 8K, will it improve performance? If so, is it for read or write? Looking for actual experience on the settings of ocfs2 block size, cluster size and db_block_size correlation. Regards, Pravin
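Since the cluster is the minimum allocation unit, the effect behind rule 2 can be illustrated with a little shell arithmetic (the sizes below are made up for illustration): every file's on-disk footprint is its size rounded up to a whole number of clusters.

```shell
cluster=8192        # 8K cluster size, matching an 8K db_block_size
filesize=12000      # a 12000-byte file
# Round up to a whole number of clusters:
allocated=$(( (filesize + cluster - 1) / cluster * cluster ))
echo "$allocated"   # prints 16384: two 8K clusters
```

With cluster size at least the db block size, each db block sits inside one contiguous cluster; the cost is this rounding overhead on small files.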
Re: [Ocfs2-users] dlm locking
Do: fsck.ocfs2 -f /dev/... Without -f, it only replays the journal. On 11/09/2011 05:49 PM, Nick Khamis wrote: Hello Sunil, This is only on the prototype, so it's not crucial; however, it would be nice to figure out why, for future reference: fsck.ocfs2 /dev/drbd0 fsck.ocfs2 1.6.4 Checking OCFS2 filesystem in /dev/drbd0: Label: AsteriskServer UUID: 3A791AB36DED41008E58CEF52EBEEFD3 Number of blocks: 592384 Block size: 4096 Number of clusters: 592384 Cluster size: 4096 Number of slots: 2 /dev/drbd0 is clean. It will be checked after 20 additional mounts. I can mount it and write to it just fine (read and write). It's just when I start the application that reads from the filesystem (I don't think there is any writing going on) that it goes into read-only mode... It used to work; other than the update to 1.6.4, I am not sure what I have changed. Not quite sure what kind of information you would need to help figure out the problem? Cheers, Nick.
Re: [Ocfs2-users] dlm locking
The ro issue was different. It appears the volume has more problems. If you want me to look at the issue, I'll need an image of the volume. # o2image /dev/device /tmp/o2image.out On 11/10/2011 01:55 PM, Nick Khamis wrote: Hello Sunil, Thank you so much for your time, and I do not want to take any more of it. I ran fsck with -f and have the following: fsck.ocfs2 -f /dev/drbd0 fsck.ocfs2 1.6.4 Checking OCFS2 filesystem in /dev/drbd0: Label: ASTServer UUID: 3A791AB36DED41008E58CEF52EBEEFD3 Number of blocks: 592384 Block size: 4096 Number of clusters: 592384 Cluster size: 4096 Number of slots: 2 /dev/drbd0 was run with -f, check forced. Pass 0a: Checking cluster allocation chains Pass 0b: Checking inode allocation chains Pass 0c: Checking extent block allocation chains Pass 1: Checking inodes and blocks. Duplicate clusters detected. Pass 1b will be run Running additional passes to resolve clusters claimed by more than one inode... Pass 1b: Determining ownership of multiply-claimed clusters pass1b: Inode type does not contain extents while processing inode 5 fsck.ocfs2: Inode type does not contain extents while performing pass 1 Not sure if the read-only is due to the detected duplicate? Thanks in advance, Nick.
Re: [Ocfs2-users] mixing ocfs2 versions in a cluster
I would recommend upgrading all the nodes to 1.2.9, as it contains fixes for known bugs in the versions you are running. Mixing versions is never recommended, mainly because it is hard to test all possible combinations. It is alright to do so on an interim basis, but never recommended as a stable setup. On 11/09/2011 10:53 AM, Shashank wrote: Can you mix ocfs2 versions in a cluster? E.g. I have 4 nodes in a cluster: two nodes with version 1.2.7-1.el4 and the other two with 1.2.5-6. Thanks, Vik
Re: [Ocfs2-users] dlm locking
This has nothing to do with the dlm. The error states that the fs encountered a bad inode on disk: possible disk corruption. On encountering it, the fs goes read-only and asks the user to run fsck. On 11/09/2011 11:51 AM, Nick Khamis wrote: Hello Everyone, For the first time I experienced a dlm lock: [ 9721.831813] OCFS2 DLM 1.5.0 [ 9721.917032] ocfs2: Registered cluster interface o2cb [ 9722.170848] OCFS2 DLMFS 1.5.0 [ 9722.179018] OCFS2 User DLM kernel interface loaded [ 9755.743195] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1 [ 9755.852798] ocfs2: Mounting device (147,0) on (node 1, slot 0) with ordered data mode. [ 9783.240424] block drbd0: Handshake successful: Agreed network protocol version 91 [ 9783.242922] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC [ 9783.243074] block drbd0: conn( WFConnection - WFReportParams ) [ 9783.243205] block drbd0: Starting asender thread (from drbd0_receiver [4390]) [ 9783.271014] block drbd0: data-integrity-alg: not-used [ 9783.271298] block drbd0: drbd_sync_handshake: [ 9783.271318] block drbd0: self 964FFEDA732A512B:0ABD16D2597E52D9:54E3AEC293CEDC7E:120384BD0E3A5705 bits:3 flags:0 [ 9783.271342] block drbd0: peer B4C81B0FD76EFAC2:0ABD16D2597E52D9:54E3AEC293CEDC7F:120384BD0E3A5705 bits:0 flags:0 [ 9783.271364] block drbd0: uuid_compare()=100 by rule 90 [ 9783.271380] block drbd0: Split-Brain detected, 1 primaries, automatically solved. Sync from this node [ 9783.271417] block drbd0: peer( Unknown - Secondary ) conn( WFReportParams - WFBitMapS ) [ 9783.399967] block drbd0: peer( Secondary - Primary ) [ 9783.515979] block drbd0: conn( WFBitMapS - SyncSource ) pdsk( Outdated - Inconsistent ) [ 9783.522521] block drbd0: Began resync as SyncSource (will sync 12 KB [3 bits set]). [ 9783.629758] block drbd0: Implicitly set pdsk Inconsistent!
[ 9783.799387] block drbd0: Resync done (total 1 sec; paused 0 sec; 12 K/sec) [ 9783.799956] block drbd0: conn( SyncSource - Connected ) pdsk( Inconsistent - UpToDate ) [ 9795.430801] o2net: accepted connection from node astdrbd2 (num 2) at 192.168.2.111: [ 9800.231650] ocfs2_dlm: Node 2 joins domain 3A791AB36DED41008E58CEF52EBEEFD3 [ 9800.231668] ocfs2_dlm: Nodes in domain (3A791AB36DED41008E58CEF52EBEEFD3): 1 2 [ 9861.922744] OCFS2: ERROR (device drbd0): ocfs2_validate_inode_block: Invalid dinode #35348: OCFS2_VALID_FL not set [ 9861.927278] File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted. [ 9861.928231] (8009,0):ocfs2_read_locked_inode:496 ERROR: status = -22 Not sure where to start, but with your appreciated help I am sure we can get it resolved. Thanks in advance, Nick.
Re: [Ocfs2-users] Error building ocfs2-tools
On 10/27/2011 07:10 PM, Tim Serong wrote: Damn. It was in Pacemaker's include/crm/ais.h, back before June 27 last year(!), when it was moved to Pacemaker's configure.ac: https://github.com/ClusterLabs/pacemaker/commit/8e939b0ad779c65d445e2fa150df1cc046428a93#include/crm/ais.h This means it probably no longer appears in any of Pacemaker's public (devel package) header files, which explains the compile error. I did some more digging, and we (SUSE) presumably never had this problem because we've been carrying the attached patch for rather a long time. It replaces CRM_SERVICE (a relatively uninteresting number) with a somewhat more useful string literal... I thought the O2CB OCF RA was always provided by either pacemaker (or, on SUSE at least, in ocfs2-tools), but was never included in the upstream ocfs2-tools source tree? I thought we had checked in all the pacemaker-related patches. Are we missing something? The O2CB OCF RA is this thing: https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/o2cb It's the (better/stronger/faster :)) equivalent of the o2cb init script, which you use when OCFS2 is under Pacemaker's control. There's (IMO) a good argument for having OCF RAs included with the project they're intended for use with (all code pertaining to the operation of some program lives in one place). OTOH, there's another argument for having them included in the generic resource-agents or pacemaker package (Pacemaker and RHCS probably being the only things that actually use OCF RAs). I suspect the RA was either never submitted to ocfs2-tools, or was never accepted (I don't know which; I wasn't involved when it was originally written). So I am checking in the patch with your sign-off. I hope that is ok with you.
Re: [Ocfs2-users] Error building ocfs2-tools
ocfs2-tools-1.4.4 is too old. Build 1.6.4. The source tarball is on oss.oracle.com. On 10/27/2011 12:45 PM, Nick Khamis wrote: Hello Everyone, I am building ocfs2-tools from source. I modified ocfs2_controld/Makefile to point to the correct pacemaker 1.1.6 headers: PCMK_INCLUDES = -I/usr/include/pacemaker -I/usr/include/heartbeat -I/usr/include/libxml2 $(GLIB_CFLAGS) However, for some reason I am getting: pacemaker.c: In function setup_stack: pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this function) pacemaker.c:158: error: (Each undeclared identifier is reported only once pacemaker.c:158: error: for each function it appears in.) make[1]: *** [pacemaker.o] Error 1 make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.4.4/ocfs2_controld' The config I am using: ./configure --sbindir=/sbin --bin=/bin --libdir=/usr/lib --sysconfdir=/etc --datadir=/etc/ocfs2 --sharedstatedir=/var/ocfs2 --libexecdir=/usr/libexec --localstatedir=/var --mandir=/usr/man --enable-dynamic-fsck --enable-dynamic-ctl Thanks in advance, Nick.
Re: [Ocfs2-users] Error building ocfs2-tools
I don't remember that resource. If it did exist, it would have existed in pacemaker. ocfs2-tools does not carry any pacemaker bits. It carries bits that allow it to work with pacemaker and cman. On 10/27/2011 02:27 PM, Nick Khamis wrote: Hello Sunil, Thank you so much for your response. I just downloaded 1.6 and had to add the following to pacemaker.c: #define PCMK_SERVICE_ID 9 line 158: log_error("Connection to our AIS plugin (%d) failed", PCMK_SERVICE_ID); to avoid: pacemaker.c: In function setup_stack: pacemaker.c:158: error: PCMK_SERVICE_ID undeclared (first use in this function) pacemaker.c:158: error: (Each undeclared identifier is reported only once pacemaker.c:158: error: for each function it appears in.) make[1]: *** [pacemaker.o] Error 1 make[1]: Leaving directory `/usr/local/src/ocfs2-tools-1.6.4/ocfs2_controld' make: *** [ocfs2_controld] Error 2 Not sure if that was the right thing to do? On a slightly unrelated note: there used to be a pacemaker OCF resource agent script included for o2cb, o2cb.ocf. I take it this is now only provided by pacemaker? Cheers, Nick.
Re: [Ocfs2-users] Error building ocfs2-tools
On 10/27/2011 05:26 PM, Tim Serong wrote: That ought to work... But where did PCMK_SERVICE_ID come from in that context? AFAICT it's always been CRM_SERVICE there. See current head: http://oss.oracle.com/git/?p=ocfs2-tools.git;a=blob;f=ocfs2_controld/pacemaker.c;hb=HEAD#l158 CRM_SERVICE is then mapped back to PCMK_SERVICE_ID in pacemaker's include/crm/ais.h: https://github.com/ClusterLabs/pacemaker/blob/master/include/crm/ais.h#L54 Where is PCMK_SERVICE_ID defined? This question has come up more than once. I thought the O2CB OCF RA was always provided by either pacemaker (or, on SUSE at least, in ocfs2-tools), but was never included in the upstream ocfs2-tools source tree? I thought we had checked in all the pacemaker-related patches. Are we missing something?
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
I think it stops by uuid. So try doing this the next time. You are encountering some issue that we have not seen before. ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2 On 10/23/2011 05:32 AM, Laurentiu Gosu wrote: Hi Sunil, Sorry for my late reply; I just had time today to start from scratch and test. I rebuilt my environment (2 nodes connected to a SAN via iSCSI+multipath). I still have the issue that the heartbeat is active after I umount my ocfs2 volume. /etc/init.d/o2cb stop Stopping O2CB cluster CLUST: Failed Unable to stop cluster as heartbeat region still active ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs After I manually kill the ref (ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2), I can stop o2cb successfully. I can live with that, but why doesn't it stop automatically? As I understand it, the heartbeat should be started and stopped when the volume gets mounted/umounted. br, Laurentiu. On 10/19/2011 02:28, Sunil Mushran wrote: Manual delete will only work if there are no references. In your case there are references. You may want to start both nodes from scratch. Do not start/stop heartbeat manually. Also, do not force-format. On 10/18/2011 03:54 PM, Laurentiu Gosu wrote: OK, I rebooted one of the nodes (both had similar issues). But something is still fishy.
- I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/ - I unmounted it: umount /mnt/tmp/ - I tried to stop o2cb: /etc/init.d/o2cb stop Stopping O2CB cluster CLUSTER: Failed Unable to stop cluster as heartbeat region still active - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs - ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/ /sys/kernel/config/cluster/CLUSTER/heartbeat/: total 0 drwxr-xr-x 2 root root 0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D: total 0 -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev -r--r--r-- 1 root root 4096 Oct 19 01:50 pid -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block - I cannot manually delete /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/ PS: I'm going to sleep now; I have to be up in a few hours. We can continue tomorrow if it's OK with you. Thank you for your help. Laurentiu. On 10/19/2011 01:33, Sunil Mushran wrote: One way this can happen is if one starts the hb manually and then force-formats that volume. The format will generate a new uuid. Once that happens, the hb tool cannot map the region to the device and thus fails to stop it. Right now the easiest option on this box is resetting it. On 10/18/2011 03:24 PM, Laurentiu Gosu wrote: Yes, I did reformat it (even more than once, I think, last week). This is a pre-production system and I'm trying various options before moving into real life. On 10/19/2011 01:19, Sunil Mushran wrote: Did you reformat the volume recently? Or, when did you format last?
On 10/18/2011 03:13 PM, Laurentiu Gosu wrote: Well... this is weird: ls /sys/kernel/config/cluster/CLUSTER/heartbeat/ 918673F06F8F4ED188DDCE14F39945F6 dead_threshold Looks like we have different UUIDs. Where is this coming from?? ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6 918673F06F8F4ED188DDCE14F39945F6: 1 refs On 10/19/2011 01:04, Sunil Mushran wrote: Let's do it by hand. rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:52 PM, Laurentiu Gosu wrote: ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat No improvement :( On 10/19/2011 00:50, Sunil Mushran wrote: See if this cleans it up. ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:44 PM, Laurentiu Gosu wrote: ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs On 10/19/2011 00:43, Sunil Mushran wrote: ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:40 PM, Laurentiu Gosu wrote: mounted.ocfs2 -d Device FS Stack UUID Label /dev/mapper/volgr1-lvol0 ocfs2 o2cb 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2 mounted.ocfs2 -f Device FS Nodes /dev/mapper/volgr1-lvol0 ocfs2 ro02xsrv001 ro02xsrv001 = the other node in the cluster. By the way, there is no /dev/md-2 ls /dev/dm-* /dev/dm-0 /dev/dm-1 On 10/19/2011 00:37, Sunil Mushran wrote: So it is not mounted. But we still have a hb thread because hb could not be stopped during umount. The reason for that could be the same one that causes ocfs2_hb_ctl to fail. Do: mounted.ocfs2
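The stale-region situation in this thread (on-disk UUID differs from the registered configfs heartbeat region after a reformat) can be checked side by side. A minimal sketch using the UUIDs from this thread; the live commands are shown as comments, since they need a running o2cb cluster:

```shell
# On-disk UUID:               mounted.ocfs2 -d
# Registered hb regions:      ls /sys/kernel/config/cluster/*/heartbeat/
disk_uuid="0C4AB55FE9314FA5A9F81652FDB9B22D"    # from mounted.ocfs2 -d
cfg_uuid="918673F06F8F4ED188DDCE14F39945F6"     # from the configfs listing
if [ "$disk_uuid" != "$cfg_uuid" ]; then
    echo "stale heartbeat region: $cfg_uuid"
    # Stop the stale region by its own UUID:  ocfs2_hb_ctl -K -u "$cfg_uuid"
fi
```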
Re: [Ocfs2-users] OCFS2 slow with multiple writes
Because in this case the cluster lock may be waiting for the journal commit to complete. It depends on where the file is being created, what internal metadata blocks need to be locked, etc. Your dd is not a simple write. It is a create + allocation + write. If the file already exists, then the data extents will first be truncated too. On 10/21/2011 03:27 AM, Prakash Velayutham wrote: Hi Sunil, Thanks for the response. Do you mean OCFS2 is blocking writes from multiple clients? Is that how OCFS2 works? I can understand that writing the (2) 20G files might take longer with the ordered option, as data needs to be flushed to the FS before journal commit, but why is that blocking a new, separate file from being written to the file system? Regards, Prakash On Oct 20, 2011, at 6:25 PM, Sunil Mushran wrote: Use writeback. Ordered data requires the data to be flushed before journal commit. And flushing 40G takes time. mount -t ocfs2 -o data=writeback DEVICE PATH On 10/20/2011 03:05 PM, Prakash Velayutham wrote: Hi, OS - SLES 11.1 with HAE OCFS2 - 1.4.3-0.16.7 Cluster stack - Pacemaker I have a Heartbeat Filesystem monitor that monitors the OCFS2 file system for availability. This monitor kicks in every minute and tries to write a file using dd as below. dd of=/var/lib/mysql/data1/.Filesystem_status/default_bmimysqlp3 oflag=direct,sync bs=512 conv=fsync,sync If the OCFS2 file system is busy, like when I try to create 2 large files (20GB each) in the OCFS2 directory, I see that the above monitor process hangs until the 2 files are created. But this causes Pacemaker to fence the node, as the RA is configured for a timeout of 45 secs and the 2 file creations do take more than that. The OCFS2 file system is mounted as below. /dev/mapper/bmimysqlp3_p4_vol1 on /var/lib/mysql/data1 type ocfs2 (rw,_netdev,nointr,data=ordered,cluster_stack=pcmk) Is there something wrong with the file system itself that a small file creation hangs like that? Please let me know if you need any more information.
Thanks, Prakash ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
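For readers wanting to try the suggestion above, a minimal sketch of switching an OCFS2 volume from data=ordered to data=writeback. The mount point below is a placeholder assumption, not the poster's real path; the helper function merely illustrates which journaling mode a given options string implies.

```shell
# Sketch: move an OCFS2 mount to data=writeback journaling.
# MNT is an assumed placeholder; adjust to your environment.
ocfs2_data_mode() {
    # $1 = mount options string, e.g. "rw,_netdev,data=ordered"
    case ",$1," in
        *,data=writeback,*) echo writeback ;;
        *) echo ordered ;;  # ordered is the ocfs2 default
    esac
}

MNT=/mnt/ocfs2  # assumption: your mount point
if grep -qs " $MNT ocfs2 " /proc/mounts; then
    # remount in place rather than a full umount/mount cycle
    mount -o remount,data=writeback "$MNT"
else
    echo "skip: $MNT is not an ocfs2 mount on this host"
fi
```

Note the trade-off: writeback avoids flushing data before each journal commit, so the monitor's small write is less likely to stall behind a 40G flush, at the cost of weaker ordering guarantees after a crash.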
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
ls -lR /sys/kernel/config/cluster What does this return? On 10/18/2011 02:05 PM, Laurentiu Gosu wrote: Hi, I have a 2-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5, ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5. My problem is that every time I try to run /etc/init.d/o2cb stop, it fails with this error: Stopping O2CB cluster CLUSTER: Failed Unable to stop cluster as heartbeat region still active There is no active mount point. I tried to manually stop the heartbeat with ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2 (after finding the refs number with ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0 ). But even if the refs number is zero, the "heartbeat region still active" error still occurs. How can I fix this? Thank you in advance. Laurentiu. ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
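A quick way to gather what that listing shows is a small script that prints the active heartbeat regions registered in configfs alongside the devices currently mounted as ocfs2. The configfs paths follow this thread; the cluster name CLUSTER is from this thread's configuration and should be treated as an assumption elsewhere.

```shell
# Sketch: cross-check active O2CB heartbeat regions against ocfs2 mounts.
hb_regions() {
    # print region UUIDs under a cluster's heartbeat dir, if any
    d="/sys/kernel/config/cluster/$1/heartbeat"
    [ -d "$d" ] || return 0
    for r in "$d"/*/; do
        [ -d "$r" ] && basename "$r"
    done
}

mounted_ocfs2() {
    # devices currently mounted as ocfs2, per /proc/mounts
    awk '$3 == "ocfs2" { print $1 }' /proc/mounts 2>/dev/null
}

echo "regions: $(hb_regions CLUSTER | tr '\n' ' ')"
echo "mounts:  $(mounted_ocfs2 | tr '\n' ' ')"
```

A region UUID with no corresponding ocfs2 mount is exactly the symptom this thread chases down: a heartbeat left running after the filesystem was unmounted.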
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
What does this return? cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev Also, do: ls -lR /sys/kernel/debug/ocfs2 ls -lR /sys/kernel/debug/o2dlm On 10/18/2011 02:14 PM, Laurentiu Gosu wrote: Here is the output: ls -lR /sys/kernel/config/cluster /sys/kernel/config/cluster: total 0 drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER /sys/kernel/config/cluster/CLUSTER: total 0 -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method drwxr-xr-x 3 root root 0 Oct 19 00:12 heartbeat -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms drwxr-xr-x 4 root root 0 Oct 11 20:23 node -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms /sys/kernel/config/cluster/CLUSTER/heartbeat: total 0 drwxr-xr-x 2 root root 0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6 -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6: total 0 -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev -r--r--r-- 1 root root 4096 Oct 19 00:12 pid -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block /sys/kernel/config/cluster/CLUSTER/node: total 0 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001 drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002 /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001: total 0 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port -rw-r--r-- 1 root root 4096 Oct 19 00:12 local -rw-r--r-- 1 root root 4096 Oct 19 00:12 num /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002: total 0 -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port -rw-r--r-- 1 root root 4096 Oct 19 00:12 local -rw-r--r-- 1 root root 4096 Oct 19 00:12 num On 10/19/2011 00:12, Sunil Mushran wrote: ls -lR /sys/kernel/config/cluster What does this
return?
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
mount -t debugfs debugfs /sys/kernel/debug Then list that dir. Also, do: ocfs2_hb_ctl -I -d /dev/dm-2 Be careful before killing. We want to be sure that dev is not mounted. On 10/18/2011 02:23 PM, Laurentiu Gosu wrote: Again the outputs: cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev dm-2 ---here should be volgr1-lvol0 I guess? ls -lR /sys/kernel/debug/ocfs2 ls: /sys/kernel/debug/ocfs2: No such file or directory ls -lR /sys/kernel/debug/o2dlm ls: /sys/kernel/debug/o2dlm: No such file or directory I think I have to enable debug first somehow..? Laurentiu. On 10/19/2011 00:17, Sunil Mushran wrote: What does this return? cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev Also, do: ls -lR /sys/kernel/debug/ocfs2 ls -lR /sys/kernel/debug/o2dlm
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
So it is not mounted. But we still have a hb thread because hb could not be stopped during umount. The reason for that could be the same that causes ocfs2_hb_ctl to fail. Do: mounted.ocfs2 -d On 10/18/2011 02:32 PM, Laurentiu Gosu wrote: ls -lR /sys/kernel/debug/ocfs2 /sys/kernel/debug/ocfs2: total 0 ls -lR /sys/kernel/debug/o2dlm /sys/kernel/debug/o2dlm: total 0 ocfs2_hb_ctl -I -d /dev/dm-2 ocfs2_hb_ctl: Device name specified was not found while reading uuid There is no /dev/dm-2 mounted.
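The caution above ("be sure that dev is not mounted" before killing heartbeat) can be scripted. A hedged sketch that refuses when the device appears in /proc/mounts and otherwise only prints the kill command rather than running it; the device names used are placeholders:

```shell
# Sketch: guard an ocfs2_hb_ctl -K against a still-mounted device.
safe_hb_kill() {
    dev="$1"
    # found=1 if the device is the source of any current mount
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' /proc/mounts; then
        echo "refusing: $dev is mounted"
        return 1
    fi
    # sketch only: print the command instead of executing it
    echo "ocfs2_hb_ctl -K -d $dev"
}

safe_hb_kill /dev/example0   # placeholder device name
```

Killing heartbeat out from under a mounted volume would look to the other nodes like a dead node, so a check like this is cheap insurance.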
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:40 PM, Laurentiu Gosu wrote: mounted.ocfs2 -d Device FS Stack UUID Label /dev/mapper/volgr1-lvol0 ocfs2 o2cb 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2 mounted.ocfs2 -f Device FS Nodes /dev/mapper/volgr1-lvol0 ocfs2 ro02xsrv001 ro02xsrv001 = the other node in the cluster. By the way, there is no /dev/dm-2 ls /dev/dm-* /dev/dm-0 /dev/dm-1
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
See if this cleans it up. ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:44 PM, Laurentiu Gosu wrote: ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs On 10/19/2011 00:43, Sunil Mushran wrote: ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
Let's do it by hand. rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D On 10/18/2011 02:52 PM, Laurentiu Gosu wrote: ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat No improvement :( On 10/19/2011 00:50, Sunil Mushran wrote: See if this cleans it up. ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
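The by-hand cleanup relies on configfs semantics: a heartbeat region is a configfs object represented as a directory, and it is torn down by deleting that directory (rmdir is the conventional tool for configfs objects). A sketch, where the cluster name and UUID follow this thread and should be adjusted to the region that is actually listed on your box:

```shell
# Sketch: drop a stale heartbeat region by removing its configfs dir.
remove_hb_region() {
    if [ -d "$1" ]; then
        rmdir "$1" && echo "removed $1"
    else
        echo "no such region: $1"
    fi
}

remove_hb_region /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
```

Only do this after confirming, as earlier in the thread, that no node has the volume mounted against that region.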
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
Did you reformat the volume recently? or, when did you format last? On 10/18/2011 03:13 PM, Laurentiu Gosu wrote: well..this is weird ls /sys/kernel/config/cluster/CLUSTER/heartbeat/ 918673F06F8F4ED188DDCE14F39945F6 dead_threshold looks like we have different UUIDs. Where is this coming from?? ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6 918673F06F8F4ED188DDCE14F39945F6: 1 refs On 10/19/2011 01:04, Sunil Mushran wrote: Let's do it by hand. rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
One way this can happen is if one starts the hb manually and then force-formats that volume. The format will generate a new uuid. Once that happens, the hb tool cannot map the region to the device and thus fails to stop it. Right now the easiest option on this box is resetting it.

On 10/18/2011 03:24 PM, Laurentiu Gosu wrote: Yes, I did reformat it (even more than once, I think, last week). This is a pre-production system and I'm trying various options before moving into real life.

On 10/19/2011 01:19, Sunil Mushran wrote: Did you reformat the volume recently? Or, when did you format last?

On 10/18/2011 03:13 PM, Laurentiu Gosu wrote: Well.. this is weird:
ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
918673F06F8F4ED188DDCE14F39945F6 dead_threshold
Looks like we have different UUIDs. Where is this coming from??
ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
918673F06F8F4ED188DDCE14F39945F6: 1 refs

On 10/19/2011 01:04, Sunil Mushran wrote: Let's do it by hand. rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
No improvement :(

On 10/19/2011 00:50, Sunil Mushran wrote: See if this cleans it up. ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs

On 10/19/2011 00:43, Sunil Mushran wrote: ocfs2_hb_ctl -l -u 0C4AB55FE9314FA5A9F81652FDB9B22D

On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
mounted.ocfs2 -d
Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
mounted.ocfs2 -f
Device                    FS     Nodes
/dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
ro02xsrv001 = the other node in the cluster.
By the way, there is no /dev/dm-2:
ls /dev/dm-*
/dev/dm-0 /dev/dm-1

On 10/19/2011 00:37, Sunil Mushran wrote: So it is not mounted. But we still have a hb thread because hb could not be stopped during umount. The reason for that could be the same that causes ocfs2_hb_ctl to fail. Do: mounted.ocfs2 -d

On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
ls -lR /sys/kernel/debug/ocfs2
/sys/kernel/debug/ocfs2:
total 0
ls -lR /sys/kernel/debug/o2dlm
/sys/kernel/debug/o2dlm:
total 0
ocfs2_hb_ctl -I -d /dev/dm-2
ocfs2_hb_ctl: Device name specified was not found while reading uuid
There is no /dev/dm-2 mounted.
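The mismatch being chased in this thread is between the UUID that mounted.ocfs2 reports for the volume and the heartbeat region directory registered in configfs. A minimal sketch of that comparison, parsing the mounted.ocfs2 -d output format quoted above (the sample table is copied from this thread; the configfs path assumes the cluster is named CLUSTER):

```shell
# Pull the UUID column out of `mounted.ocfs2 -d` output so it can be
# compared against the region directories registered under configfs.
extract_uuid() {
  # Rows look like: Device  FS  Stack  UUID  Label; skip the header row.
  awk '$2 == "ocfs2" { print $4 }'
}

sample='Device                    FS     Stack  UUID                              Label
/dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2'

uuid=$(printf '%s\n' "$sample" | extract_uuid)
echo "$uuid"
# On a live node you would then check for a stale region with:
#   ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
```

A region directory whose name does not match the UUID reported by mounted.ocfs2 -d is exactly the stale-region symptom discussed above (here: a volume that was reformatted, getting a new UUID, while the old region stayed registered).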
Re: [Ocfs2-users] Unable to stop cluster as heartbeat region still active
Manual delete will only work if there are no references. In your case there are references. You may want to start both nodes from scratch. Do not start/stop heartbeat manually. Also, do not force-format.

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote: OK, I rebooted one of the nodes (both had similar issues). But something is still fishy.
- I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- I unmounted it: umount /mnt/tmp/
- Tried to stop o2cb: /etc/init.d/o2cb stop
  Stopping O2CB cluster CLUSTER: Failed
  Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
  0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
- ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
  ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
  /sys/kernel/config/cluster/CLUSTER/heartbeat/:
  total 0
  drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
  -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
  /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
  total 0
  -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
  -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
  -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
  -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
  -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
- I cannot manually delete /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/

PS: I'm going to sleep now, I have to be up in a few hours. We can continue tomorrow if it's ok with you. Thank you for your help. Laurentiu.
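The sequence Laurentiu ran above (umount, check the region's references, then stop o2cb) can be scripted defensively: only attempt the cluster stop once the heartbeat region reports zero references. A sketch, parsing the "UUID: N refs" format shown in this thread (the sample string is copied from the messages above):

```shell
# Parse "ocfs2_hb_ctl -I -u <uuid>" output, which looks like:
#   0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
refs_of() {
  awk -F': ' '{ sub(/ refs$/, "", $2); print $2 }'
}

out='0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs'
refs=$(printf '%s\n' "$out" | refs_of)

if [ "$refs" -eq 0 ]; then
  echo "safe to run: /etc/init.d/o2cb stop"
else
  echo "region still holds $refs ref(s); find the holder before stopping"
fi
```

With a non-zero refcount, as in this thread, stopping the cluster will keep failing with "heartbeat region still active" until the holder goes away or the node is rebooted.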
Re: [Ocfs2-users] Partition table crash, where can I find debug message?
Not sure what you mean by a partition table crash. Is it that someone overwrote the partition table on the iscsi server? That's what it looks like. If mount cannot detect the fs type, then it means at least superblock corruption. And such corruptions are typically caused by external entities. A stray dd, perhaps. Did you try recovering the superblock using one of the backups? fsck.ocfs2 -r [1-6] /dev/sdX ?

On 10/11/2011 07:04 PM, Frank Zhang wrote: Hi Experts, recently I observed a partition table crash that really scared me. I have two OVM servers sharing OCFS2 over iscsi. After running a bunch of VMs for a while, all VMs were gone and I saw the mount points of OCFS2 gone on both hosts. Then I tried to mount it again, and the mount of the iscsi device failed, saying "please specify filesystem type". I checked dmesg but there is nothing useful except:
SCSI device sdc: drive cache: write back
sdc: unknown partition table
sd 2:0:0:1: Attached scsi disk sdc
sd 2:0:0:1: Attached scsi generic sg3 type 0
OCFS2 Node Manager 1.4.4
OCFS2 DLM 1.4.4
OCFS2 DLMFS 1.4.4
OCFS2 User DLM kernel interface loaded
connection1:0: detected conn error (1011)
Basically, after logging into the iSCSI device on both hosts, I created soft links of /dev/ovm_iscsi1 pointing to the device node under /dev/disk/by-path/real_isci_device, then I formatted /dev/ovm_iscsi1 to OCFS2 and mounted them (of course I configured /etc/ocfs2/cluster.conf and made o2cb start correctly). Could somebody tell me where to get more debug info to trace the problem? This is really scary considering I may lose all my VMs because of the silent crash. And is there any way to recover the partition table? Thanks

___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
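The -r [1-6] slots in that fsck.ocfs2 invocation select ocfs2's backup superblocks, which the OCFS2 user documentation places at 1 GB, 4 GB, 16 GB, 64 GB, 256 GB and 1 TB offsets; a volume only carries the backups that fit inside its size. A sketch of how the slot numbers map to offsets ($DEV is a hypothetical placeholder, and -n does a read-only check first):

```shell
# Backup superblock slot n sits at 4^(n-1) GB into the volume:
# slot 1 -> 1 GB, slot 2 -> 4 GB, ... slot 6 -> 1024 GB (1 TB).
backup_offset_gb() {
  echo $(( 4 ** ($1 - 1) ))
}

backup_offset_gb 1   # 1
backup_offset_gb 6   # 1024

# Recovery attempt sketch, dry-run first (DEV is a placeholder device):
#   for r in 1 2 3 4 5 6; do fsck.ocfs2 -n -r "$r" "$DEV" && break; done
```

So a 100 GB volume, for instance, would only have backups in slots 1-3; trying higher slots fails harmlessly.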
Re: [Ocfs2-users] Partition table crash, where can I find debug message?
Hard to say. You'll need to investigate the extent of the crash.

On 10/12/2011 10:49 AM, Frank Zhang wrote: Sorry, it's not a power outage, it's just a normal reboot. Is this serious enough to corrupt the super block?

From: Frank Zhang
Sent: Wednesday, October 12, 2011 10:37 AM
To: 'Sunil Mushran'
Cc: 'ocfs2-users@oss.oracle.com'
Subject: RE: [Ocfs2-users] Partition table crash, where can I find debug message?

Thanks Sunil. Yes, the terminology should be super block corruption. I checked with my colleagues; they said the iSCSI server suffered a power outage yesterday, so they rebooted it. Given it was under heavy usage because of the many VMs running on it, I guess this may be the cause. Now I am trying to recover it.
Re: [Ocfs2-users] Partition table crash, where can I find debug message?
extent of the corruption... (not crash)

On 10/12/2011 10:51 AM, Sunil Mushran wrote: Hard to say. You'll need to investigate the extent of the crash.
Re: [Ocfs2-users] one node kernel panic
uek is a different kernel entirely. It is hard to say whether you will or will not hit it with uek mainly because the underlying code is different. On 10/06/2011 10:33 PM, Hideyasu Kojima wrote: Thank you for responding. I think UEK5 is based on RHEL5 kernel. Does the problem same as UEK5 arise? (2011/10/05 1:45), Sunil Mushran wrote: int sigprocmask(int how, sigset_t *set, sigset_t *oldset) { int error; spin_lock_irq(current-sighand-siglock); CRASH if (oldset) *oldset = current-blocked; ... } current-sighand is NULL. So definitely a race. Generic kernel issue. Ping your kernel vendor. On 10/03/2011 07:49 PM, Hideyasu Kojima wrote: Hi, I run ocfs2/drbd active-active 2node cluster. ocfs2 version is 1.4.7-1 ocfs2-tool version is 1.4.4 Linux version is RHEL 5.4 (2.6.18-164.el5 x86_64) 1 node crash with kernel panic once. What is the cause? The bottom is the analysis of vmcore. Unable to handle kernel NULL pointer dereference at 0808 RIP: [80064ae6] _spin_lock_irq+0x1/0xb PGD 187e15067 PUD 187e16067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /devices/pci:00/:00:09.0/:06:00.0/:07:00.0/irq CPU 1 Modules linked in: mptctl mptbase softdog autofs4 ipmi_devintf ipmi_si ipmi_msghandler ocfs2(U) ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs drbd(U) bonding ipv6 xfrm_nalgo crypto_api bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U) dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod cdrom sg pcspkr serio_raw hpilo bnx2(U) dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache hpahcisr(PU) ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 21924, comm: res Tainted: P 2.6.18-164.el5 #1 RIP: 0010:[80064ae6] [80064ae6] _spin_lock_irq+0x1/0xb RSP: 0018:81008b1cfae0 EFLAGS: 00010002 RAX: 810187af4040 RBX: RCX: 8101342b7b80 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808 RBP: 81008b1cfb98 R08: R09: R10: 
810075463090 R11: 88595b95 R12: 81008b1cfba8 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8 FS: () GS:810105d51840() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 0808 CR3: 000187e14000 CR4: 06e0 Process res (pid: 21924, threadinfo 81008b1ce000, task 810187af4040) Stack: 8001db30 81007f070520 885961f3 810105d39400 88596323 06ff813231393234 810075463018 810075463018 0297 81007f070520 810075463028 0246 Call Trace: [8001db30] sigprocmask+0x28/0xdb [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691 [88596323] :ocfs2:ocfs2_delete_inode+0x130/0x1691 [88581f16] :ocfs2:ocfs2_drop_lock+0x67a/0x77b [8858026a] :ocfs2:ocfs2_remove_lockres_tracking+0x10/0x45 [885961f3] :ocfs2:ocfs2_delete_inode+0x0/0x1691 [8002f49e] generic_delete_inode+0xc6/0x143 [88595c85] :ocfs2:ocfs2_drop_inode+0xf0/0x161 [8000d46e] dput+0xf6/0x114 [800e9c44] prune_one_dentry+0x66/0x76 [8002e958] prune_dcache+0x10f/0x149 [8004d66e] shrink_dcache_parent+0x1c/0xe1 [80104f8b] proc_flush_task+0x17c/0x1f6 [8008fa2c] sched_exit+0x27/0xb5 [80018024] release_task+0x387/0x3cb [80015c50] do_exit+0x865/0x911 [80049281] cpuset_exit+0x0/0x88 [8002b080] get_signal_to_deliver+0x42c/0x45a [8005ae7b] do_notify_resume+0x9c/0x7af [8008b6a2] deactivate_task+0x28/0x5f [80021f3f] __up_read+0x19/0x7f [80066b58] do_page_fault+0x4fe/0x830 [800b65b2] audit_syscall_exit+0x336/0x362 [8005d32e] int_signal+0x12/0x17 Code: f0 ff 0f 0f 88 f3 00 00 00 c3 53 48 89 fb e8 33 f5 02 00 f0 RIP [80064ae6] _spin_lock_irq+0x1/0xb RSP81008b1cfae0 crash bt PID: 21924 TASK: 810187af4040 CPU: 1 COMMAND: res #0 [81008b1cf840] crash_kexec at 800ac5b9 #1 [81008b1cf900] __die at 80065127 #2 [81008b1cf940] do_page_fault at 80066da7 #3 [81008b1cfa30] error_exit at 8005dde9 [exception RIP: _spin_lock_irq+1] RIP: 80064ae6 RSP: 81008b1cfae0 RFLAGS: 00010002 RAX: 810187af4040 RBX: RCX: 8101342b7b80 RDX: 81008b1cfb98 RSI: 81008b1cfba8 RDI: 0808 RBP: 81008b1cfb98 R8: R9: R10: 810075463090 R11: 88595b95 R12: 81008b1cfba8 R13: 81007f070520 R14: 0001 R15: 81008b1cfce8 
ORIG_RAX: CS: 0010 SS: 0018 #4 [81008b1cfae0] sigprocmask at 8001db30 #5
Re: [Ocfs2-users] Kernel Panic / Fencing
I am unclear. What happens when a server is rebooted (or crashes)? It crashes the network? Can you expand on this?

On 10/06/2011 05:52 PM, Tony Rios wrote: Hey all, I'm running a current version of Ubuntu and we are using OCFS2 across a cluster of 9 web servers. Everything works perfectly, so long as none of the servers need to be rebooted (or crash). I've done several web searches, and one of the suggestions I found was to double the heartbeat threshold. I increased ours from 31 to 61 and it doesn't appear to have helped at all. I can't imagine that if a server becomes unreachable, it is by design intended to crash the entire network. I'm hoping that someone will have some feedback here because I'm at a loss. Thanks so much, Tony
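For the threshold change Tony tried: on the o2cb stack the value lives in /etc/sysconfig/o2cb (or /etc/default/o2cb on Debian-family systems) as O2CB_HEARTBEAT_THRESHOLD, and the disk-heartbeat dead time it yields is roughly (threshold - 1) * 2 seconds, since the heartbeat iterates every two seconds. A small sketch of that arithmetic:

```shell
# A node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) missed
# two-second heartbeat iterations.
hb_dead_seconds() {
  echo $(( ($1 - 1) * 2 ))
}

hb_dead_seconds 31   # 60  (the default)
hb_dead_seconds 61   # 120 (the value Tony set)
```

Note that raising the threshold only delays fencing; it does not change the design point being probed here, namely that a node which stops heartbeating gets fenced so the rest of the cluster can make progress.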
Re: [Ocfs2-users] Fwd: OCFS drives not syncing
On 10/05/2011 08:46 AM, Bradlee Landis wrote: Sorry Sunil, my email replied to you instead of the list.

On Wed, Oct 5, 2011 at 10:09 AM, Sunil Mushran sunil.mush...@oracle.com wrote: ocfs2 is a shared-disk cluster file system. It requires a shared disk. However, if you are only going to use 2 nodes, you could use drbd, a replicating block device. To ocfs2, it appears as a shared disk. Google drbd and ocfs2 for more.

So I've been confused about this the whole time, I guess. So how is the OCFS drive shared? Is it done through OCFS, or does it require NFS? How do I access the filesystem from the other node?

The drives need to be physically shared. As in, all nodes need to be able to concurrently read and write directly to the disk. Two popular solutions are fiber channel and iscsi. A fiber channel solution could be an EMC disk array + FC switch + HBAs on all nodes hooked up to the switch. An iscsi solution could be an iscsi target running on one server with the disks. The nodes would use an iscsi initiator to access the target. The devices will show up as regular devices (/dev/sdX) on all nodes. The cheapest solution would be to use drbd.
Re: [Ocfs2-users] one node kernel panic
int sigprocmask(int how, sigset_t *set, sigset_t *oldset)
{
        int error;

        spin_lock_irq(&current->sighand->siglock);    <<< CRASH
        if (oldset)
                *oldset = current->blocked;
        ...
}

current->sighand is NULL. So definitely a race. Generic kernel issue. Ping your kernel vendor.

On 10/03/2011 07:49 PM, Hideyasu Kojima wrote: Hi, I run an ocfs2/drbd active-active 2-node cluster. ocfs2 version is 1.4.7-1, ocfs2-tools version is 1.4.4, Linux version is RHEL 5.4 (2.6.18-164.el5 x86_64). One node crashed with a kernel panic once. What is the cause?
Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list
On 09/30/2011 06:49 AM, Herman L wrote: On Thursday, September 29, 2011 2:04 PM Sunil Mushran wrote: On 09/29/2011 08:56 AM, Herman L wrote: On Wednesday, September 21, 2011 4:00 PM, Sunil Mushran wrote: On 09/21/2011 12:37 PM, Herman L wrote: On 09/19/2011 08:35 AM, Herman L wrote: Hi all, Got a couple of these messages recently, but I don't know what they mean. Can anyone let me know if I need to panic? I'm using OCFS2 compiled from the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64. Sep 19 08:07:15 server-1 kernel: [3892420.40] (10387,12):dlm_lockres_release:507 ERROR: Resource W0001b027d69b591f15 not on the Tracking list Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: W0001b027d69b591f1, owner=1, state=0 Sep 19 08:07:15 server-1 kernel: [3892420.398195] last used: 8197071325, refcnt: 0, on purge list: no Sep 19 08:07:15 server-1 kernel: [3892420.398197] on dirty list: no, on reco list: no, migrating pending: no Sep 19 08:07:15 server-1 kernel: [3892420.398198] inflight locks: 0, asts reserved: 0 Sep 19 08:07:15 server-1 kernel: [3892420.398199] refmap nodes: [ ], inflight=0 Sep 19 08:07:15 server-1 kernel: [3892420.398200] granted queue: Sep 19 08:07:15 server-1 kernel: [3892420.398200] converting queue: Sep 19 08:07:15 server-1 kernel: [3892420.398201] blocked queue: Thanks! Herman From: Sunil Mushran To: Herman L Sent: Monday, September 19, 2011 12:57 PM Subject: Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0001b027d69b591f15 not on the Tracking list I've no idea of the state of the source that you are using. The message is a warning indicating a race. While it probably did not affect the functioning, there is no guarantee that that would be the case the next time around. The closest relevant patch is over 2 years old. http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710 Thanks Sunil for responding. I know you can't easily support my setup, but anyways I checked the sources. 
Looks like the patch you mention is in the sources I compiled from ( RHEL6.0 kernel-2.6.32-71.24.1.el6.src.rpm ), so I guess the source of the problem is elsewhere. The fs/ocfs2 directory from the RHEL6 sources I compiled from is almost exactly the same as the mainline 2.6.32 kernel, except 1) It looks like they implemented the changes in aops.c from the cleanup blockdev_direct_IO locking patch that's in 2.6.33. 2) In journal.c, they rename ocfs2_commit_trigger to ocfs2_frozen_trigger, which seems to be from 2.6.35. 3) In cluster/masklog.c they add a const to the mlog_attr_ops declaration 4) And in quota.h, they are missing #define QFMT_OCFS2 3 Not sure if that helps any, but thanks in any case! All those changes are ok. And unrelated. This is a new one. Sorry, I think I accidentally wrote a message with only the quoted block... oops. Sorry. Sunil, are you able to and interested in looking at this issue? If so, is there any information that I can provide that might help? Fortunately, after those few initial days of daily errors, it seems to have stopped for now. But of course, I'm still worried about this. http://oss.oracle.com/~smushran/0001-ocfs2-dlm-Use-dlm-track_lock-when-adding-resource-to.patch This should fix it. But do note that the patch is untested. Thanks for the quick reply and patch! I'll try to test it out when I get a chance. Also, is there any way to force this error so that I can know if that patch is working? Also, now that you have a fix for this, can you make any kind of guess as to how likely or what circumstances that the unpatched OCFS2 will cause dangerous problems? Well, the first goal is always to see nothing else is breaking. That's the most important bit. As far as fixing the issue goes, only time will tell. There is no way I can think of that will definitely prove that the issue is resolved. Also, even if it does reproduce, it does not mean that this patch is bad. It could be there is another race that we have to plug. 
Depends on the definition of dangerous. If it means cluster-wide corruption, or cluster-wide outage, then no. But if it means a node crashing, then yes. Though the chance of that is fairly low.

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
On 09/27/2011 09:12 AM, Ulf Zimmermann wrote:

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, September 26, 2011 10:09 AM
To: Ulf Zimmermann
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

I'll look at the tunefs issue. But the other one does not make sense. strict_jbd is a compat flag. Mount should work. What is the mount error? As in, in dmesg.

I don't see anything in dmesg or /var/log/messages, but the error I saw was from tunefs:

demodb01 root /home/ulf # /usr/bin/yes | /sbin/tunefs.ocfs2 -U -L /export/u07 /dev/mapper/u07
tunefs.ocfs2 1.2.7
tunefs.ocfs2: Filesystem has unsupported feature(s) while opening device /dev/mapper/u07

So that is correct. In short, that flag was added to allow us to use the jbd2 features. We use this to create volumes larger than 16TB. I guess if you want to use it with 1.2, format it with the 1.2 tools.
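The compat-versus-incompat distinction Sunil describes can be sketched in a few lines. OCFS2, like the ext2 family, keeps separate feature bitmasks in the superblock: a kernel may mount a volume with unknown *compat* bits but must refuse unknown *incompat* bits, while a conservative old tool (like the 1.2 tunefs here) refuses anything it does not recognize. The flag value and function names below are illustrative, not the real on-disk constants:

```python
# Sketch of the ext2-style three-class feature check. The bit value is
# made up for the example; it is NOT the real OCFS2 on-disk constant.
COMPAT_STRICT_JBD = 0x1  # hypothetical bit for the compat flag discussed above

def can_mount(sb_compat, sb_incompat, known_compat, known_incompat):
    """A kernel may mount if it recognizes every *incompat* bit.
    Unknown *compat* bits are informational and never block a mount."""
    unknown_incompat = sb_incompat & ~known_incompat
    return unknown_incompat == 0

def old_tool_accepts(sb_compat, sb_incompat, known_compat, known_incompat):
    """A strict old tool (like tunefs.ocfs2 1.2 here) refuses any
    feature bit it does not recognize, compat or not."""
    return (sb_compat & ~known_compat) == 0 and (sb_incompat & ~known_incompat) == 0

# A volume with the strict-journal-super flag set only in the compat mask:
sb_compat, sb_incompat = COMPAT_STRICT_JBD, 0x0

# A kernel that predates the flag can still mount it...
assert can_mount(sb_compat, sb_incompat, known_compat=0x0, known_incompat=0x0)
# ...but the old tool bails out with "unsupported feature(s)".
assert not old_tool_accepts(sb_compat, sb_incompat, known_compat=0x0, known_incompat=0x0)
```

This matches the thread: mount should work despite the compat flag, while the 1.2-era tunefs refuses to open the device.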
Re: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5
I'll look at the tunefs issue. But the other one does not make sense. strict_jbd is a compat flag. Mount should work. What is the mount error? As in, in dmesg.

On 09/25/2011 04:43 AM, Ulf Zimmermann wrote:

As tunefs.ocfs2 wasn't working for us, I tried to mkfs.ocfs2 the volumes again with --fs-feature-level=max-compat. This still turns on strict-journal-super, and there seems to be no way around this? This makes the volume not compatible with OCFS2 1.2.9.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Ulf Zimmermann
Sent: Sunday, September 25, 2011 1:43 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Problem with tunefs.ocfs2, similar to fsck.ocfs2 on EL5

We are running into a problem which looks like the same one we had with fsck.ocfs2 a while back. This is with ocfs2-tools 1.4.4. I am trying to use tunefs.ocfs2 to turn off some features. The program starts up, but then starts eating all available memory and more, and the system starts to swap like crazy. This is exactly the same behavior as the fsck.ocfs2 for which we were given a patched binary. I tried to compile tunefs.ocfs2 from 1.6.x, but I hit the same problem with that binary.

Ulf.
Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list
On 09/21/2011 12:37 PM, Herman L wrote:
On 09/19/2011 08:35 AM, Herman L wrote:

Hi all,

Got a couple of these messages recently, but I don't know what they mean. Can anyone let me know if I need to panic? I'm using OCFS2 compiled from the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

Sep 19 08:07:15 server-1 kernel: [3892420.40] (10387,12):dlm_lockres_release:507 ERROR: Resource W0001b027d69b591f15 not on the Tracking list
Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: W0001b027d69b591f1, owner=1, state=0
Sep 19 08:07:15 server-1 kernel: [3892420.398195] last used: 8197071325, refcnt: 0, on purge list: no
Sep 19 08:07:15 server-1 kernel: [3892420.398197] on dirty list: no, on reco list: no, migrating pending: no
Sep 19 08:07:15 server-1 kernel: [3892420.398198] inflight locks: 0, asts reserved: 0
Sep 19 08:07:15 server-1 kernel: [3892420.398199] refmap nodes: [ ], inflight=0
Sep 19 08:07:15 server-1 kernel: [3892420.398200] granted queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398200] converting queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398201] blocked queue:

Thanks!
Herman

From: Sunil Mushran
To: Herman L
Sent: Monday, September 19, 2011 12:57 PM
Subject: Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0001b027d69b591f15 not on the Tracking list

I've no idea of the state of the source that you are using. The message is a warning indicating a race. While it probably did not affect the functioning, there is no guarantee that that would be the case the next time around. The closest relevant patch is over 2 years old.
http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710

Thanks, Sunil, for responding. I know you can't easily support my setup, but anyway I checked the sources. It looks like the patch you mention is in the sources I compiled from (RHEL6.0 kernel-2.6.32-71.24.1.el6.src.rpm), so I guess the source of the problem is elsewhere.
The fs/ocfs2 directory from the RHEL6 sources I compiled from is almost exactly the same as the mainline 2.6.32 kernel, except:

1) It looks like they implemented the changes in aops.c from the "cleanup blockdev_direct_IO locking" patch that's in 2.6.33.
2) In journal.c, they rename ocfs2_commit_trigger to ocfs2_frozen_trigger, which seems to be from 2.6.35.
3) In cluster/masklog.c they add a const to the mlog_attr_ops declaration.
4) And in quota.h, they are missing #define QFMT_OCFS2 3.

Not sure if that helps any, but thanks in any case!

All those changes are ok. And unrelated. This is a new one.
Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list
I've no idea of the state of the source that you are using. The message is a warning indicating a race. While it probably did not affect the functioning, there is no guarantee that that would be the case the next time around. The closest relevant patch is over 2 years old.
http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710

On 09/19/2011 08:35 AM, Herman L wrote:

Hi all,

Got a couple of these messages recently, but I don't know what they mean. Can anyone let me know if I need to panic? I'm using OCFS2 compiled from the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

Sep 19 08:07:15 server-1 kernel: [3892420.40] (10387,12):dlm_lockres_release:507 ERROR: Resource W0001b027d69b591f15 not on the Tracking list
Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: W0001b027d69b591f1, owner=1, state=0
Sep 19 08:07:15 server-1 kernel: [3892420.398195] last used: 8197071325, refcnt: 0, on purge list: no
Sep 19 08:07:15 server-1 kernel: [3892420.398197] on dirty list: no, on reco list: no, migrating pending: no
Sep 19 08:07:15 server-1 kernel: [3892420.398198] inflight locks: 0, asts reserved: 0
Sep 19 08:07:15 server-1 kernel: [3892420.398199] refmap nodes: [ ], inflight=0
Sep 19 08:07:15 server-1 kernel: [3892420.398200] granted queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398200] converting queue:
Sep 19 08:07:15 server-1 kernel: [3892420.398201] blocked queue:

Thanks!
Herman
Re: [Ocfs2-users] 11gr1 RAC + ocfs2 node2 is down and not able to mount the ocfs2 FS on node1
The connect is failing. One of the main reasons is a firewall. See if iptables is running. Check on both nodes. If so, shut it down or add a rule to allow traffic on the o2cb port.

On 09/18/2011 08:57 PM, veeraa bose wrote:

Hi All,

We have a two-node 11gR1 RAC (we used ocfs2 for CRS and ASM for DB data). Now node2 is down and node1 got rebooted, and since node1 came back up, the ocfs2 FS used for CRS is not getting mounted. The error is:

#/etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2)
mount.ocfs2: Transport endpoint is not connected while mounting /dev/mapper/vg_oracle_shared-RAC--DG--CLUS--01 on /u02/ocfs2/RAC-DG-CLUS-01. Check 'dmesg' for more information on this error.
mount.ocfs2: Transport endpoint is not connected while mounting /dev/mapper/vg_oracle_shared-RAC--DG--CLUS--02 on /u02/ocfs2/RAC-DG-CLUS-02. Check 'dmesg' for more information on this error.
mount.ocfs2: Transport endpoint is not connected while mounting /dev/mapper/vg_oracle_shared-global_backup on /global/backup. Check 'dmesg' for more information on this error.
[FAILED]

And below is the log from dmesg:

(o2net,6121,4):o2net_connect_expired:1664 ERROR: no connection established with node 2 after 60.0 seconds, giving up and returning errors.
(mount.ocfs2,7327,12):dlm_request_join:1036 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_try_to_join_domain:1210 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_join_domain:1488 ERROR: status = -107
(mount.ocfs2,7327,12):dlm_register_domain:1754 ERROR: status = -107
(mount.ocfs2,7327,12):ocfs2_dlm_init:2808 ERROR: status = -107
(mount.ocfs2,7327,12):ocfs2_mount_volume:1447 ERROR: status = -107
ocfs2: Unmounting device (253,19) on (node 1)

Please guide me on how to mount the ocfs2 FS on node1 and bring the cluster up.

Thanks,
Veera.
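An aside for readers: the repeated `status = -107` in the dmesg output above is a negated Linux errno value, and decoding it explains the mount.ocfs2 message. A small sketch (not part of the original thread):

```python
import errno
import os

def decode_kernel_status(status):
    """Kernel functions report failures as negated errno values;
    map one back to its symbolic name and human-readable message."""
    err = -status
    name = errno.errorcode.get(err, "UNKNOWN")
    return name, os.strerror(err)

# On Linux, errno 107 is ENOTCONN, whose message is the very string
# mount.ocfs2 printed: "Transport endpoint is not connected".
name, msg = decode_kernel_status(-107)
print(name, "-", msg)
```

So every layer from dlm_request_join down to ocfs2_mount_volume is reporting the same thing: the o2net TCP connection to node 2 was never established, which is consistent with the firewall diagnosis above.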
Re: [Ocfs2-users] fsck doesn't fix bad chain
Can you save the o2image of the volume when it is in that state? We'll need that for analysis.

On 09/16/2011 05:41 AM, Andre Nathan wrote:

Hello,

For a while I had seen errors like this in the kernel logs:

OCFS2: ERROR (device drbd5): ocfs2_validate_gd_parent: Group descriptor #69084874 has bad chain 126
File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.

This always happened on the same device, and whenever it happened I ran fsck.ocfs2 -fy /dev/drbd5, which showed messages like these:

[GROUP_FREE_BITS] Group descriptor at block 201309696 claims to have 9893 free bits which is more than 9886 bits indicated by the bitmap. Drop its free bit count down to the total? y
[CHAIN_BITS] Chain 166 in allocator inode 11 has 1264713 bits marked free out of 1516032 total bits but the block groups in the chain have 1264706 free out of 1516032 total. Fix this by updating the chain record? y
[CHAIN_GROUP_BITS] Allocator inode 11 has 79407510 bits marked used out of 365955414 total bits but the chains have 79407911 used out of 365955414 total. Fix this by updating the inode counts? y
[INODE_COUNT] Inode 69085510 has a link count of 0 on disk but directory entry references come to 1. Update the count on disk to match? y

As time passed, the frequency of these issues started to increase, and the last time it happened I decided to run fsck twice in a row, and was surprised to see it show the same messages in both runs. It seems it was unable to fix the problem. I identified the files corresponding to the inodes using debugfs.ocfs2 and copied them to a new place, then moved the copy over the original file, in order to recreate the inodes. Whenever I did that for one inode, the error above happened and the filesystem became read-only, so I had to umount/mount the volume again in order to be able to write to it again. After doing this, I ran fsck.ocfs2 -fy again twice, and no errors were reported.
Since then I haven't seen this problem again. I'm running kernel 2.6.35 and ocfs2-tools 1.6.4. Has anyone else seen an issue like that?

Thanks,
Andre
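The [CHAIN_BITS] message above is fsck comparing a chain record's claimed free-bit count against the sum over the group descriptors in that chain. A toy model of that consistency check (names are illustrative, not fsck.ocfs2 internals):

```python
def check_chain(chain_free_claimed, group_free_bits):
    """Return (ok, actual), where actual is the free-bit count summed
    from the chain's group descriptors. fsck's CHAIN_BITS check flags
    the chain when the chain record disagrees with this sum, and fixes
    it by rewriting the chain record to match the groups."""
    actual = sum(group_free_bits)
    return actual == chain_free_claimed, actual

# Numbers from the fsck output above: chain 166 claimed 1264713 free
# bits, but its block groups accounted for only 1264706. (A single
# aggregate group stands in for the whole chain in this sketch.)
ok, actual = check_chain(1264713, [1264706])
assert not ok and actual == 1264706
```

That fsck reported the same mismatch on two consecutive runs is what made this report interesting: the check is supposed to converge after one repair pass.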
Re: [Ocfs2-users] Linux kernel crash due to ocfs2
I got it. But I still don't see the symbols. Maybe we are corrupting the stack. Maybe this is ppc specific. Do you have an x86/x86_64 box that can access the same volume? If so, I could give you a drop of the same for that arch. Also, have you run fsck on this volume before? One reason o2image could fail is if there is a bad block pointer. While it is supposed to handle all such cases, it is known to miss some.

On 09/16/2011 12:06 AM, Betzos Giorgos wrote:

Please try http://portal-md.glk.gr/ocfs2/core.32578.bz2
Please let me know in case you have any problem downloading it.

Thanks,
George

On Thu, 2011-09-15 at 09:45 -0700, Sunil Mushran wrote:

I was hoping to get a readable stack. Please could you provide a link to the coredump.

On 09/15/2011 02:51 AM, Betzos Giorgos wrote:

Hello,

I am sorry for the delay in responding. Unfortunately, it faulted again. Here is the log, although my email client folds the Memory Map lines. The core file is available.

Thanks,
George

# ./o2image.ppc.dbg /dev/mapper/mpath0 /files_shared/u02.o2image
*** glibc detected *** ./o2image.ppc.dbg: corrupted double-linked list: 0x10075000 ***
=== Backtrace: ===
/lib/libc.so.6[0xfeb1ab4]
/lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
./o2image.ppc.dbg[0x1000d098]
./o2image.ppc.dbg[0x1000297c]
./o2image.ppc.dbg[0x10001eb8]
./o2image.ppc.dbg[0x1000228c]
./o2image.ppc.dbg[0x10002804]
./o2image.ppc.dbg[0x10001eb8]
./o2image.ppc.dbg[0x1000228c]
./o2image.ppc.dbg[0x10002804]
./o2image.ppc.dbg[0x10003bbc]
./o2image.ppc.dbg[0x10004480]
/lib/libc.so.6[0xfe4dc60]
/lib/libc.so.6[0xfe4dea0]
=== Memory map: === (addresses truncated by the mail client)
0010-0012 r-xp 0010 00:00 0 [vdso]
0f43-0f44 r-xp 08:13 180307 /lib/libcom_err.so.2.1
0f44-0f45 rw-p 08:13 180307 /lib/libcom_err.so.2.1
0f90-0f9c r-xp 08:13 180293 /lib/libglib-2.0.so.0.1200.3
0f9c-0f9d rw-p 000b 08:13 180293 /lib/libglib-2.0.so.0.1200.3
0fa4-0fa5 r-xp 08:13 180292 /lib/librt-2.5.so
0fa5-0fa6 r--p 08:13 180292 /lib/librt-2.5.so
0fa6-0fa7 rw-p 0001 08:13 180292 /lib/librt-2.5.so
0fce-0fd0 r-xp 08:13 180291 /lib/libpthread-2.5.so
0fd0-0fd1 r--p 0001 08:13 180291 /lib/libpthread-2.5.so
0fd1-0fd2 rw-p 0002 08:13 180291 /lib/libpthread-2.5.so
0fe3-0ffa r-xp 08:13 180288 /lib/libc-2.5.so
0ffa-0ffb r--p 0016 08:13 180288 /lib/libc-2.5.so
0ffb-0ffc rw-p 0017 08:13 180288 /lib/libc-2.5.so
0ffc-0ffe r-xp 08:13 180287 /lib/ld-2.5.so
0ffe-0fff r--p 0001 08:13 180287 /lib/ld-2.5.so
0fff-1000 rw-p 0002 08:13 180287 /lib/ld-2.5.so
1000-1005 r-xp 08:13 7487795 /root/o2image.ppc.dbg
1005-1006 rw-p 0004 08:13 7487795 /root/o2image.ppc.dbg
1006-1009 rwxp 1006 00:00 0 [heap]
f768-f7ff rw-p f768 00:00 0
ff9a-ffaf rw-p ff9a 00:00 0 [stack]
Aborted (core dumped)

On Thu, 2011-09-08 at 12:10 -0700, Sunil Mushran wrote:

http://oss.oracle.com/~smushran/o2image.ppc.dbg
Use the above executable. Hoping it won't fault. But if it does, email me the backtrace. That trace will be readable, as the exec has debugging symbols enabled.

On 09/07/2011 11:24 PM, Betzos Giorgos wrote:

# rpm -q ocfs2-tools
ocfs2-tools-1.4.4-1.el5.ppc

On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:

Version of ocfs2-tools?

On 09/07/2011 09:10 AM, Betzos Giorgos wrote:

Hello,

I tried what you suggested, but here is what I got:

# o2image /dev/mapper/mpath0 /files_shared/u02.o2image
*** glibc detected *** o2image: corrupted double-linked list: 0x10045000 ***
=== Backtrace: ===
/lib/libc.so.6[0xfeb1ab4]
/lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
o2image[0x10007bb0]
o2image[0x10002748]
o2image[0x10001f50]
o2image[0x10002334]
o2image[0x100026a0]
o2image[0x10001f50]
o2image[0x10002334]
o2image[0x100026a0]
o2image[0x1000358c]
o2image[0x10003e28]
/lib/libc.so.6[0xfe4dc60]
/lib/libc.so.6[0xfe4dea0]
=== Memory map: ===
0010-0012 r-xp 0010 00:00 0 [vdso]
0f55-0f56 r-xp 08:13 2881590 /lib/libcom_err.so.2.1
0f56-0f57 rw-p 08:13 2881590 /lib/libcom_err.so.2.1
0f90-0f9c r-xp 08:13 2881576
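When a stripped binary faults like this, the raw return addresses in the glibc backtrace can still be resolved offline (for example with `addr2line -e <binary> <addr>`) once a build with symbols is available. A small, purely illustrative helper that pulls the addresses out of backtrace lines in the format shown above (glibc's `backtrace_symbols` style, `binary[0xADDR]` or `lib(sym+off)[0xADDR]`):

```python
import re

# The bracketed value at the end of each glibc backtrace line is the
# raw return address, e.g. "o2image[0x10007bb0]" or
# "/lib/libc.so.6(cfree+0xc8)[0xfeb5b68]".
ADDR_RE = re.compile(r"\[(0x[0-9a-fA-F]+)\]\s*$")

def backtrace_addresses(lines):
    """Extract the hex return addresses, e.g. to feed to addr2line."""
    out = []
    for line in lines:
        m = ADDR_RE.search(line)
        if m:
            out.append(m.group(1))
    return out

trace = [
    "/lib/libc.so.6(cfree+0xc8)[0xfeb5b68]",
    "o2image[0x10007bb0]",
]
print(backtrace_addresses(trace))  # ['0xfeb5b68', '0x10007bb0']
```

This is only a triage aid; as the thread shows, the real path forward here was a debug build plus the coredump itself.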
Re: [Ocfs2-users] Syslog reports (ocfs2_wq, 15527, 2):ocfs2_orphan_del:1841 ERROR: status = -2
drwxr-xr-x 2 0 0 4096 21-Jun-2008 16:42 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 12:01 ..
debugfs: ls -l //orphan_dir:0002
14 drwxr-xr-x 2 0 0 4096 22-May-2008 12:01 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 12:01 ..
debugfs: ls -l //orphan_dir:0003
15 drwxr-xr-x 2 0 0 4096 22-May-2008 12:01 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 12:01 ..

Working on /dev/mapper/mpath20p1
debugfs.ocfs2 1.4.4
debugfs: ls -l //orphan_dir:0000
12 drwxr-xr-x 2 0 0 4096 3-Jun-2008 16:59 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:58 ..
debugfs: ls -l //orphan_dir:0001
13 drwxr-xr-x 2 0 0 4096 21-Jun-2008 17:39 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:58 ..
debugfs: ls -l //orphan_dir:0002
14 drwxr-xr-x 2 0 0 4096 22-May-2008 11:58 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:58 ..
debugfs: ls -l //orphan_dir:0003
15 drwxr-xr-x 2 0 0 4096 22-May-2008 11:58 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:58 ..

Working on /dev/mapper/mpath18p1
debugfs.ocfs2 1.4.4
debugfs: ls -l //orphan_dir:0000
12 drwxr-xr-x 2 0 0 4096 9-Jun-2008 13:54 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:56 ..
debugfs: ls -l //orphan_dir:0001
13 drwxr-xr-x 2 0 0 4096 22-May-2008 11:56 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:56 ..
debugfs: ls -l //orphan_dir:0002
14 drwxr-xr-x 2 0 0 4096 22-May-2008 11:56 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:56 ..
debugfs: ls -l //orphan_dir:0003
15 drwxr-xr-x 2 0 0 4096 22-May-2008 11:56 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:56 ..

Working on /dev/mapper/mpath19p1
debugfs.ocfs2 1.4.4
debugfs: ls -l //orphan_dir:0000
12 drwxr-xr-x 2 0 0 4096 3-Jun-2008 17:47 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:57 ..
debugfs: ls -l //orphan_dir:0001
13 drwxr-xr-x 2 0 0 4096 30-Aug-2009 14:55 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:57 ..
debugfs: ls -l //orphan_dir:0002
14 drwxr-xr-x 2 0 0 4096 22-May-2008 11:57 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:57 ..
debugfs: ls -l //orphan_dir:0003
15 drwxr-xr-x 2 0 0 4096 22-May-2008 11:57 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:57 ..
Working on /dev/mapper/mpath33p1
debugfs.ocfs2 1.4.4
debugfs: ls -l //orphan_dir:0000
12 drwxr-xr-x 2 0 0 4096 12-Dec-2008 13:41 .
6 drwxr-xr-x 6 0 0 4096 21-Nov-2008 10:54 ..
debugfs: ls -l //orphan_dir:0001
13 drwxr-xr-x 2 0 0 4096 21-Nov-2008 10:54 .
6 drwxr-xr-x 6 0 0 4096 21-Nov-2008 10:54 ..
debugfs: ls -l //orphan_dir:0002
14 drwxr-xr-x 2 0 0 4096 21-Nov-2008 10:54 .
6 drwxr-xr-x 6 0 0 4096 21-Nov-2008 10:54 ..
debugfs: ls -l //orphan_dir:0003
15 drwxr-xr-x 2 0 0 4096 21-Nov-2008 10:54 .
6 drwxr-xr-x 6 0 0 4096 21-Nov-2008 10:54 ..
[root@ausracdbd01 tmp]#

From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Thursday, September 15, 2011 10:04 AM
To: Daniel Keisling
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Syslog reports (ocfs2_wq, 15527, 2):ocfs2_orphan_del:1841 ERROR: status = -2

The issue that caused it has been fixed. The fix is here:
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=b6f3de3fd54026df748bfd1449bbe31b9803f8f7
The actual problem could have happened much earlier. 1.4.4 is showing the messages as it is more aggressive (than 1.4.1) in cleaning up the orphans. By default, the fs scans for orphans once every 10 mins on a node in the cluster. fsck should fix it. I would have
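All of the //orphan_dir listings above contain only "." and "..", i.e. no orphaned inodes are pending deletion on any slot. A small illustrative sketch of scanning such debugfs.ocfs2 `ls -l` output for real orphan entries (the parsing assumes the listing format shown above, where the entry name is the last whitespace-separated field):

```python
def orphan_entries(debugfs_ls_output):
    """Given the text of an `ls -l //orphan_dir:NNNN` listing, return
    the entry names other than '.' and '..' (i.e. actual orphans)."""
    orphans = []
    for line in debugfs_ls_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[-1]  # entry name is the final column
        if name not in (".", ".."):
            orphans.append(name)
    return orphans

# Listing shape taken from the debugfs output above: only '.' and '..',
# so the orphan directory for this slot is empty.
sample = """12 drwxr-xr-x 2 0 0 4096 3-Jun-2008 16:59 .
6 drwxr-xr-x 6 0 0 4096 22-May-2008 11:58 .."""
assert orphan_entries(sample) == []
```

An empty orphan directory is consistent with Sunil's note: the ocfs2_orphan_del ENOENT messages came from a since-fixed race in the periodic orphan scan, not from stuck orphans on disk.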