Re: [Ocfs2-users] Recommended settings for mkfs.ocfs2
Andrew,

I would make sure you use large cluster and block sizes if possible, with the inband FS option enabled (you will lose some space, but I've noticed it tends to run a bit better).

As for the bug, it's one I've been fighting with. However, using only 2 nodes will make it take a long time to occur (it took 2 years on my cluster with 6 nodes).

David

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Andrew Robert Nicols
Sent: Monday, April 19, 2010 9:26 AM
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Recommended settings for mkfs.ocfs2

Hi Brian,

Thank you for taking the time to reply.

On Mon, Apr 19, 2010 at 08:53:47AM -0500, Brian Kroth wrote:
> lenny-backports has a 2.6.32 based kernel that might already have the free space fix in it. I haven't checked yet.

From what I can tell, the ENOSPC issue isn't fixed until 2.6.33 (http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189#c25), so the version in backports (or even Squeeze) isn't much help yet, I'm afraid.

> Also you don't really explain what you're trying to use the data store for (eg: lots of small files, video files, heavy writes, heavy reads, random, sequential, etc.). It may impact the options you want to give to mkfs.

Sorry - that didn't occur to me. This is going to be a file store for a variety of user-submitted data for a web application (Moodle). At present we have a variety of:
* videos
* audio
* images
* database backups (gunzipped tar)
* large files (primarily zip)
* small files

The activity is primarily reads, with some writes, but I'm not sure of the exact characteristics at present. I'd guess fairly random rather than sequential, and there are periods with heavy writes. Files are served to 6 frontend web servers over NFS for serving with Apache2. We've currently got 2.2TB of space used.

Thank you for your input - if there's anything else which would be useful, I'll see if I can provide it.
Andrew -- Systems Developer e: andrew.nic...@luns.net.uk im: a.nic...@jabber.lancs.ac.uk t: +44 (0)1524 5 10147 Lancaster University Network Services is a limited company registered in England and Wales. Registered number: 04311892. Registered office: University House, Lancaster University, Lancaster, LA1 4YW ___ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users
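As a rough illustration of the advice at the top of this thread, the sketch below assembles an mkfs.ocfs2 command line and prints it instead of running it. It assumes that the "inband" option David mentions is the inline-data feature, and that the installed ocfs2-tools accepts the sparse and inline-data names for --fs-features (check man mkfs.ocfs2 first). The device path, label, slot count, and sizes are placeholders, not values recommended by anyone in the thread.

```shell
#!/bin/sh
# Build (but do not run) an mkfs.ocfs2 command line.
# Every value passed in is an illustrative placeholder.
build_mkfs_cmd() {
    device=$1    # block device to format (hypothetical path)
    label=$2     # volume label (-L)
    slots=$3     # number of node slots (-N)
    block=$4     # block size (-b), for example 4K
    cluster=$5   # cluster size (-C), for example 1M
    # Assumption: "inband" in the thread means the inline-data feature.
    printf 'mkfs.ocfs2 -b %s -C %s -N %s -L %s --fs-features=sparse,inline-data %s\n' \
        "$block" "$cluster" "$slots" "$label" "$device"
}

# Dry run: print the command so it can be reviewed before being run by hand.
build_mkfs_cmd /dev/mapper/example filestore 8 4K 1M
```

Printing the command first is deliberate: formatting is destructive, and the right sizes depend on the workload Brian asks about.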
Re: [Ocfs2-users] OCFS2 1.4.7-1 and OCFS2 Tools 1.4.4-1 released
Sunil,

Any chance I can get a timeline on having a defrag tool to make noncontiguous files become contiguous?

David

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Monday, April 19, 2010 1:08 PM
To: ocfs2-annou...@oss.oracle.com; ocfs2-users
Subject: [Ocfs2-users] OCFS2 1.4.7-1 and OCFS2 Tools 1.4.4-1 released

All,

We are pleased to announce the release of OCFS2 1.4.7-1 and OCFS2 Tools 1.4.4-1 for Oracle's and Red Hat's Enterprise Linux 5 Update 2 and higher.

Oracle's Unbreakable Linux Network users who are subscribing to the OCFS2 1.4 packages for Enterprise Linux 5 channel can upgrade to this release by running up2date.
http://oss.oracle.com/pipermail/el-errata/2010-April/001438.html
http://oss.oracle.com/pipermail/el-errata/2010-April/001439.html

Red Hat's Enterprise Linux 5 users can download and install the relevant file system and tools packages from oss.oracle.com.
http://oss.oracle.com/projects/ocfs2/files/RedHat/RHEL5/
http://oss.oracle.com/projects/ocfs2-tools/files/RedHat/RHEL5/

COMPATIBILITY

This release is fully compatible with earlier releases of OCFS2 1.4. Users can upgrade their nodes to the new version in a rolling manner.

This release is on-disk compatible with OCFS2 1.2.x. Users can install the software and mount the older volumes as-is. However, a rolling upgrade from 1.2 to 1.4 will not work.

RECOMMENDATION

This is just to remind users to add the noatime mount option to the mounts that hold the Oracle datafiles, redologs, archivelogs, voting file, etc. This is for OCFS2 1.4 only.

WHAT'S CHANGED

This release includes mostly bug fixes. The one new feature we've added is not much of one. It allows users to change the fence method from the default of machine reset to panic. This was requested by some developers who are interested in the vmcore dump that is generated when a machine panics.
So unless you want the same, our recommendation would be for you to leave the fence method as-is. Do note that the fence method of a node can be toggled between reset and panic at any time.

To view the current fence method, do:
# cat /sys/kernel/config/cluster/CLUSTER/fence_method
reset

To change to panic, do:
# echo panic > /sys/kernel/config/cluster/CLUSTER/fence_method
# cat /sys/kernel/config/cluster/CLUSTER/fence_method
panic

The bug fixes can be classified under three groupings.

The first group involves cluster locking, specifically the downconverting of cluster locks. The links below explain two of the more interesting problems. Our thanks to David Teigland of Red Hat for helping us fix these problems.
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=e8ef96c444326e4262fd371729e7beebda1af4d1
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=39febfd5ee7948c018b667e0b909886e1cfa1235

The second group of bug fixes concerns NFS support. This release fixes an nfsd lockup issue and a stale inode read problem. Again, the links below describe the problems in detail.
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=2d561d3636c80af24063a74ae8c817661c574d78
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=aa20775d1e7feba9b22e761fa9b69bd5c3f043bd

The last group of fixes concerns users encountering erroneous out-of-space errors. Our analysis found that the errors were triggered because the file system could not grow the extent block allocator because of free space fragmentation. The extent block allocator houses the extent blocks that are used when an inode needs more than approx 250 extents to describe a file. So the way this plays out is that, in the early going, when free space is contiguous, the inodes rarely use the extent blocks. They start getting used just when the free space is fragmented enough that the extent allocator cannot be grown. The sad part is that the space required by this allocator is typically very small.
So small that there was no reason we could not allocate it up front. In this release, the format tool, mkfs.ocfs2, reserves up to 0.3% of the volume for this allocator. Furthermore, if the file system finds that this allocator cannot be grown, it can now steal free blocks from another slot's allocator. The first fix will help newly formatted volumes. The second fix will also help existing volumes.

The final fix for this problem will be provided in the next patch update (1.4.8). In it, we will allow the block allocators (inode and extent) to be grown even when a 4MB contiguous chunk is not available. Users will be able to activate this feature (discontiguous block groups) on existing volumes. This feature is currently in testing.

BUGS FIXED

ossbz#970 Unfair postponement of local lock requests (Livelock)
ossbz#1175 BUG in dlm_free_dead_locks() (Oops during dlm recovery)
ossbz#1178 BUG in ocfs2_prepare_downconvert() (Oops during downconvert)
ossbz#1189 Free space trouble in a ocfs2 partition
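The fence method commands in the announcement can be wrapped in a small helper. This is only a sketch: the function names and the CONFIGFS_ROOT override are made up for illustration, and the configfs layout (cluster/CLUSTER/fence_method) is taken from the cat commands in the announcement itself. The override exists so the helper can be tried against a scratch directory before touching a live node.

```shell
#!/bin/sh
# Read or change the o2cb fence method through configfs.
# CONFIGFS_ROOT is a hypothetical override for trying this on a scratch directory.
CONFIGFS_ROOT=${CONFIGFS_ROOT:-/sys/kernel/config}

# Print the path of the fence_method file for cluster $1.
fence_method_file() {
    printf '%s/cluster/%s/fence_method\n' "$CONFIGFS_ROOT" "$1"
}

# Print the current fence method of cluster $1.
get_fence_method() {
    cat "$(fence_method_file "$1")"
}

# Set the fence method of cluster $1 to $2; only reset and panic are accepted.
set_fence_method() {
    case $2 in
        reset|panic) printf '%s\n' "$2" > "$(fence_method_file "$1")" ;;
        *) echo "unknown fence method: $2" >&2; return 1 ;;
    esac
}
```

On a real node the calls would look like `get_fence_method CLUSTER` and `set_fence_method CLUSTER panic`, run as root, with CLUSTER replaced by the cluster name from cluster.conf.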
Re: [Ocfs2-users] new ocfs2 release?
Sunil,

Will this bug be corrected? I think it is what I'm running into:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189

I see it was committed to the 1.4 branch, but I'm not sure if it had been merged into 1.4.7.

David

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Thursday, April 15, 2010 1:46 PM
To: David Murphy
Cc: li...@svrinformatica.it; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] new ocfs2 release?

The announcement will cover this topic.

David Murphy wrote:

Sunil,

Is there an online/offline defrag in the 1.4.7 version? We are having some out-of-space issues due to fragmentation. I am preparing to move to a new partition that will give me sparse and inline functions, but I feel this could become an issue again.

David

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Thursday, April 15, 2010 12:14 PM
To: li...@svrinformatica.it
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] new ocfs2 release?

We are hoping to release it any day now.

Have you filed a bug about your issue? I have no recollection of any reports of such an issue. Orphan scanning has not changed in 1.4.7. File a bz. We'll need to get more information to understand the problem you are encountering.

Mailing List SVR wrote:

Hi ocfs2 developers,

is there any news about the schedule for a new ocfs2 release that solves the current bugs/limitations? I can see a 1.4.7 release tagged here:
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=summary

Is there a planned release date?

In my environment (about 500 files, with 30 new files/deletions per day) I see load up to 1000 and I/O almost blocked for some hours of the day. I think this is caused by:

"The file system now scans the orphan directory at a regular interval to delete orphaned files that are no longer in use"

Is this behaviour still present in the 1.4.7 tag?

thanks
Nicola
Re: [Ocfs2-users] new ocfs2 release?
Sunil,

Is there an online/offline defrag in the 1.4.7 version? We are having some out-of-space issues due to fragmentation. I am preparing to move to a new partition that will give me sparse and inline functions, but I feel this could become an issue again.

David

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Thursday, April 15, 2010 12:14 PM
To: li...@svrinformatica.it
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] new ocfs2 release?

We are hoping to release it any day now.

Have you filed a bug about your issue? I have no recollection of any reports of such an issue. Orphan scanning has not changed in 1.4.7. File a bz. We'll need to get more information to understand the problem you are encountering.

Mailing List SVR wrote:

Hi ocfs2 developers,

is there any news about the schedule for a new ocfs2 release that solves the current bugs/limitations? I can see a 1.4.7 release tagged here:
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=summary

Is there a planned release date?

In my environment (about 500 files, with 30 new files/deletions per day) I see load up to 1000 and I/O almost blocked for some hours of the day. I think this is caused by:

"The file system now scans the orphan directory at a regular interval to delete orphaned files that are no longer in use"

Is this behaviour still present in the 1.4.7 tag?

thanks
Nicola
Re: [Ocfs2-users] Best Linux distribution for OCFS2?
CentOS and Fedora both work well with OCFS2; just be aware that CentOS/RHEL are on a seriously outdated kernel (2.6.18 vs 2.6.30), so the OCFS2 module is an added kmod.

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sérgio Surkamp
Sent: Thursday, April 15, 2010 4:55 PM
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Best Linux distribution for OCFS2?

Watch out for SLES11, as it needs an extra package license to use OCFS2.
http://forums.novell.com/novell-product-support-forums/suse-linux-enterprise-server-sles/sles-configure-administer/366627-sles-11-ocfs2.html

What about CentOS with stock Oracle OCFS2 packages?

Regards,
Sérgio

On Thu, 15 Apr 2010 10:34:52 -0700, Patrick J. LoPresti lopre...@gmail.com wrote:

OK, I realize this is a loaded question, but I really am interested in some feedback.

I am preparing to create a new OCFS2 cluster -- several of them, actually -- and I have the luxury of choosing my Linux distribution. I am agnostic on this, save for a slight bias against Fedora Core (and, by implication, Red Hat) due to some bad experiences a few years ago.

My current short list of options reads:

Ubuntu Lucid Lynx
Suse Linux Enterprise Server 11
OpenSuse 11.2 or 11.3

Although I have my choice of distributions now, and I have a couple of months to prototype, once the choice is made I will be stuck supporting the configuration for years; hardware and O/S changes will be costly. So I want to get this right.

I have been reading this mailing list for a while, and it sounds like OCFS2 has had some fairly serious bugs fixed just in the last few weeks and months (e.g., ENOSPC when there is plenty of space). Which distribution, if any, has incorporated these fixes? Which would be most likely to provide such fixes in the future?

I am also curious to hear success stories, failure stories, advocacy, warnings... Feel free to reply to me personally if you do not want to spam the list, and I will post a summary.
Possibly relevant other technologies I intend to use:

iSCSI over 10GigE
Linux md software RAID-0 (my iSCSI hardware RAID units already provide redundancy)

My configuration will be storing 100+ terabytes on a single partition. (Sounds crazy, perhaps? My application is a little... special.)

Thanks!

- Pat

--
Sérgio Surkamp | Gerente de Rede :: ser...@gruposinternet.com.br
Grupos Internet S.A.
R. Lauro Linhares, 2123 Torre B - Sala 201
Trindade - Florianópolis - SC
+55 48 3234-4109
http://www.gruposinternet.com.br
Re: [Ocfs2-users] Odd error on FC12 with ocfs2
[r...@web1 /dev]# debugfs.ocfs2 -l TCP off
[r...@web1 /dev]# mount /dev/mapper/OCFS2_200Gp1 -v
device=/dev/mapper/OCFS2_200Gp1
mount.ocfs2: Transport endpoint is not connected while mounting /dev/mapper/OCFS2_200Gp1 on /mnt/appshare. Check 'dmesg' for more information on this error.
[r...@web1 /dev]# dmesg

DMESG:

Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656 ERROR: no connection established with node 4 after 30.0 seconds, giving up and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_request_join:1035 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_try_to_join_domain:1209 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_join_domain:1487 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_register_domain:1753 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):o2cb_cluster_connect:313 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_dlm_init:2963 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_mount_volume:1788 ERROR: status = -107
Mar 30 10:23:38 web1 kernel: ocfs2: Unmounting device (253,1) on (node 0)

DEBUGFS:

debugfs: curdev
/dev/mapper/OCFS2_200Gp1

debugfs: controld dump
controld: Unable to access cluster service while obtaining the debug buffer

debugfs: slotmap
Slot#  Node#
0      3
1      5
2      2
4      4
5      6

debugfs: stats
Revision: 0.90
Mount Count: 0
Max Mount Count: 20
State: 0
Errors: 0
Check Interval: 0
Last Check: Mon Mar 29 10:53:52 2010
Creator OS: 0
Feature Compat: 1 backup-super
Feature Incompat: 16 sparse
Tunefs Incomplete: 0
Feature RO compat: 1 unwritten
Root Blknum: 5
System Dir Blknum: 6
First Cluster Group Blknum: 3
Block Size Bits: 12
Cluster Size Bits: 12
Max Node Slots: 6
Extended Attributes Inline Size: 0
Label: OCFS2_APPSHARE_200G
UUID: D6E0DD0AAC8844ED94A4A459FBB6F7FF
UUID_hash: 0 (0x0)
Cluster stack: classic o2cb
Inode: 2
Mode: 00
Generation: 2428834932 (0x90c51474)
FS Generation: 2428834932 (0x90c51474)
CRC32:
ECC:
Type: Unknown
Attr: 0x0
Flags: Valid System Superblock
Dynamic Features: (0x0)
User: 0 (root)
Group: 0 (root)
Size: 0
Links: 0
Clusters: 52428119
ctime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
atime: 0x0 -- Wed Dec 31 18:00:00 1969
mtime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
dtime: 0x0 -- Wed Dec 31 18:00:00 1969
ctime_nsec: 0x -- 0
atime_nsec: 0x -- 0
mtime_nsec: 0x -- 0
Last Extblk: 0
Sub Alloc Slot: Global
Sub Alloc Bit: 65535

It doesn't appear that any extra debug logging was actually created.
David

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, March 29, 2010 10:23 PM
To: Angelo McComis
Cc: David Murphy; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

No

On Mar 29, 2010, at 8:10 PM, Angelo McComis ang...@mccomis.com wrote:

Does it matter that the nodes are numbered 1-6 instead of 0-5?

On Mon, Mar 29, 2010 at 4:25 PM, Sunil Mushran sunil.mush...@oracle.com wrote:

Enable some debugging.
#debugfs.ocfs2 -l TCP allow
...do mount...
#debugfs.ocfs2 -l TCP off

David Murphy wrote:

[r...@web2 ~]# nc -z 192.168.102.140
Connection to 192.168.102.140 port [tcp/cbt] succeeded!
[r...@web1 /etc/sysconfig/network-scripts]# nc -z 192.168.102.141
Connection to 192.168.102.141 port [tcp/cbt] succeeded!

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com
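When reading dmesg output like the above, it can help to reduce it to the list of peers that o2net gave up on and to name the status code. The sketch below does only that. The function names are invented for illustration; the log format is the one shown in this thread, and -107 is the Linux errno ENOTCONN, which matches the "Transport endpoint is not connected" message from mount.ocfs2.

```shell
#!/bin/sh
# List the node numbers that o2net gave up connecting to, one per line, de-duplicated.
# Reads kernel log text on stdin, for example: dmesg | failed_nodes
failed_nodes() {
    sed -n 's/.*o2net_connect_expired.*no connection established with node \([0-9][0-9]*\) after.*/\1/p' |
        sort -n -u
}

# Name the status code seen in this thread.
# Only the single value that appears in the logs above is covered.
describe_status() {
    case $1 in
        -107) echo "ENOTCONN (Transport endpoint is not connected)" ;;
        *)    echo "unknown status $1" ;;
    esac
}
```

Note that Sunil's suggested order is allow, mount, then off; the extra TCP messages only show up in dmesg if tracing is still enabled while the mount is attempted.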
Re: [Ocfs2-users] Odd error on FC12 with ocfs2
Maybe I misspoke then. At any rate, the machine clearly has working networking to each of the other nodes, but the node thinks it can't talk to the rest of the cluster, so it won't join the cluster.

However, nmap/telnet clearly show that it can in fact talk to those devices on the correct port, and all the other devices are active and talking to each other.

This is with iptables and IPv6 iptables disabled and SELinux in disabled mode.

David Murphy

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Thursday, March 25, 2010 4:46 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

hmm.. o2cb_ctl makes no connections. It just reads the cluster.conf and populates configfs. AFAIK.

David Murphy wrote:

We had 6 nodes running CentOS 5.4 using 1.4.3 ocfs2-tools. I decided to rebuild one node with FC12, which is working fine. However:

Nmap 192.168.200.112 shows as open

and o2cb_ctl is timing out when trying to connect to that node, which then causes a 107 error. This happens with all nodes, and all nodes show as open via nmap from the FC machine.

Is there a way to further debug this to see what exactly o2cb_ctl is seeing when trying to connect?

David
Re: [Ocfs2-users] Odd error on FC12 with ocfs2
[r...@web2 ~]# nc -z 192.168.102.140
Connection to 192.168.102.140 port [tcp/cbt] succeeded!
[r...@web1 /etc/sysconfig/network-scripts]# nc -z 192.168.102.141
Connection to 192.168.102.141 port [tcp/cbt] succeeded!

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, March 29, 2010 5:08 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

What happens when you use netcat to ping the node?
nc -z host.example.com

David Murphy wrote:

Some additional data:

From Web1 (new Fedora machine) to Web2:

[r...@web1 /etc/sysconfig/network-scripts]# nmap 192.168.102.141
Starting Nmap 5.21 ( http://nmap.org ) at 2010-03-29 16:56 CDT
Nmap scan report for 192.168.102.141
Host is up (0.76s latency).
Not shown: 993 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
80/tcp   open  http
81/tcp   open  hosts2-ns
111/tcp  open  rpcbind
5666/tcp open  nrpe
/tcp     open  unknown
9102/tcp open  jetdirect
MAC Address: 00:50:56:A3:58:5D (VMware)
Nmap done: 1 IP address (1 host up) scanned in 1.18 seconds

From web2 to web1 (new Fedora machine):

[r...@web2 ~]# nmap 192.168.102.140
Starting Nmap 5.00 ( http://nmap.org ) at 2010-03-29 16:40 CDT
Interesting ports on 192.168.102.140:
Not shown: 994 closed ports
PORT    STATE SERVICE
22/tcp  open  ssh
80/tcp  open  http
81/tcp  open  hosts2-ns
111/tcp open  rpcbind
443/tcp open  https
/tcp    open  unknown
MAC Address: 00:50:56:A3:14:62 (VMWare)
Nmap done: 1 IP address (1 host up) scanned in 1.31 seconds

Cluster.conf:

cluster:
        node_count = 6
        name = appshare

node:
        ip_port =
        ip_address = 192.168.102.140
        number = 1
        name = web1
        cluster = appshare

node:
        ip_port =
        ip_address = 192.168.102.141
        number = 2
        name = web2
        cluster = appshare

node:
        ip_port =
        ip_address = 192.168.102.142
        number = 3
        name = web3
        cluster = appshare

node:
        ip_port =
        ip_address = 192.168.102.111
        number = 4
        name = rgapp1
        cluster = appshare

node:
        ip_port =
        ip_address = 192.168.102.122
        number = 5
        name = deploy
        cluster = appshare

node:
        ip_port =
        ip_address = 192.168.102.112
        number = 6
        name = app1
        cluster = appshare

DMESG on WEB1:

OCFS2 1.5.0
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 4 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
(1262,0):dlm_request_join:1035 ERROR: status = -107
(1262,0):dlm_try_to_join_domain:1209 ERROR: status = -107
(1262,0):dlm_join_domain:1487 ERROR: status = -107
(1262,0):dlm_register_domain:1753 ERROR: status = -107
(1262,0):o2cb_cluster_connect:313 ERROR: status = -107
(1262,0):ocfs2_dlm_init:2963 ERROR: status = -107
(1262,0):ocfs2_mount_volume:1788 ERROR: status = -107
ocfs2: Unmounting device (253,1) on (node 0)
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
(1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
(1323,0):dlm_request_join:1035 ERROR: status = -107
(1323,0):dlm_try_to_join_domain:1209 ERROR: status = -107
(1323,0):dlm_join_domain:1487 ERROR: status = -107
(1323,0):dlm_register_domain
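The manual nc checks in this thread can be driven from cluster.conf so that every peer is probed on the port the cluster is actually configured for. A minimal sketch, assuming the usual stanza layout of an o2cb cluster.conf (node: followed by indented key = value lines). The function names are made up, and the ip_port values were lost from the archived post, so any port used when trying this is an assumption (the tcp/cbt service name that nc prints above is the /etc/services entry for 7777).

```shell
#!/bin/sh
# Print "name address port" for every node stanza in an o2cb cluster.conf ($1).
list_nodes() {
    awk '
        function emit() { if (in_node && name != "") print name, addr, port }
        /^[ \t]*node:/    { emit(); in_node = 1; name = addr = port = ""; next }
        /^[ \t]*cluster:/ { emit(); in_node = 0; next }
        in_node && $2 == "=" {
            if ($1 == "name")       name = $3
            if ($1 == "ip_address") addr = $3
            if ($1 == "ip_port")    port = $3
        }
        END { emit() }
    ' "$1"
}

# Try a plain TCP connect to each node listed in cluster.conf ($1).
# A successful connect only shows the port is reachable; as this thread
# demonstrates, the o2net handshake can still fail afterwards.
probe_nodes() {
    list_nodes "$1" | while read -r name addr port; do
        if nc -z -w 5 "$addr" "$port"; then
            echo "$name $addr:$port reachable"
        else
            echo "$name $addr:$port NOT reachable"
        fi
    done
}
```

Typical use would be `probe_nodes /etc/ocfs2/cluster.conf` on the node that fails to mount, then the same on a healthy node for comparison.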
Re: [Ocfs2-users] Odd error on FC12 with ocfs2
1.5.0
OCFS2 DLM 1.5.0
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.5.0
OCFS2 User DLM kernel interface loaded
OCFS2 1.5.0
(1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 4 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
(1839,0):dlm_request_join:1035 ERROR: status = -107
(1839,0):dlm_try_to_join_domain:1209 ERROR: status = -107
(1839,0):dlm_join_domain:1487 ERROR: status = -107
(1839,0):dlm_register_domain:1753 ERROR: status = -107
(1839,0):o2cb_cluster_connect:313 ERROR: status = -107
(1839,0):ocfs2_dlm_init:2963 ERROR: status = -107
(1839,0):ocfs2_mount_volume:1788 ERROR: status = -107
ocfs2: Unmounting device (253,1) on (node 0)

So clearly the ocfs2 service thinks it can't connect to the nodes, but nmap sees the connection just fine. And web2 can see the port on web1 just fine, so there is no firewall blocking the connections.

I think it might be that Fedora 12 uses 1.5.0 for the OCFS2 kernel module while CentOS 5.3/5.4 use 1.4.4-1. Am I correct in thinking this?

David

-----Original Message-----
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Thursday, March 25, 2010 6:46 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

hmm.. o2cb_ctl makes no connections. It just reads the cluster.conf and populates configfs. AFAIK.

David Murphy wrote:

We had 6 nodes running CentOS 5.4 using 1.4.3 ocfs2-tools. I decided to rebuild one node with FC12, which is working fine. However:

Nmap 192.168.200.112 shows as open

and o2cb_ctl is timing out when trying to connect to that node, which then causes a 107 error. This happens with all nodes, and all nodes show as open via nmap from the FC machine.

Is there a way to further debug this to see what exactly o2cb_ctl is seeing when trying to connect?

David
[Ocfs2-users] Odd error on FC12 with ocfs2
We had 6 nodes running CentOS 5.4 using 1.4.3 ocfs2-tools. I decided to rebuild one node with FC12, which is working fine. However:

Nmap 192.168.200.112 shows as open

and o2cb_ctl is timing out when trying to connect to that node, which then causes a 107 error. This happens with all nodes, and all nodes show as open via nmap from the FC machine.

Is there a way to further debug this to see what exactly o2cb_ctl is seeing when trying to connect?

David
Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time
I think I found the core issue. The DLM on CentOS is running 1.4.1, but on Ubuntu it's 1.3.3. I can't seem to find any packages for Debian or Ubuntu that upgrade the kernel modules to the 1.4 series. Does anyone know how I can do this?

David

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of David Murphy
Sent: Wednesday, October 21, 2009 10:46 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time

Web2 is our Ubuntu node and Web1 is a new CentOS 5.3 node.

I put 1.4.1-1 on CentOS to match the one on the Ubuntu nodes. Also, I copied the o2cb configs from the Ubuntu node to the CentOS one.

O2CB starts just fine, but I get these errors when I try to start the ocfs2 service or mount the partition:

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting

Does anyone have any ideas what's going on?
Full dmesg, rpm, and dpkg output is below:

OCFS2 Node Manager 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 304d9ff0c301f79f846e3cc423c30674)
OCFS2 DLM 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 96988c7961cf38309cc33396bb27b400)
OCFS2 DLMFS 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 96988c7961cf38309cc33396bb27b400)
OCFS2 User DLM kernel interface loaded
(6203,0):o2net_check_handshake:1205 node web2 (num 2) at 192.168.102.41: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at 192.168.102.11: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at 192.168.102.11: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
OCFS2 1.4.1 Wed Jan 21 11:39:13 PST 2009 (build a1974724e90d3f07ae88531f6a9547a9)
(6240,0):dlm_request_join:1033 ERROR: status = -107
(6240,0):dlm_try_to_join_domain:1207 ERROR: status = -107
(6240,0):dlm_join_domain:1485 ERROR: status = -107
(6240,0):dlm_register_domain:1732 ERROR: status = -107
(6240,0):ocfs2_dlm_init:2662 ERROR: status = -107
(6240,0):ocfs2_mount_volume:1251 ERROR: status = -107
ocfs2: Unmounting device (8,129) on (node 1)

[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# rpm -qa | grep ocfs
ocfs2-tools-1.4.1-1.el5
ocfs2-2.6.18-128.el5-1.4.1-1.el5
[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# ssh web2 dpkg -l | grep ocfs
ii ocfs2-tools 1.4.1-1 tools for managing OCFS2 cluster filesystems
[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]#
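The handshake errors above repeat per node, which makes it hard to see at a glance which peers disagree about the o2net protocol version. The sketch below condenses them to one line per peer. The function name is invented for illustration, and the pattern is written against the exact message format quoted in this thread, so other kernel versions may word it differently.

```shell
#!/bin/sh
# Summarise o2net protocol mismatches from kernel log text on stdin.
# Output columns: node name, node number, address, advertised version, required version.
handshake_mismatches() {
    sed -n 's/.*o2net_check_handshake:[0-9]* node \([^ ]*\) (num \([0-9]*\)) at \([0-9.]*\): advertised net protocol version \([0-9]*\) but \([0-9]*\) is required.*/\1 \2 \3 \4 \5/p' |
        sort -u
}
```

Run as `dmesg | handshake_mismatches`. If every peer advertises the same older version, as here, the local node is the odd one out, which points at the kernel module build rather than at the network or the tools package.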
Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time
We are trying to port our current Ubuntu-based OCFS2 cluster to CentOS 5.2 (RHEL), but Ubuntu is using DLM v. 1.3.9, not 1.4.1, even though its tools are on 1.4.1. So I am getting a network version mismatch.

Is there any way to upgrade the DLM? I have tried updating the kernels. If I upgrade from Ubuntu 8.04 to 9.04, it says the DLM version is 1.5.0, not 1.4.1, which further confuses me.

Basically, I need to have a temporary mixed environment to transition nodes over to CentOS. Is the DLM a kernel module, as I assume it is, or something that ocfs2-tools should be upgrading when those are on 1.4.1+?

David

-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, October 21, 2009 12:23 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time

The production release of ocfs2 (1.2, 1.4, and the upcoming 1.6) is only available for (rh)el and sles. No other distributions.

David Murphy wrote:

I think I found the core issue. The DLM on CentOS is running 1.4.1, but on Ubuntu it's 1.3.3. I can't seem to find any packages for Debian or Ubuntu that upgrade the kernel modules to the 1.4 series. Does anyone know how I can do this?

David

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of David Murphy
Sent: Wednesday, October 21, 2009 10:46 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time

Web2 is our Ubuntu node and Web1 is a new CentOS 5.3 node.

I put 1.4.1-1 on CentOS to match the one on the Ubuntu nodes. Also, I copied the o2cb configs from the Ubuntu node to the CentOS one.
O2CB starts just fine, but I get these errors when I try to start the ocfs2 service or mount the partition:

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting

Does anyone have any ideas what's going on? Full dmesg, rpm, and dpkg output is below:

OCFS2 Node Manager 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 304d9ff0c301f79f846e3cc423c30674)
OCFS2 DLM 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 96988c7961cf38309cc33396bb27b400)
OCFS2 DLMFS 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build 96988c7961cf38309cc33396bb27b400)
OCFS2 User DLM kernel interface loaded
(6203,0):o2net_check_handshake:1205 node web2 (num 2) at 192.168.102.41: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at 192.168.102.11: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at 192.168.102.11: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at 192.168.102.12: advertised net protocol version 8 but 11 is required, disconnecting
OCFS2 1.4.1 Wed Jan 21 11:39:13 PST 2009 (build a1974724e90d3f07ae88531f6a9547a9)
(6240,0):dlm_request_join:1033 ERROR: status = -107
(6240,0):dlm_try_to_join_domain:1207 ERROR: status = -107
(6240,0):dlm_join_domain:1485 ERROR: status = -107
(6240,0):dlm_register_domain:1732 ERROR: status = -107
(6240,0):ocfs2_dlm_init:2662 ERROR: status = -107
(6240,0):ocfs2_mount_volume:1251 ERROR: status = -107
ocfs2: Unmounting device (8,129) on (node 1)

[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# rpm -qa | grep ocfs
ocfs2-tools-1.4.1-1.el5
ocfs2-2.6.18-128.el5-1.4.1-1.el5
[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# ssh web2 dpkg -l | grep ocfs
ii ocfs2-tools 1.4.1-1 tools for managing OCFS2 cluster filesystems
[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]#
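The handshake failures in the dmesg output above can be summarised per node, which makes it easier to see that every peer advertises the same older protocol version. What follows is only a minimal sketch, assuming the log line format quoted in this message; the function name summarize_handshake_failures and the regular expression are illustrative and are not part of ocfs2-tools.

```python
import re

# Matches the o2net_check_handshake lines quoted in the message above.
# The pattern is an assumption based only on those lines.
HANDSHAKE_RE = re.compile(
    r"node (?P<name>\S+) \(num (?P<num>\d+)\) at (?P<ip>[\d.]+):\s+"
    r"advertised net protocol version (?P<got>\d+) but (?P<want>\d+) is required"
)


def summarize_handshake_failures(log_text):
    """Return {node_name: (node_num, ip, advertised, required)}.

    Repeated lines for the same node collapse into a single entry.
    """
    failures = {}
    for m in HANDSHAKE_RE.finditer(log_text):
        failures[m.group("name")] = (
            int(m.group("num")),
            m.group("ip"),
            int(m.group("got")),
            int(m.group("want")),
        )
    return failures


sample = """
(6203,0):o2net_check_handshake:1205 node web2 (num 2) at 192.168.102.41: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42: advertised net protocol version 8 but 11 is required, disconnecting
"""

for name, (num, ip, got, want) in sorted(summarize_handshake_failures(sample).items()):
    print(f"{name} (num {num}, {ip}): advertises {got}, local node requires {want}")
```

Feeding it the full dmesg text instead of the short sample should list each remote node once with the version it advertised.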
[Ocfs2-users] Unsual Segfault (but reboot did not occur and node stayed offline)
My logs on node ID 3:

Dec 16 06:44:03 web3 syslogd 1.5.0#1ubuntu1: restart.
Dec 16 08:43:31 web3 kernel: [10727560.835261] Modules linked in: vmmemctl ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs vmhgfs ext2 dm_round_robin crc32c libcrc32c iscsi_tcp libiscsi scsi_transport_iscsi lp loop ipv6 parport_pc parport psmouse evdev serio_raw pcspkr i2c_piix4 i2c_core container ac button intel_agp agpgart dm_multipath dm_mod ext3 jbd mbcache sr_mod cdrom sg sd_mod ata_piix pata_acpi floppy pcnet32 ata_generic mii mptspi mptscsih mptbase scsi_transport_spi libata scsi_mod thermal processor fan vmxnet vesafb fbcon tileblit font bitblit softcursor
Dec 16 08:43:31 web3 kernel: [10727560.843108]
Dec 16 08:43:31 web3 kernel: [10727560.843900] Pid: 4856, comm: o2net Not tainted (2.6.24-19-virtual #1)
Dec 16 08:43:31 web3 kernel: [10727560.844724] EIP: 0062:[f8e682bb] EFLAGS: 00010202 CPU: 0
Dec 16 08:43:31 web3 kernel: [10727560.845566] EIP is at __dlm_print_one_lock_resource+0x9db/0x9f0 [ocfs2_dlm]
Dec 16 08:43:31 web3 kernel: [10727560.846385] EAX: 0001 EBX: 001f ECX: EDX:
Dec 16 08:43:31 web3 kernel: [10727560.849779] ESI: f75e8c00 EDI: EBP: ec774700 ESP: df877d34
Dec 16 08:43:31 web3 kernel: [10727560.851900] DS: 007b ES: 007b FS: 00d8 GS: SS: 006a
Dec 16 08:43:31 web3 kernel: [10727560.906502] ---[ end trace 989a5ffd1351fea4 ]---
Dec 16 08:44:01 web3 kernel: [10727590.622434] o2net: connection to node deploy (num 5) at 192.168.102.12: has been idle for 30.0 seconds, shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.627319] (4,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.731225 now 1229438641.727360 dr 1229438613.731191 adv 1229438611.731227:1229438611.731228 func (a9b6ebe7:504) 1229438600.868142:1229438600.868149)
Dec 16 08:44:01 web3 kernel: [10727590.629281] o2net: connection to node app1 (num 6) at 192.168.102.10: has been idle for 30.0 seconds, shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.630630] (4,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.731486 now 1229438641.734226 dr 1229438634.811356 adv 1229438611.731488:1229438611.731489 func (a9b6ebe7:502) 1229438610.482837:1229438610.482839)
Dec 16 08:44:01 web3 kernel: [10727590.632818] o2net: connection to node rgapp1 (num 4) at 192.168.102.11: has been idle for 30.0 seconds, shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.634937] (4,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.736146 now 1229438641.737771 dr 1229438613.756472 adv 1229438611.736149:1229438611.736149 func (a9b6ebe7:503) 1229438611.735983:1229438611.735988)
Dec 16 08:44:01 web3 kernel: [10727590.640618] o2net: connection to node web1 (num 1) at 192.168.102.40: has been idle for 30.0 seconds, shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.642402] (4,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.742904 now 1229438641.745604 dr 1229438617.734942 adv 1229438611.742907:1229438611.742907 func (a9b6ebe7:504) 1229438611.675070:1229438611.675075)
Dec 16 08:44:01 web3 kernel: [10727590.651745] o2net: connection to node web2 (num 2) at 192.168.102.41: has been idle for 30.0 seconds, shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.657208] (0,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.756791 now 1229438641.756770 dr 1229438641.756769 adv 1229438611.756768:1229438611.756697 func (a9b6ebe7:507) 1229438611.756792:1229438611.746230)

The other nodes ended up locking up while waiting for the death notification of node 3. Can anyone tell me what the kernel message above means, and what I can do to keep this from occurring again?

Thanks,
David
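The o2net_idle_timer lines above carry raw timestamps that can be checked against the reported "idle for 30.0 seconds". The following is a minimal sketch, assuming that "tmr" is the time the idle timer was last armed and "now" is the time it fired; that reading of the fields, the regular expression, and the helper name idle_gaps are assumptions made for illustration.

```python
import re
from decimal import Decimal

# Pulls the "tmr" and "now" fields out of an o2net_idle_timer debug line.
# Field meanings are assumed: tmr = timer armed, now = timer fired.
IDLE_RE = re.compile(
    r"o2net_idle_timer:\d+ .*?\(tmr (?P<tmr>\d+\.\d+) now (?P<now>\d+\.\d+)"
)


def idle_gaps(log_text):
    """Return (now - tmr) in seconds for every idle-timer line, in log order."""
    # Decimal keeps the microsecond digits exact; floats would round them.
    return [
        Decimal(m.group("now")) - Decimal(m.group("tmr"))
        for m in IDLE_RE.finditer(log_text)
    ]


sample = """
Dec 16 08:44:01 web3 kernel: [10727590.627319] (4,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.731225 now 1229438641.727360 dr 1229438613.731191 adv 1229438611.731227:1229438611.731228 func (a9b6ebe7:504) 1229438600.868142:1229438600.868149)
Dec 16 08:44:01 web3 kernel: [10727590.657208] (0,0):o2net_idle_timer:1414 here are some times that might help debug the situation: (tmr 1229438611.756791 now 1229438641.756770 dr 1229438641.756769 adv 1229438611.756768:1229438611.756697 func (a9b6ebe7:507) 1229438611.756792:1229438611.746230)
"""

for gap in idle_gaps(sample):
    print(f"idle for {gap} s")
```

Under that assumption, both sample lines come out just under 30 seconds, which agrees with the timeout reported in the log.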
[Ocfs2-users] ESX 3.5 DRS and OCFS2 1.4.1-1
We are getting:

Dec 4 17:19:41 web2 kernel: [9724159.177875] EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Dec 4 17:19:41 web2 kernel: [9724159.463691] VMware hgfs: HGFS is disabled in the host
Dec 4 17:19:41 web2 kernel: [9724160.965637] OCFS2 Node Manager 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.033122] OCFS2 DLM 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.037686] OCFS2 DLMFS 1.3.3
Dec 4 17:19:41 web2 kernel: [9724161.038842] OCFS2 User DLM kernel interface loaded
Dec 4 17:19:41 web2 kernel: [9724171.616652] o2net: accepted connection from node rgapp1 (num 4) at 192.168.102.11:
Dec 4 17:19:41 web2 kernel: [9724171.722162] OCFS2 1.3.3
Dec 4 17:19:41 web2 kernel: [9724171.782112] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2
Dec 4 17:19:41 web2 kernel: [9724171.782345] ocfs2_dlm: Node 4 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:41 web2 kernel: [9724171.782348] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4
Dec 4 17:19:41 web2 kernel: [9724171.782758] (4262,0):ocfs2_find_slot:268 slot 2 is already allocated to this node!
Dec 4 17:19:41 web2 kernel: [9724171.841264] (4262,0):ocfs2_check_volume:1662 File system was not unmounted cleanly, recovering volume.
Dec 4 17:19:41 web2 kernel: [9724171.841830] kjournald starting. Commit interval 5 seconds
Dec 4 17:19:41 web2 kernel: [9724171.880229] ocfs2: Mounting device (8,17) on (node 2, slot 2) with ordered data mode.
Dec 4 17:19:43 web2 kernel: [9724175.991919] o2net: accepted connection from node app1 (num 6) at 192.168.102.10:
Dec 4 17:19:45 web2 kernel: [9724178.086781] VMware memory control driver initialized
Dec 4 17:19:46 web2 kernel: [9724178.235647] o2net: accepted connection from node deploy (num 5) at 192.168.102.12:
Dec 4 17:19:50 web2 kernel: [9724182.319762] ocfs2_dlm: Node 6 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:50 web2 kernel: [9724182.319773] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 6
Dec 4 17:19:50 web2 kernel: [9724182.598848] ocfs2_dlm: Node 5 joins domain 7D876A4B2EE14D0C8E1181E8DCF4237B
Dec 4 17:19:50 web2 kernel: [9724182.598853] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 5 6
Dec 4 17:21:32 web2 syslogd 1.5.0#1ubuntu1: restart.

This completely froze the entire cluster when ESX tried to VMotion 3 of the 6 nodes to a new host. Does Oracle recommend not enabling DRS on virtual machines that use the cluster, or is there a configuration we can use to keep crashes like this from happening all the time? I have seen several posts suggesting that disabling DRS works around this issue, but that is not really good practice, as you would lose a lot of your HA abilities.

Also, is there a way to have OCFS2 drop a node from the cluster if a new node comes online with its ID?

David Murphy
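To see which nodes had rejoined the DLM domain by the end of the log above, the "Nodes in domain" lines can be read back mechanically. This is a minimal sketch, assuming the log format shown in this message and assuming the cluster's configured node numbers are 1 through 6 (taken from the node lists in the other threads here); the names last_membership and CONFIGURED_NODES are hypothetical.

```python
import re

# Assumption: the six configured node numbers, as seen elsewhere in this archive.
CONFIGURED_NODES = {1, 2, 3, 4, 5, 6}

# Matches lines such as:
#   ocfs2_dlm: Nodes in domain (7D87...): 2 4 5 6
MEMBERS_RE = re.compile(
    r"ocfs2_dlm: Nodes in domain \((?P<domain>[0-9A-F]+)\):(?P<nodes>(?: \d+)+)"
)


def last_membership(log_text):
    """Return {domain: [node numbers]} from the last membership line per domain."""
    membership = {}
    for m in MEMBERS_RE.finditer(log_text):
        # Later lines overwrite earlier ones, so the final state wins.
        membership[m.group("domain")] = [int(n) for n in m.group("nodes").split()]
    return membership


sample = """
Dec 4 17:19:41 web2 kernel: [9724171.782112] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2
Dec 4 17:19:41 web2 kernel: [9724171.782348] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4
Dec 4 17:19:50 web2 kernel: [9724182.319773] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 6
Dec 4 17:19:50 web2 kernel: [9724182.598853] ocfs2_dlm: Nodes in domain (7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 5 6
"""

for domain, nodes in last_membership(sample).items():
    missing = sorted(CONFIGURED_NODES - set(nodes))
    print(f"{domain}: joined {nodes}, not yet joined {missing}")
```

The sketch only reports what the log already says; it does not explain why the remaining nodes had not joined.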