Re: [Ocfs2-users] Recommended settings for mkfs.ocfs2

2010-04-19 Thread David Murphy
Andrew,

I would make sure you use large cluster and block sizes if possible,
with the inline-data FS option enabled (you will lose some space, but I've
noticed it tends to run a bit better).
As for the bug, it's one I've been fighting with. However, with only 2 nodes
it will take a long time to occur (it took 2 years on my 6-node cluster).
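
As a rough illustration of the advice above, the following sketch assembles (but does not run) an mkfs.ocfs2 command line with a 4K block size, a large cluster size, and the inline-data feature. The sizes, slot count, label, and device path are assumptions chosen for illustration; none of them come from this thread.

```python
import shlex


def build_mkfs_command(device, block_size="4K", cluster_size="1M",
                       node_slots=2, label="filestore",
                       features=("sparse", "inline-data")):
    """Return an mkfs.ocfs2 argument list; nothing is executed here."""
    return [
        "mkfs.ocfs2",
        "-b", block_size,        # block size
        "-C", cluster_size,      # cluster size
        "-N", str(node_slots),   # number of node slots
        "-L", label,             # volume label (hypothetical)
        "--fs-features=" + ",".join(features),
        device,
    ]


if __name__ == "__main__":
    # /dev/sdX1 is a placeholder device, not one named in the thread.
    print(shlex.join(build_mkfs_command("/dev/sdX1")))
```

Printing the command instead of executing it leaves room to review the options first, since formatting destroys any existing data on the device.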


David

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Andrew Robert
Nicols
Sent: Monday, April 19, 2010 9:26 AM
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Recommended settings for mkfs.ocfs2

Hi Brian,

Thank you for taking the time to reply.

On Mon, Apr 19, 2010 at 08:53:47AM -0500, Brian Kroth wrote:
 lenny-backports has a 2.6.32 based kernel that might already have the 
 free space fix in it.  I haven't checked yet.

From what I can tell, the ENOSPC issue isn't fixed until 2.6.33
(http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189#c25) so the version in
backports (or even Squeeze) isn't much help yet I'm afraid.

 Also you don't really explain what you're trying to use the data store 
 for (eg: lots of small files, video files, heavy writes, heavy reads, 
 random, sequential, etc.).  It may impact the options you want to give 
 to mkfs.

Sorry - that didn't occur to me. This is going to be a file store for a
variety of user submitted data for a web application (Moodle). At present we
have a variety of:
* videos
* audio
* images
* database backups (gunzipped tar)
* large files (primarily zip)
* small files

The activity is primarily reads, with some writes, but I'm not sure of the exact
characteristics at present. I'd guess it is fairly random rather than sequential,
and there are periods of heavy writes.

Files are served to 6 frontend web servers over NFS for serving with
Apache2. We've currently got 2.2TB of space used.

Thank you for your input - if there's anything else which would be useful,
I'll see if I can provide it.

Andrew

--
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 04311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW


___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] OCFS2 1.4.7-1 and OCFS2 Tools 1.4.4-1 released

2010-04-19 Thread David Murphy
Sunil,

Any chance I can get a timeline on having a defrag tool to make
noncontiguous files become contiguous?


David

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Monday, April 19, 2010 1:08 PM
To: ocfs2-annou...@oss.oracle.com; ocfs2-users
Subject: [Ocfs2-users] OCFS2 1.4.7-1 and OCFS2 Tools 1.4.4-1 released

All,

We are pleased to announce the release of OCFS2 1.4.7-1 and OCFS2 Tools
1.4.4-1 for Oracle's and Red Hat's Enterprise Linux 5 Update 2 and higher.

Oracle's Unbreakable Linux Network users who are subscribing to the OCFS2
1.4 packages for Enterprise Linux 5 channel can upgrade to this release
by running up2date.
http://oss.oracle.com/pipermail/el-errata/2010-April/001438.html
http://oss.oracle.com/pipermail/el-errata/2010-April/001439.html

Red Hat's Enterprise Linux 5 users can download and install the relevant
file system and tools packages from oss.oracle.com.
http://oss.oracle.com/projects/ocfs2/files/RedHat/RHEL5/
http://oss.oracle.com/projects/ocfs2-tools/files/RedHat/RHEL5/

COMPATIBILITY

This release is fully compatible with earlier releases of OCFS2 1.4.
Users can upgrade their nodes to the new version in a rolling manner.

This release is on-disk compatible with OCFS2 1.2.x. Users can install
the software and mount the older volumes as-is. However, a rolling upgrade
from 1.2 to 1.4 will not work.

RECOMMENDATION

This is just to remind users to add the noatime mount option to the
mounts that hold the Oracle datafiles, redologs, archivelogs, voting file,
etc. This is for OCFS2 1.4 only.

WHAT'S CHANGED

This release includes mostly bug fixes.

The one new feature we've added is not much of one. It allows users
to change the fence method from the default of machine reset to panic.
This was requested by some developers who are interested in the vmcore
dump that is generated when a machine panics. So unless you want the
same, our recommendation would be for you to leave the fence method as-is.
Do note that the fence method of a node can be toggled between reset
and panic at anytime.

To view the current fence method, do:
# cat /sys/kernel/config/cluster/CLUSTER/fence_method 
reset

To change to panic, do:
# echo panic > /sys/kernel/config/cluster/CLUSTER/fence_method
# cat /sys/kernel/config/cluster/CLUSTER/fence_method 
panic
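
The same check and toggle can be sketched in Python. This is a minimal illustration, assuming the fence_method attribute lives under /sys/kernel/config/cluster/CLUSTER/ as in the commands above and accepts only the two values named in the announcement. The helper names are hypothetical.

```python
from pathlib import Path

VALID_METHODS = ("reset", "panic")


def fence_method_file(cluster, root="/sys/kernel/config/cluster"):
    """Path of the fence_method attribute for the named cluster."""
    return Path(root) / cluster / "fence_method"


def get_fence_method(path):
    """Read the current fence method from the given attribute file."""
    return Path(path).read_text().strip()


def set_fence_method(path, method):
    """Write a new fence method, rejecting anything but reset or panic."""
    if method not in VALID_METHODS:
        raise ValueError(f"fence method must be one of {VALID_METHODS}")
    Path(path).write_text(method + "\n")
```

Writing to configfs needs root privileges. The root parameter exists so the helpers can be exercised against an ordinary directory first.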

The bug fixes can be classified under three groupings. The first group
involves cluster locking. Specifically in the area of downconverting
cluster locks. The links below explain two of the more interesting
problems. Our thanks to David Teigland of Red Hat for helping us fix
these problems.
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=e8ef96c444326e4262fd371729e7beebda1af4d1
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=39febfd5ee7948c018b667e0b909886e1cfa1235

The second group of bug fixes concern NFS support. This release fixes
a nfsd lockup issue and a stale inode read problem. Again, the links
below describe the problems in detail.
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=2d561d3636c80af24063a74ae8c817661c574d78
http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=commit;h=aa20775d1e7feba9b22e761fa9b69bd5c3f043bd

The last group of fixes concerns users encountering erroneous out-of-space
errors. Our analysis found that the errors were triggered because the file
system could not grow the extent block allocator because of free space
fragmentation. The extent block allocator houses the extent blocks that are
used when an inode needs more than approx 250 extents to describe a file.
So the way this plays out is that, in the early going, when free space is
contiguous, the inodes rarely use the extent blocks. They start getting
used just when the free space is fragmented enough that the extent allocator
cannot be grown.

The sad part is that the space required by this allocator is typically very
small. So small that there was no reason we could not allocate it up front.
In this release, the format tool, mkfs.ocfs2, reserves up to 0.3% of the volume
for this allocator. Furthermore, if the file system finds that this allocator
cannot be grown, it now can steal free blocks from another slot's allocator.
The first fix will help newly formatted volumes. The second fix will also
help existing volumes.
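
To put the 0.3% figure in perspective, this small sketch computes the upper bound of that reservation for a few volume sizes. The sizes are illustrative assumptions, and the function only restates the "up to 0.3%" limit quoted above. It does not model how mkfs.ocfs2 actually sizes the allocator.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def max_extent_allocator_reserve(volume_bytes):
    """Upper bound (0.3% of the volume) in bytes, using integer arithmetic."""
    return volume_bytes * 3 // 1000


if __name__ == "__main__":
    for size_tib in (1, 2, 100):
        reserve = max_extent_allocator_reserve(size_tib * TIB)
        print(f"{size_tib:>4} TiB volume: up to {reserve / GIB:.1f} GiB reserved")
```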

The final fix for this problem will be provided in the next patch update
(1.4.8). In it, we will allow the block allocators (inode and extent) to be
grown even when a 4MB contiguous chunk is not available. Users will be
able to activate this feature (discontiguous block groups) on existing
volumes. This feature is currently in testing.

BUGS FIXED

ossbz#970  Unfair postponement of local lock requests (Livelock)
ossbz#1175 BUG in dlm_free_dead_locks() (Oops during dlm recovery)
ossbz#1178 BUG in ocfs2_prepare_downconvert() (Oops during downconvert)
ossbz#1189 Free space trouble in a ocfs2 partition 

Re: [Ocfs2-users] new ocfs2 release?

2010-04-16 Thread David Murphy
Sunil,

Will this bug be corrected? I think it is what I'm running into:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1189


I see it was committed to the 1.4 branch, but I'm not sure if it has been
merged into 1.4.7.

David

-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com] 
Sent: Thursday, April 15, 2010 1:46 PM
To: David Murphy
Cc: li...@svrinformatica.it; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] new ocfs2 release?

The announcement will cover this topic.

David Murphy wrote:
 Sunil,

 Is there an online/offline defrag in the 1.4.7 version? We are having some
 out-of-space issues due to fragmentation. I am preparing to move to a new
 partition that will give me sparse and inline functions, but I feel this
 could become an issue again.

 David

 -Original Message-
 From: ocfs2-users-boun...@oss.oracle.com
 [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
 Sent: Thursday, April 15, 2010 12:14 PM
 To: li...@svrinformatica.it
 Cc: ocfs2-users@oss.oracle.com
 Subject: Re: [Ocfs2-users] new ocfs2 release?

 We are hoping to release it anyday now.

 Have you filed a bug about your issue? I have no recollection of any
 reports of such an issue. Orphan scanning has not changed in 1.4.7.

 File a bz. We'll need to get more information to understand the
 problem you are encountering.

 Mailing List SVR wrote:
   
 Hi ocfs2 developers,

 there are some news about the schedule for a new ocfs2 release that 
 solve the

 actual bug/limitations? I can see an 1.4.7 release tagged here:

 http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=summary

 Is there a planned release date?

 in my environment (about 500 files with 30 new files/deletion 
 per day) I see load until 1000 and I/O almost blocked for some hours 
 of the day I think this is caused by

 The file system now scans the orphan directory at a regular interval 
 to delete orphaned files that are no longer in use

 is this behaviuor still present in the 1.4.7 tag?

 thanks

 Nicola

 




Re: [Ocfs2-users] new ocfs2 release?

2010-04-15 Thread David Murphy
Sunil,

Is there an online/offline defrag in the 1.4.7 version? We are having some
out-of-space issues due to fragmentation. I am preparing to move to a new
partition that will give me sparse and inline functions, but I feel this
could become an issue again.

David

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Thursday, April 15, 2010 12:14 PM
To: li...@svrinformatica.it
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] new ocfs2 release?

We are hoping to release it anyday now.

Have you filed a bug about your issue? I have no recollection of any
reports of such an issue. Orphan scanning has not changed in 1.4.7.

File a bz. We'll need to get more information to understand the
problem you are encountering.

Mailing List SVR wrote:

 Hi ocfs2 developers,

 there are some news about the schedule for a new ocfs2 release that 
 solve the

 actual bug/limitations? I can see an 1.4.7 release tagged here:

 http://oss.oracle.com/git/?p=ocfs2-1.4.git;a=summary

 Is there a planned release date?

 in my environment (about 500 files with 30 new files/deletion 
 per day) I see load until 1000 and I/O almost blocked for some hours 
 of the day I think this is caused by

 The file system now scans the orphan directory at a regular interval 
 to delete orphaned files that are no longer in use

 is this behaviuor still present in the 1.4.7 tag?

 thanks

 Nicola





Re: [Ocfs2-users] Best Linux distribution for OCFS2?

2010-04-15 Thread David Murphy
CentOS and Fedora both work well with OCFS2; just be aware that CentOS/RHEL are
on a seriously outdated kernel (2.6.18 vs 2.6.30), so the OCFS2 module is an
added kmod.

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sérgio Surkamp
Sent: Thursday, April 15, 2010 4:55 PM
To: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Best Linux distribution for OCFS2?

Watch out for SLES11 as it needs extra package license to use OCFS2.
http://forums.novell.com/novell-product-support-forums/suse-linux-enterprise-server-sles/sles-configure-administer/366627-sles-11-ocfs2.html

What about CentOS with stock Oracle OCFS2 packages?

Regards,
Sérgio

On Thu, 15 Apr 2010 10:34:52 -0700
Patrick J. LoPresti lopre...@gmail.com wrote:

 OK, I realize this is a loaded question, but I really am interested in
 some feedback.
 
 I am preparing to create a new OCFS2 cluster -- several of them,
 actually -- and I have the luxury of choosing my Linux distribution.
 I am agnostic on this, save for a slight bias against Fedora Core
 (and, by implication, Red Hat) due to some bad experiences a few years
 ago.
 
 My current short list of options reads:
 
   Ubuntu Lucid Lynx
   Suse Linux Enterprise Server 11
   OpenSuse 11.2 or 11.3
 
 Although I have my choice of distributions now, and I have a couple of
 months to prototype, once the choice is made I will be stuck
 supporting the configuration for years; hardware and O/S changes will
 be costly.  So I want to get this right.
 
 I have been reading this mailing list for a while, and it sounds like
 OCFS2 has had some fairly serious bugs fixed just in the last few
 weeks and months (e.g., ENOSPC when there is plenty of space).  Which
 distribution, if any, has incorporated these fixes?  Which would be
 most likely to provide such fixes in the future?  I am also curious to
 hear success stories, failure stories, advocacy, warnings...  Feel
 free to reply to me personally if you do not want to spam the list,
 and I will post a summary.
 
 Possibly relevant other technologies I intend to use:
 
   iSCSI over 10GigE
   Linux md software RAID-0 (my iSCSI hardware RAID units already
 provide redundancy)
 
 My configuration will be storing 100+ terabytes on a single partition.
  (Sounds crazy, perhaps?  My application is a little... special.)
 
 Thanks!
 
  - Pat
 


-- 
  .:':.
.:'` Sérgio Surkamp | Gerente de Rede
::   ser...@gruposinternet.com.br
`:..:'
  `:,   ,.:' *Grupos Internet S.A.*
`: :'R. Lauro Linhares, 2123 Torre B - Sala 201
 : : Trindade - Florianópolis - SC
 :.'
 ::  +55 48 3234-4109
 :
 '   http://www.gruposinternet.com.br



Re: [Ocfs2-users] Odd error on FC12 with ocfs2

2010-03-30 Thread David Murphy
[r...@web1 /dev]# debugfs.ocfs2  -l TCP off /dev/mapper/OCFS2_200Gp1 
[r...@web1 /dev]# mount /dev/mapper/OCFS2_200Gp1  -v
device=/dev/mapper/OCFS2_200Gp1
mount.ocfs2: Transport endpoint is not connected while mounting
/dev/mapper/OCFS2_200Gp1 on /mnt/appshare. Check 'dmesg' for more
information on this error.
[r...@web1 /dev]#dmesg

DMESG:
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 2 after 30.0 seconds, giving up
and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 3 after 30.0 seconds, giving up
and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 4 after 30.0 seconds, giving up
and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 5 after 30.0 seconds, giving up
and returning errors.
Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 6 after 30.0 seconds, giving up
and returning errors.
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_request_join:1035 ERROR:
status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_try_to_join_domain:1209
ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_join_domain:1487 ERROR:
status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):dlm_register_domain:1753
ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):o2cb_cluster_connect:313
ERROR: status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_dlm_init:2963 ERROR:
status = -107
Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_mount_volume:1788 ERROR:
status = -107
Mar 30 10:23:38 web1 kernel: ocfs2: Unmounting device (253,1) on
(node 0)
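
The "status = -107" values in the log above are negated errno codes. The sketch below pulls the function name and code out of such a line and looks the code up in a small, hand-written table of Linux errno values. The table is an assumption limited to the two codes discussed on this list; on Linux, 107 is ENOTCONN, which matches the mount.ocfs2 message shown earlier.

```python
import re

# Small, Linux-specific table; errno numbers differ on other platforms.
LINUX_ERRNO = {
    28: ("ENOSPC", "No space left on device"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches e.g. "(1740,0):dlm_request_join:1035 ERROR: status = -107"
STATUS_RE = re.compile(r"\):(\w+):\d+\s+ERROR:\s+status = -(\d+)")


def decode_status(line):
    """Return (function, errno_name, description), or None if no match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    func, code = match.group(1), int(match.group(2))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return func, name, text


if __name__ == "__main__":
    sample = ("Mar 30 10:23:38 web1 kernel: "
              "(1740,0):dlm_request_join:1035 ERROR: status = -107")
    print(decode_status(sample))
```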

DEBUGFS:
debugfs: curdev
/dev/mapper/OCFS2_200Gp1
debugfs: controld dump
controld: Unable to access cluster service while obtaining the debug
buffer
debugfs: slotmap
Slot#   Node#
0   3
1   5
2   2
4   4
5   6
debugfs: stats
Revision: 0.90
Mount Count: 0   Max Mount Count: 20
State: 0   Errors: 0
Check Interval: 0   Last Check: Mon Mar 29 10:53:52 2010
Creator OS: 0
Feature Compat: 1 backup-super
Feature Incompat: 16 sparse
Tunefs Incomplete: 0 
Feature RO compat: 1 unwritten
Root Blknum: 5   System Dir Blknum: 6
First Cluster Group Blknum: 3
Block Size Bits: 12   Cluster Size Bits: 12
Max Node Slots: 6
Extended Attributes Inline Size: 0
Label: OCFS2_APPSHARE_200G
UUID: D6E0DD0AAC8844ED94A4A459FBB6F7FF
UUID_hash: 0 (0x0)
Cluster stack: classic o2cb
Inode: 2   Mode: 00   Generation: 2428834932 (0x90c51474)
FS Generation: 2428834932 (0x90c51474)
CRC32:    ECC: 
Type: Unknown   Attr: 0x0   Flags: Valid System Superblock 
Dynamic Features: (0x0) 
User: 0 (root)   Group: 0 (root)   Size: 0
Links: 0   Clusters: 52428119
ctime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
atime: 0x0 -- Wed Dec 31 18:00:00 1969
mtime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
dtime: 0x0 -- Wed Dec 31 18:00:00 1969
ctime_nsec: 0x -- 0
atime_nsec: 0x -- 0
mtime_nsec: 0x -- 0
Last Extblk: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 65535



It doesn't appear that any extra debug logging was actually created.

David
-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Monday, March 29, 2010 10:23 PM
To: Angelo McComis
Cc: David Murphy; ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

No

On Mar 29, 2010, at 8:10 PM, Angelo McComis ang...@mccomis.com wrote:

 Does it matter that the nodes are numbered 1-6 instead of 0-5?



 On Mon, Mar 29, 2010 at 4:25 PM, Sunil Mushran 
 sunil.mush...@oracle.com
  wrote:
 Enable some debugging.

 #debugfs.ocfs2 -l TCP allow
 ...do mount...
 #debugfs.ocfs2 -l TCP off


 David Murphy wrote:
 [r...@web2 ~]# nc -z  192.168.102.140  Connection to 
 192.168.102.140  port [tcp/cbt] succeeded!

 [r...@web1 /etc/sysconfig/network-scripts]# nc -z  192.168.102.141
  Connection to 192.168.102.141  port [tcp/cbt] succeeded!

 -Original Message-
 From: Sunil Mushran [mailto:sunil.mush...@oracle.com

Re: [Ocfs2-users] Odd error on FC12 with ocfs2

2010-03-29 Thread David Murphy
Maybe I misspoke then. At any rate:

The machine clearly has working networking to each of the other nodes, but
the node thinks it can't talk to the rest of the cluster, so it won't join
the cluster. However, nmap/telnet clearly show that it can in fact talk to
those devices on the correct port, and all the other devices are active and
talking to each other.


This is with iptables and IPv6 iptables disabled and SELinux in disabled
mode.

David Murphy

-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com] 
Sent: Thursday, March 25, 2010 4:46 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

hmm.. o2cb_ctl makes no connections. It just reads the cluster.conf
and populates configfs. AFAIK.

David Murphy wrote:

 We had  6 nodes running CentOS 5.4 using  1.4.3 ocfs2-tools.

  

 I decided to rebuild one node with FC12.

  

  

 Which is working fine, however

  

 Nmap 192.168.200.112  shows  as open

 And  

  

 O2cb_ctl is  timing out when trying to connect to that node which then 
 causes a 107 error. This happens with all node and all node have  
 open  via nmap from the FC machine.

  

  

 Is there a way to further debug this to see what exactly  o2cb_ctl is 
 seeing when trying to connect?

  

  

 David

 



Re: [Ocfs2-users] Odd error on FC12 with ocfs2

2010-03-29 Thread David Murphy
[r...@web2 ~]# nc -z  192.168.102.140 
Connection to 192.168.102.140  port [tcp/cbt] succeeded!

[r...@web1 /etc/sysconfig/network-scripts]# nc -z  192.168.102.141 
Connection to 192.168.102.141  port [tcp/cbt] succeeded!

-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com] 
Sent: Monday, March 29, 2010 5:08 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

What happens when you use netcat to ping the node?
nc -z host.example.com 
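
For anyone without netcat installed, a roughly equivalent reachability check can be sketched in Python. The function name and the command-line usage are assumptions for illustration; the host and port are supplied by the caller, for example the ip_address and ip_port of each node in cluster.conf.

```python
import socket
import sys


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (like nc -z)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python check_port.py HOST PORT
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "open" if tcp_port_open(host, port) else "unreachable"
        print(f"{host}:{port} {state}")
```

A successful connect only shows that the port is reachable at the TCP level; it does not show that the cluster stack on the other end accepts the connection.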

David Murphy wrote:
 Some additional data:
 From Web1 ( New Fedora Machine) to Web2:
   [r...@web1 /etc/sysconfig/network-scripts]# nmap 192.168.102.141

   Starting Nmap 5.21 ( http://nmap.org ) at 2010-03-29 16:56 CDT
   Nmap scan report for 192.168.102.141
   Host is up (0.76s latency).
   Not shown: 993 closed ports
   PORT STATE SERVICE
   22/tcp   open  ssh
   80/tcp   open  http
   81/tcp   open  hosts2-ns
   111/tcp  open  rpcbind
   5666/tcp open  nrpe
   /tcp open  unknown
   9102/tcp open  jetdirect
   MAC Address: 00:50:56:A3:58:5D (VMware)
   
   Nmap done: 1 IP address (1 host up) scanned in 1.18 seconds


 From   web2 - web1 (new fedora machine)
   [r...@web2 ~]# nmap 192.168.102.140
   
   Starting Nmap 5.00 ( http://nmap.org ) at 2010-03-29 16:40 CDT
   Interesting ports on 192.168.102.140:
   Not shown: 994 closed ports
   PORT STATE SERVICE
   22/tcp   open  ssh
   80/tcp   open  http
   81/tcp   open  hosts2-ns
   111/tcp  open  rpcbind
   443/tcp  open  https
   /tcp open  unknown
   MAC Address: 00:50:56:A3:14:62 (VMWare)

   Nmap done: 1 IP address (1 host up) scanned in 1.31 seconds


 Cluster.conf:
   cluster:
   node_count = 6
   name = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.140
   number = 1
   name = web1
   cluster = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.141
   number = 2
   name = web2
   cluster = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.142
   number = 3
   name = web3
   cluster = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.111
   number = 4
   name = rgapp1
   cluster = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.122
   number = 5
   name = deploy
   cluster = appshare
   
   node:
   ip_port = 
   ip_address = 192.168.102.112
   number = 6
   name = app1
   cluster = appshare

 DMESG on WEB1:
   OCFS2 1.5.0
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 2 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 3 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 4 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 5 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 6 after 30.0 seconds, giving up and returning errors.
   (1262,0):dlm_request_join:1035 ERROR: status = -107
   (1262,0):dlm_try_to_join_domain:1209 ERROR: status = -107
   (1262,0):dlm_join_domain:1487 ERROR: status = -107
   (1262,0):dlm_register_domain:1753 ERROR: status = -107
   (1262,0):o2cb_cluster_connect:313 ERROR: status = -107
   (1262,0):ocfs2_dlm_init:2963 ERROR: status = -107
   (1262,0):ocfs2_mount_volume:1788 ERROR: status = -107
   ocfs2: Unmounting device (253,1) on (node 0)
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 2 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 3 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 5 after 30.0 seconds, giving up and returning errors.
   (1199,0):o2net_connect_expired:1656 ERROR: no connection established
 with node 6 after 30.0 seconds, giving up and returning errors.
   (1323,0):dlm_request_join:1035 ERROR: status = -107
   (1323,0):dlm_try_to_join_domain:1209 ERROR: status = -107
   (1323,0):dlm_join_domain:1487 ERROR: status = -107
   (1323,0):dlm_register_domain

Re: [Ocfs2-users] Odd error on FC12 with ocfs2

2010-03-29 Thread David Murphy
 1.5.0
OCFS2 DLM 1.5.0
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.5.0
OCFS2 User DLM kernel interface loaded
OCFS2 1.5.0
(1810,0):o2net_connect_expired:1656 ERROR: no connection established
with node 4 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established
with node 5 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established
with node 6 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established
with node 2 after 30.0 seconds, giving up and returning errors.
(1810,0):o2net_connect_expired:1656 ERROR: no connection established
with node 3 after 30.0 seconds, giving up and returning errors.
(1839,0):dlm_request_join:1035 ERROR: status = -107
(1839,0):dlm_try_to_join_domain:1209 ERROR: status = -107
(1839,0):dlm_join_domain:1487 ERROR: status = -107
(1839,0):dlm_register_domain:1753 ERROR: status = -107
(1839,0):o2cb_cluster_connect:313 ERROR: status = -107
(1839,0):ocfs2_dlm_init:2963 ERROR: status = -107
(1839,0):ocfs2_mount_volume:1788 ERROR: status = -107
ocfs2: Unmounting device (253,1) on (node 0)



So clearly the ocfs2 service thinks it can't connect to the nodes, but nmap
sees the connection just fine. And Web2 can see the port on web1 just fine,
so there is no firewall blocking the connections.

I think it might be that Fedora 12 uses 1.5.0 for the OCFS2 kernel module and
CentOS 5.3/5.4 use 1.4.4-1. Am I correct in thinking this?

David
-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
Sent: Thursday, March 25, 2010 6:46 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2

hmm.. o2cb_ctl makes no connections. It just reads the cluster.conf and
populates configfs. AFAIK.

David Murphy wrote:

 We had  6 nodes running CentOS 5.4 using  1.4.3 ocfs2-tools.

  

 I decided to rebuild one node with FC12.

  

  

 Which is working fine, however

  

 Nmap 192.168.200.112  shows  as open

 And

  

 O2cb_ctl is  timing out when trying to connect to that node which then 
 causes a 107 error. This happens with all node and all node have  
 open  via nmap from the FC machine.

  

  

 Is there a way to further debug this to see what exactly  o2cb_ctl is 
 seeing when trying to connect?

  

  

 David

 --
 --



[Ocfs2-users] Odd error on FC12 with ocfs2

2010-03-25 Thread David Murphy
We had 6 nodes running CentOS 5.4 using 1.4.3 ocfs2-tools.

I decided to rebuild one node with FC12.

Which is working fine; however,

Nmap 192.168.200.112  shows  as open

And

O2cb_ctl is timing out when trying to connect to that node, which then
causes a 107 error. This happens with all nodes, and all nodes have  open
via nmap from the FC machine.

Is there a way to further debug this to see what exactly o2cb_ctl is seeing
when trying to connect?

David


Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time

2009-10-21 Thread David Murphy
I think I found the core issue.

The DLM on CentOS is running 1.4.1, but on Ubuntu it's 1.3.3. I can't seem to
find any packages for Debian or Ubuntu that upgrade the kernel modules to the
1.4 series. Does anyone know how I can do this?

David

 

 

From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of David Murphy
Sent: Wednesday, October 21, 2009 10:46 AM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the
same time

 

 

Web2 is our Ubuntu node and Web1 is a new CentOS 5.3 node. I put 1.4.1-1
on CentOS to match the one on the Ubuntu nodes. I also copied the o2cb
configs from the Ubuntu node to the CentOS one. O2CB starts just fine, but
I get these errors when I try to start the ocfs2 service or mount the
partition:

 

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is required,
disconnecting

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is required,
disconnecting

 

 

 

Does anyone have any ideas what's going on? Full  Dmesg, rpm, dpkg output is
below:

 

 

OCFS2 Node Manager 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
304d9ff0c301f79f846e3cc423c30674)

OCFS2 DLM 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
96988c7961cf38309cc33396bb27b400)

OCFS2 DLMFS 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
96988c7961cf38309cc33396bb27b400)

OCFS2 User DLM kernel interface loaded

(6203,0):o2net_check_handshake:1205 node web2 (num 2) at 192.168.102.41:
advertised net protocol version 8 but 11 is required, disconnecting

(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42:
advertised net protocol version 8 but 11 is required, disconnecting

(6203,0):o2net_check_handshake:1205 node web3 (num 3) at 192.168.102.42:
advertised net protocol version 8 but 11 is required, disconnecting

(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10:
advertised net protocol version 8 but 11 is required, disconnecting

(6203,0):o2net_check_handshake:1205 node app1 (num 6) at 192.168.102.10:
advertised net protocol version 8 but 11 is required, disconnecting

(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at
192.168.102.11: advertised net protocol version 8 but 11 is required,
disconnecting

(6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at
192.168.102.11: advertised net protocol version 8 but 11 is required,
disconnecting

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is required,
disconnecting

(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is required,
disconnecting

OCFS2 1.4.1 Wed Jan 21 11:39:13 PST 2009 (build
a1974724e90d3f07ae88531f6a9547a9)

(6240,0):dlm_request_join:1033 ERROR: status = -107

(6240,0):dlm_try_to_join_domain:1207 ERROR: status = -107

(6240,0):dlm_join_domain:1485 ERROR: status = -107

(6240,0):dlm_register_domain:1732 ERROR: status = -107

(6240,0):ocfs2_dlm_init:2662 ERROR: status = -107

(6240,0):ocfs2_mount_volume:1251 ERROR: status = -107

ocfs2: Unmounting device (8,129) on (node 1)
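
The handshake errors above all report the same pair of versions. A short sketch like the following can summarize which nodes advertise which o2net protocol version in a pasted dmesg dump. The regular expression is an assumption based only on the message format shown in this log.

```python
import re

HANDSHAKE_RE = re.compile(
    r"node (\S+) \(num (\d+)\) at (\S+):\s+"
    r"advertised net protocol version (\d+) but (\d+) is required"
)


def summarize_handshakes(log_text):
    """Map node name -> (advertised, required) protocol versions."""
    # Collapse wrapped lines so a message split across lines still matches.
    flat = " ".join(log_text.split())
    result = {}
    for name, _num, _addr, advertised, required in HANDSHAKE_RE.findall(flat):
        result[name] = (int(advertised), int(required))
    return result


if __name__ == "__main__":
    sample = """(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is required,
disconnecting"""
    for node, (adv, req) in summarize_handshakes(sample).items():
        print(f"{node}: advertises {adv}, local node requires {req}")
```

Repeated messages for the same node collapse into a single entry, which makes it easier to see that every remote node reports the same older version.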

[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# rpm -qa | grep ocfs

ocfs2-tools-1.4.1-1.el5

ocfs2-2.6.18-128.el5-1.4.1-1.el5

[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# ssh web2 dpkg -l | grep ocfs

ii  ocfs2-tools  1.4.1-1  tools for managing OCFS2 cluster filesystems

[r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]#

 



Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at the same time

2009-10-21 Thread David Murphy
We are trying to port our current Ubuntu-based OCFS2 cluster to CentOS 5.2
(RHEL), but Ubuntu is using DLM v. 1.3.9, not 1.4.1, even though its tools
are on 1.4.1, so I am getting a network version mismatch. Is there any way
to upgrade the DLM? I have tried updating the kernels. If I upgrade from
Ubuntu 8.04 to 9.04, it says the DLM version is 1.50, not 1.41, which
further confuses me. Basically, I need a temporary mixed environment to
transition nodes over to CentOS.

Is the DLM a kernel module, as I assume it is, or something that
ocfs2-tools should be upgrading once those are on 1.41+?
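
For anyone comparing nodes, the handshake lines in the dmesg output can be boiled down to a per-node summary of which o2net protocol version each peer advertised and which one the local node requires. This is only a rough sketch: the function name `summarize_handshakes` and the sample text are made up for illustration, and the pattern assumes the message wording shown in this thread.

```python
import re

# Pattern for the o2net_check_handshake message as it appears in this thread.
# The wording is taken from the logs quoted here; other kernel versions may differ.
HANDSHAKE = re.compile(
    r"node (?P<name>\S+) \(num (?P<num>\d+)\) at (?P<addr>[\d.]+):\s*"
    r"advertised net protocol version (?P<theirs>\d+) but (?P<ours>\d+) is required"
)


def summarize_handshakes(dmesg_text):
    """Return {node name: (advertised version, required version)}."""
    # Mail clients wrap the log lines, so collapse all whitespace first.
    flat = " ".join(dmesg_text.split())
    result = {}
    for match in HANDSHAKE.finditer(flat):
        result[match.group("name")] = (
            int(match.group("theirs")),
            int(match.group("ours")),
        )
    return result


# Two of the wrapped messages from the dmesg output, used as sample input.
sample = """
(6203,0):o2net_check_handshake:1205 node web2 (num 2) at
192.168.102.41: advertised net protocol version 8 but 11 is
required, disconnecting
(6203,0):o2net_check_handshake:1205 node deploy (num 5) at
192.168.102.12: advertised net protocol version 8 but 11 is
required, disconnecting
"""

for name, (theirs, ours) in sorted(summarize_handshakes(sample).items()):
    print(f"{name}: advertises protocol {theirs}, local node requires {ours}")
```

Every peer reporting the same pair of numbers suggests that the one new node is the odd one out, rather than a single misconfigured peer.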

David
-----Original Message-----
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Sunil Mushran
Sent: Wednesday, October 21, 2009 12:23 PM
To: David Murphy
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut at
the same time

The production release of ocfs2 (1.2, 1.4, and the upcoming 1.6) is only
available for (rh)el and sles. No other distributions.

David Murphy wrote:

 I think I found the core issue.

 The DLM on CentOS is running 1.4.1, but on Ubuntu it's 1.3.3. I can't
 seem to find any packages for Debian or Ubuntu that upgrade the kernel
 modules to the 1.4 series. Does anyone know how I can do this?

 David

 *From:* ocfs2-users-boun...@oss.oracle.com
 [mailto:ocfs2-users-boun...@oss.oracle.com] *On Behalf Of *David 
 Murphy
 *Sent:* Wednesday, October 21, 2009 10:46 AM
 *To:* ocfs2-users@oss.oracle.com
 *Subject:* [Ocfs2-users] Unable to mount cluster on CentOS and Ubunut 
 at the same time

 Web2 is our Ubuntu node and Web1 is a new CentOS 5.3 node. I put
 1.4.1-1 on CentOS to match the one on the Ubuntu nodes. I also copied
 the o2cb configs from the Ubuntu node to the CentOS one. O2CB starts
 just fine, but I get these errors when I try to start the ocfs2
 service or mount the partition:

 (6203,0):o2net_check_handshake:1205 node deploy (num 5) at
 192.168.102.12: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node deploy (num 5) at
 192.168.102.12: advertised net protocol version 8 but 11 is 
 required, disconnecting

 Does anyone have any ideas what's going on? Full Dmesg, rpm, dpkg 
 output is below:

 OCFS2 Node Manager 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
 304d9ff0c301f79f846e3cc423c30674)

 OCFS2 DLM 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
 96988c7961cf38309cc33396bb27b400)

 OCFS2 DLMFS 1.4.1 Wed Jan 21 11:39:16 PST 2009 (build
 96988c7961cf38309cc33396bb27b400)

 OCFS2 User DLM kernel interface loaded

 (6203,0):o2net_check_handshake:1205 node web2 (num 2) at
 192.168.102.41: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node web3 (num 3) at
 192.168.102.42: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node web3 (num 3) at
 192.168.102.42: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node app1 (num 6) at
 192.168.102.10: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node app1 (num 6) at
 192.168.102.10: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at
 192.168.102.11: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node rgapp1 (num 4) at
 192.168.102.11: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node deploy (num 5) at
 192.168.102.12: advertised net protocol version 8 but 11 is 
 required, disconnecting

 (6203,0):o2net_check_handshake:1205 node deploy (num 5) at
 192.168.102.12: advertised net protocol version 8 but 11 is 
 required, disconnecting

 OCFS2 1.4.1 Wed Jan 21 11:39:13 PST 2009 (build
 a1974724e90d3f07ae88531f6a9547a9)

 (6240,0):dlm_request_join:1033 ERROR: status = -107

 (6240,0):dlm_try_to_join_domain:1207 ERROR: status = -107

 (6240,0):dlm_join_domain:1485 ERROR: status = -107

 (6240,0):dlm_register_domain:1732 ERROR: status = -107

 (6240,0):ocfs2_dlm_init:2662 ERROR: status = -107

 (6240,0):ocfs2_mount_volume:1251 ERROR: status = -107

 ocfs2: Unmounting device (8,129) on (node 1)

 [r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# rpm -qa | grep 
 ocfs

 ocfs2-tools-1.4.1-1.el5

 ocfs2-2.6.18-128.el5-1.4.1-1.el5

 [r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]# ssh web2 dpkg -l
 | grep ocfs

 ii ocfs2-tools 1.4.1-1 tools for managing OCFS2 cluster filesystems

 [r...@web1 /opt/build-scripts/CoreUtils/ocfs_rpms]#


[Ocfs2-users] Unsual Segfault (but reboot did not occur and node stayed offline)

2008-12-16 Thread David Murphy
My logs on Node Id 3:


Dec 16 06:44:03 web3 syslogd 1.5.0#1ubuntu1: restart.
Dec 16 08:43:31 web3 kernel: [10727560.835261] Modules linked in: vmmemctl
ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs vmhgfs ext2
dm_round_robin crc32c libcrc32c iscsi_tcp libiscsi scsi_transport_iscsi lp
loop ipv6 parport_pc parport psmouse evdev serio_raw pcspkr i2c_piix4
i2c_core container ac button intel_agp agpgart dm_multipath dm_mod ext3 jbd
mbcache sr_mod cdrom sg sd_mod ata_piix pata_acpi floppy pcnet32 ata_generic
mii mptspi mptscsih mptbase scsi_transport_spi libata scsi_mod thermal
processor fan vmxnet vesafb fbcon tileblit font bitblit softcursor
Dec 16 08:43:31 web3 kernel: [10727560.843108] 
Dec 16 08:43:31 web3 kernel: [10727560.843900] Pid: 4856, comm: o2net Not
tainted (2.6.24-19-virtual #1)
Dec 16 08:43:31 web3 kernel: [10727560.844724] EIP: 0062:[f8e682bb]
EFLAGS: 00010202 CPU: 0
Dec 16 08:43:31 web3 kernel: [10727560.845566] EIP is at
__dlm_print_one_lock_resource+0x9db/0x9f0 [ocfs2_dlm]
Dec 16 08:43:31 web3 kernel: [10727560.846385] EAX: 0001 EBX: 001f
ECX:  EDX: 
Dec 16 08:43:31 web3 kernel: [10727560.849779] ESI: f75e8c00 EDI: 
EBP: ec774700 ESP: df877d34
Dec 16 08:43:31 web3 kernel: [10727560.851900]  DS: 007b ES: 007b FS: 00d8
GS:  SS: 006a
Dec 16 08:43:31 web3 kernel: [10727560.906502] ---[ end trace
989a5ffd1351fea4 ]---
Dec 16 08:44:01 web3 kernel: [10727590.622434] o2net: connection to node
deploy (num 5) at 192.168.102.12: has been idle for 30.0 seconds,
shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.627319] (4,0):o2net_idle_timer:1414
here are some times that might help debug the situation: (tmr
1229438611.731225 now 1229438641.727360 dr 1229438613.731191 adv
1229438611.731227:1229438611.731228 func (a9b6ebe7:504)
1229438600.868142:1229438600.868149)
Dec 16 08:44:01 web3 kernel: [10727590.629281] o2net: connection to node
app1 (num 6) at 192.168.102.10: has been idle for 30.0 seconds, shutting
it down.
Dec 16 08:44:01 web3 kernel: [10727590.630630] (4,0):o2net_idle_timer:1414
here are some times that might help debug the situation: (tmr
1229438611.731486 now 1229438641.734226 dr 1229438634.811356 adv
1229438611.731488:1229438611.731489 func (a9b6ebe7:502)
1229438610.482837:1229438610.482839)
Dec 16 08:44:01 web3 kernel: [10727590.632818] o2net: connection to node
rgapp1 (num 4) at 192.168.102.11: has been idle for 30.0 seconds,
shutting it down.
Dec 16 08:44:01 web3 kernel: [10727590.634937] (4,0):o2net_idle_timer:1414
here are some times that might help debug the situation: (tmr
1229438611.736146 now 1229438641.737771 dr 1229438613.756472 adv
1229438611.736149:1229438611.736149 func (a9b6ebe7:503)
1229438611.735983:1229438611.735988)
Dec 16 08:44:01 web3 kernel: [10727590.640618] o2net: connection to node
web1 (num 1) at 192.168.102.40: has been idle for 30.0 seconds, shutting
it down.
Dec 16 08:44:01 web3 kernel: [10727590.642402] (4,0):o2net_idle_timer:1414
here are some times that might help debug the situation: (tmr
1229438611.742904 now 1229438641.745604 dr 1229438617.734942 adv
1229438611.742907:1229438611.742907 func (a9b6ebe7:504)
1229438611.675070:1229438611.675075)
Dec 16 08:44:01 web3 kernel: [10727590.651745] o2net: connection to node
web2 (num 2) at 192.168.102.41: has been idle for 30.0 seconds, shutting
it down.
Dec 16 08:44:01 web3 kernel: [10727590.657208] (0,0):o2net_idle_timer:1414
here are some times that might help debug the situation: (tmr
1229438611.756791 now 1229438641.756770 dr 1229438641.756769 adv
1229438611.756768:1229438611.756697 func (a9b6ebe7:507)
1229438611.756792:1229438611.746230)



The other nodes ended up locking up, waiting for the death notification of
node 3.
Can anyone tell me what the kernel message above means and what I can do to
keep this from occurring again?
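
As a rough aid for reading the o2net_idle_timer lines, the timestamps in them can be subtracted to see how long each connection had been quiet. The sketch below assumes that `tmr` is the time the idle timer was last armed and `dr` is the last time data arrived on the socket; those meanings are my reading of the field names, not something stated in the log, and the function name `idle_gaps` is made up for illustration.

```python
import re

# Pull the first three timestamps out of an o2net_idle_timer debug line.
# Field meanings (tmr = timer armed, dr = last data ready) are assumptions.
IDLE = re.compile(
    r"tmr (?P<tmr>\d+\.\d+) now (?P<now>\d+\.\d+) dr (?P<dr>\d+\.\d+)"
)


def idle_gaps(line):
    """Return the seconds elapsed since tmr and since dr, or None if no match."""
    flat = " ".join(line.split())  # undo mail line wrapping
    match = IDLE.search(flat)
    if match is None:
        return None
    tmr, now, dr = (float(match.group(k)) for k in ("tmr", "now", "dr"))
    return {"since_timer": now - tmr, "since_data_ready": now - dr}


# The entry for node deploy (num 5) from the log above.
sample = """
(4,0):o2net_idle_timer:1414 here are some times that might help debug the
situation: (tmr 1229438611.731225 now 1229438641.727360 dr 1229438613.731191
adv 1229438611.731227:1229438611.731228 func (a9b6ebe7:504)
1229438600.868142:1229438600.868149)
"""

gaps = idle_gaps(sample)
print(f"since timer armed: {gaps['since_timer']:.1f}s")
print(f"since last data:   {gaps['since_data_ready']:.1f}s")
```

For this entry, the gap between `tmr` and `now` comes out at roughly 30 seconds, which matches the "idle for 30.0 seconds" message that precedes it.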


Thanks
David




[Ocfs2-users] ESX 3.5 DRS and OCFS2 1.4.1-1

2008-12-04 Thread David Murphy
We are getting:

 

Dec  4 17:19:41 web2 kernel: [9724159.177875] EXT2-fs warning: mounting
unchecked fs, running e2fsck is recommended

Dec  4 17:19:41 web2 kernel: [9724159.463691] VMware hgfs: HGFS is disabled
in the host

Dec  4 17:19:41 web2 kernel: [9724160.965637] OCFS2 Node Manager 1.3.3

Dec  4 17:19:41 web2 kernel: [9724161.033122] OCFS2 DLM 1.3.3

Dec  4 17:19:41 web2 kernel: [9724161.037686] OCFS2 DLMFS 1.3.3

Dec  4 17:19:41 web2 kernel: [9724161.038842] OCFS2 User DLM kernel
interface loaded

Dec  4 17:19:41 web2 kernel: [9724171.616652] o2net: accepted connection
from node rgapp1 (num 4) at 192.168.102.11:

Dec  4 17:19:41 web2 kernel: [9724171.722162] OCFS2 1.3.3

Dec  4 17:19:41 web2 kernel: [9724171.782112] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 

Dec  4 17:19:41 web2 kernel: [9724171.782345] ocfs2_dlm: Node 4 joins domain
7D876A4B2EE14D0C8E1181E8DCF4237B

Dec  4 17:19:41 web2 kernel: [9724171.782348] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 

Dec  4 17:19:41 web2 kernel: [9724171.782758] (4262,0):ocfs2_find_slot:268
slot 2 is already allocated to this node!

Dec  4 17:19:41 web2 kernel: [9724171.841264]
(4262,0):ocfs2_check_volume:1662 File system was not unmounted cleanly,
recovering volume.

Dec  4 17:19:41 web2 kernel: [9724171.841830] kjournald starting.  Commit
interval 5 seconds

Dec  4 17:19:41 web2 kernel: [9724171.880229] ocfs2: Mounting device (8,17)
on (node 2, slot 2) with ordered data mode.

Dec  4 17:19:43 web2 kernel: [9724175.991919] o2net: accepted connection
from node app1 (num 6) at 192.168.102.10:

Dec  4 17:19:45 web2 kernel: [9724178.086781] VMware memory control driver
initialized

Dec  4 17:19:46 web2 kernel: [9724178.235647] o2net: accepted connection
from node deploy (num 5) at 192.168.102.12:

Dec  4 17:19:50 web2 kernel: [9724182.319762] ocfs2_dlm: Node 6 joins domain
7D876A4B2EE14D0C8E1181E8DCF4237B

Dec  4 17:19:50 web2 kernel: [9724182.319773] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 6 

Dec  4 17:19:50 web2 kernel: [9724182.598848] ocfs2_dlm: Node 5 joins domain
7D876A4B2EE14D0C8E1181E8DCF4237B

Dec  4 17:19:50 web2 kernel: [9724182.598853] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 5 6 

Dec  4 17:21:32 web2 syslogd 1.5.0#1ubuntu1: restart.

This completely froze the entire cluster when ESX tried to VMotion 3 of 6
nodes to a new host.
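
To see which nodes web2 considered part of the DLM domain before the restart, the "Nodes in domain" lines from the log above can be reduced to a membership history. This is only an illustrative sketch: the function name `membership_history` is made up, and the pattern assumes the message format shown in this log.

```python
import re

# Matches "ocfs2_dlm: Nodes in domain (<id>): <node numbers>" after the
# wrapped lines have been joined. Format taken from the log in this post.
NODES = re.compile(
    r"ocfs2_dlm: Nodes in domain \((?P<domain>[0-9A-F]+)\): (?P<nodes>[\d ]+)"
)


def membership_history(log_text):
    """Return the successive node lists reported for the DLM domain."""
    flat = " ".join(log_text.split())  # undo mail line wrapping
    return [
        [int(n) for n in match.group("nodes").split()]
        for match in NODES.finditer(flat)
    ]


# Three of the membership lines from the log, wrapped as in the mail.
sample = """
Dec  4 17:19:41 web2 kernel: [9724171.782112] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2
Dec  4 17:19:41 web2 kernel: [9724171.782348] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4
Dec  4 17:19:50 web2 kernel: [9724182.598853] ocfs2_dlm: Nodes in domain
(7D876A4B2EE14D0C8E1181E8DCF4237B): 2 4 5 6
"""

for step, nodes in enumerate(membership_history(sample), start=1):
    print(f"step {step}: nodes {nodes}")
```

In the excerpt, the last membership line lists nodes 2, 4, 5 and 6, so nodes 1 and 3 never show up in web2's view of the domain here.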

Does Oracle recommend against enabling DRS on virtual machines that use the
cluster, or is there a configuration we can use to keep crashes like this
from happening all the time?

 

I have seen several posts suggesting that disabling DRS would work around
this issue, but that is not really good practice, as you would lose a lot
of your HA abilities.

 

Also, is there a way to have OCFS2 drop a node from the cluster if a new node
comes online with its ID?

 

David Murphy
