Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-25 Thread Maurizio Giungato
On 22/03/2013 16:27, Digimer wrote:
 [...]

 Testing testing testing. It's good that you plan to test before 
 trusting. I wish everyone had that philosophy!

 The clustered locking for LVM comes into play for 
 activating/inactivating, creating, deleting, resizing and so on. It 
 does not affect what happens in an LV. That's why an LV remains 
 writeable when a fence is pending. However, I feel this is safe 
 because rgmanager won't recover a VM on another node until the lost 
 node is fenced.

 Cheers

Thank you very much! The cluster continues working like a charm. Failure 
after failure, I mean :)

We are not using rgmanager's fault management because it doesn't check 
memory availability on the destination node, so we prefer to handle that 
with a custom script we wrote.
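Roughly, the idea is just to compare the VM's configured memory with what is 
free on the candidate destination before starting it there. A simplified 
sketch of that check (not our actual script; the VM name is only an example):

  #!/bin/sh
  # memory (KiB) the VM is configured with, read from its libvirt definition
  VM_MEM_KIB=$(virsh dumpxml vm01 | sed -n 's/.*<memory[^>]*>\([0-9]*\)<\/memory>.*/\1/p')
  # free memory (KiB) on the candidate destination node (ignores cache,
  # so it is a conservative estimate)
  DEST_FREE_KIB=$(ssh lama5.blade "awk '/MemFree/ {print \$2}' /proc/meminfo")
  if [ "$DEST_FREE_KIB" -gt "$VM_MEM_KIB" ]; then
      echo "lama5.blade has enough free memory for vm01"
  else
      echo "not enough memory on lama5.blade, try the next node"
  fi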

Last questions:
- do you have any advice for improving tolerance against network failures?
- to avoid having a GFS2 filesystem just for the VMs' XML definitions, I've 
thought of keeping them synced on each node with rsync. Any alternatives?
- if I want only clustered LVM, with no other functions, can you advise on a 
minimal configuration? (for example, I think rgmanager is not necessary)

Thank you in advance





Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-25 Thread Digimer
On 03/25/2013 08:44 AM, Maurizio Giungato wrote:
 [...]

 Thank you very much! The cluster continues working like a charm. Failure
 after failure, I mean :)

 We are not using rgmanager's fault management because it doesn't check
 memory availability on the destination node, so we prefer to handle that
 with a custom script we wrote.

 Last questions:
 - do you have any advice for improving tolerance against network failures?
 - to avoid having a GFS2 filesystem just for the VMs' XML definitions, I've
 thought of keeping them synced on each node with rsync. Any alternatives?
 - if I want only clustered LVM, with no other functions, can you advise on a
 minimal configuration? (for example, I think rgmanager is not necessary)

 Thank you in advance

For network redundancy, I use two switches and bonded (mode=1) links 
with one link going to each switch. This way, losing a NIC or a switch 
won't break the cluster. Details here:

https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network
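On CentOS 6 that amounts to something like the following (a sketch only; the 
interface names and the miimon value are just common examples, adjust to your 
hardware):

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  ONBOOT=yes
  BOOTPROTO=none
  BONDING_OPTS="mode=1 miimon=100"
  # plus IPADDR/NETMASK/GATEWAY as appropriate for this node

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1,
  # with each NIC cabled to a different switch)
  DEVICE=eth0
  ONBOOT=yes
  BOOTPROTO=none
  MASTER=bond0
  SLAVE=yes

mode=1 is active-backup (only one link carries traffic at a time), which is 
generally the recommended bonding mode for the cluster network.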

Using rsync to keep the XML files in sync is fine, if you really don't 
want to use GFS2.
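For example, from each node something along these lines (the host names are 
just your node names; /etc/libvirt/qemu is where libvirt keeps the 
definitions, and vm01 is a placeholder):

  # push this node's VM definitions to the other nodes
  rsync -av /etc/libvirt/qemu/*.xml lama5.blade:/etc/libvirt/qemu/
  rsync -av /etc/libvirt/qemu/*.xml lama6.blade:/etc/libvirt/qemu/

  # on a receiving node, make libvirt re-read a changed definition
  virsh define /etc/libvirt/qemu/vm01.xml

Run it from cron, or after every 'virsh define'/'virsh edit', whichever you 
prefer.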

You do not need rgmanager for clvmd to work. All you need is the base 
cluster.conf (and working fencing, as you've seen).
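A minimal cluster.conf for that is really just the node list plus fencing, 
roughly like this (cluster name, blade numbers and the fence device details 
are placeholders, use your own):

  <?xml version="1.0"?>
  <cluster name="kvmcluster" config_version="1">
    <clusternodes>
      <clusternode name="lama5.blade" nodeid="1">
        <fence>
          <method name="1">
            <device name="bladecenter" port="5"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="lama6.blade" nodeid="2">
        <fence>
          <method name="1">
            <device name="bladecenter" port="6"/>
          </method>
        </fence>
      </clusternode>
      <!-- one <clusternode> block per node, same pattern -->
    </clusternodes>
    <fencedevices>
      <fencedevice name="bladecenter" agent="fence_bladecenter"
                   ipaddr="AMM_ADDRESS" login="USER" passwd="SECRET"/>
    </fencedevices>
  </cluster>

Then it's just cman and clvmd at boot (chkconfig cman on; chkconfig clvmd on) 
and locking_type = 3 in lvm.conf for clvmd.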

If you are over-provisioning VMs and need to 

Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-25 Thread Maurizio Giungato
On 25/03/2013 17:49, Digimer wrote:
 [...]

 For network redundancy, I use two switches and bonded (mode=1) links
 with one link going to each switch. This way, losing a NIC or a
 switch won't break the cluster. Details here:

 https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Network

 Using rsync to keep the XML files in sync is fine, if you really don't 
 want to use GFS2.

 You do not need rgmanager for clvmd to work. All you need is the base 
 cluster.conf (and working fencing, as 

Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-22 Thread Digimer
On 03/22/2013 11:21 AM, Maurizio Giungato wrote:
 [...]

 I used 'service network stop' to simulate the failure; the node gets
 fenced through fence_bladecenter (BladeCenter hardware).

 Anyway, I took qdisk out and set GFS2 aside, and now I have my VMs on LVM
 LVs. I've been trying for many hours to reproduce the issue:

 - only the node where I execute 'service network stop' gets fenced
 - with fallback_to_local_locking = 0 in lvm.conf, the LVM LVs remain
 writable even while fencing takes place

 Everything seems to work like a charm now.

 I'd like to understand what was happening. I'll test for some days before
 trusting it.

 Thank you so much.
 Maurizio


Testing testing testing. It's good that you plan to test before 
trusting. I wish everyone had that philosophy!

The clustered locking for LVM comes into play for 
activating/inactivating, creating, deleting, resizing and so on. It does 
not affect what happens in an LV. That's why an LV remains writeable 
when a fence is pending. However, I feel this is safe because rgmanager 
won't recover a VM on another node until the lost node is fenced.
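Concretely, it's the LVM metadata operations that go through clvmd/DLM, for 
example (the LV name here is just an example):

  # these take cluster-wide LVM locks:
  lvcreate -L 20G -n vm01_disk KVM_IMAGES
  lvchange -ay KVM_IMAGES/vm01_disk
  lvextend -L +5G KVM_IMAGES/vm01_disk

Normal reads and writes inside an LV that is already active do not, which is 
why the guests keep running while a fence is pending.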

Cheers

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-21 Thread Digimer
On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
 Hi guys,

 my goal is to create a reliable virtualization environment using CentOS
 6.4 and KVM. I have three nodes and a clustered GFS2.

 The environment is up and working, but I'm worried about reliability: if
 I turn the network interface down on one node to simulate a crash (for
 example on the node node6.blade):

 1) GFS2 hangs (processes go into D state) until node6.blade gets fenced
 2) not only node6.blade gets fenced, but also node5.blade!

 Help me to save my last neurons!

 Thanks
 Maurizio

DLM, the distributed lock manager provided by the cluster, is designed 
to block when a node goes into an unknown state. It does not unblock 
until that node is confirmed to be fenced. This is by design. GFS2, 
rgmanager and clustered LVM all use DLM, so they will all block as well.
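If you want to watch this while it is happening, a couple of commands on a 
cman-based cluster should show the DLM lockspaces and the fence domain state 
(exact output varies):

  dlm_tool ls
  fence_tool ls

The GFS2 mount and clvmd each show up as a lockspace, and fence_tool ls 
should show whether fenced is still waiting on a victim.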

As for why two nodes get fenced, you will need to share more about your 
configuration.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-21 Thread Maurizio Giungato

On 21/03/2013 18:48, Maurizio Giungato wrote:

[...]

My configuration is very simple; I attached the cluster.conf and hosts files.
This is the row I added in /etc/fstab:
/dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime,nodiratime 0 0


I also set fallback_to_local_locking = 0 in lvm.conf (but nothing changed).
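For reference, the relevant part of /etc/lvm/lvm.conf on the nodes looks 
roughly like this (excerpt):

  locking_type = 3                 # cluster-wide locking through clvmd/DLM
  fallback_to_local_locking = 0    # fail instead of silently using local locking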

PS: I had two virtualization environments working like a charm on 
OCFS2, but since CentOS 6.x I'm not able to install it. Is there some 
way to achieve the same results with GFS2? (With GFS2 I sometimes get a 
crash after just a 'service network restart' [I have many interfaces, 
so this operation takes more than 10 seconds]; with OCFS2 I've never 
had this problem.)


Thanks 

I attached my logs from /var/log/cluster/*




Mar 21 19:00:10 fenced fencing node lama6.blade
Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent fence_bladecenter result: error from agent
Mar 21 19:00:14 fenced fence lama6.blade failed
Mar 21 19:00:17 fenced fencing node lama6.blade
Mar 21 19:00:39 fenced fence lama6.blade success
Mar 21 19:00:45 fenced fencing node lama5.blade
Mar 21 19:00:57 fenced fence lama5.blade success


Mar 21 18:59:00 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 21 18:59:00 corosync [QUORUM] Members[3]: 1 2 3
Mar 21 18:59:00 corosync [QUORUM] Members[3]: 1 2 3
Mar 21 18:59:00 corosync [CPG   ] chosen downlist: sender r(0) ip(20.11.11.104) ; members(old:2 left:0)
Mar 21 18:59:00 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Mar 21 18:59:41 corosync [TOTEM ] A processor failed, forming new configuration.
Mar 21 19:00:10 corosync [QUORUM] Members[2]: 1 2
Mar 21 19:00:10 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 21 19:00:10 corosync [CPG   ] chosen downlist: sender r(0) ip(20.11.11.104) ; members(old:3 left:1)
Mar 21 19:00:10 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Mar 21 19:00:33 corosync [TOTEM ] A processor failed, forming new configuration.
Mar 21 19:00:45 corosync [QUORUM] Members[1]: 1
Mar 21 19:00:45 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 21 19:00:45 corosync [CPG   ] chosen downlist: sender r(0) ip(20.11.11.104) ; members(old:2 left:1)
Mar 21 19:00:45 corosync [MAIN  ] Completed service synchronization, ready to provide service.


Mar 21 19:00:10 rgmanager State change: lama6.blade DOWN
Mar 21 19:00:45 rgmanager State change: lama5.blade DOWN


Mar 21 19:00:10 fenced fencing node lama6.blade
Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent fence_bladecenter result: error from agent
Mar 21 19:00:14 fenced fence lama6.blade failed
Mar 21 19:00:17 fenced fencing node lama6.blade
Mar 21 19:00:39 fenced fence lama6.blade success
Mar 21 19:00:45 fenced fencing node lama5.blade
Mar 21 19:00:57 fenced fence lama5.blade success


Mar 21 19:00:27 qdiskd Writing eviction notice for node 3
Mar 21 19:00:28 qdiskd Writing eviction notice for node 2
Mar 21 19:00:28 qdiskd Node 3 evicted
Mar 21 19:00:29 qdiskd Node 2 evicted



Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-21 Thread Digimer
On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
 [...]

The configuration itself seems ok, though I think you can safely take 
qdisk out to simplify things. That's neither here nor there though.

This concerns me:

Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent fence_bladecenter result: error from agent
Mar 21 19:00:14 fenced fence lama6.blade failed

How are you triggering the failure(s)? The failed fence would certainly 
help explain the delays. As I mentioned earlier, DLM is designed to 
block when a node is in an unknown state (failed but not yet 
successfully fenced).
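If the fence path itself is flaky, it may be worth testing it by hand, 
outside the cluster, for example (address, credentials and blade number are 
placeholders; fence_node uses whatever is in cluster.conf):

  # ask the BladeCenter AMM for the blade's power state
  fence_bladecenter -a AMM_ADDRESS -l USER -p SECRET -n 6 -o status

  # or let the cluster run the configured agent against a node
  # (note: this really fences the node)
  fence_node lama6.blade

If the status call is slow or times out intermittently, that would line up 
with the "error from agent" followed by a success on the retry in your logs.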

As an aside, I do my HA VMs using clustered LVM LVs as the backing 
storage behind the VMs. GFS2 is an excellent file system, but it is 
expensive. Putting your VMs directly on the LV takes them out of the 
equation.
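The disk definition in the domain XML then just points straight at the LV, 
something like this (the LV name is a placeholder):

  <disk type='block' device='disk'>
    <driver name='qemu' type='raw'/>
    <source dev='/dev/KVM_IMAGES/vm01_disk'/>
    <target dev='vda' bus='virtio'/>
  </disk>

No filesystem layer in between.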

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


Re: [CentOS-virt] GFS2 hangs after one node going down

2013-03-21 Thread Zoltan Frombach
It's not related to your problem. Just a note: when you use the noatime 
mounting option in fstab then you do not need to use nodiratime because 
noatime takes care of both.
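In other words, the fstab entry can simply be:

/dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2 defaults,noatime 0 0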


Zoltan

On 3/21/2013 6:48 PM, Maurizio Giungato wrote:

[...]



___
CentOS-virt mailing list
CentOS-virt@centos.org
http://lists.centos.org/mailman/listinfo/centos-virt