Re: [Linux-cluster] problems with clvmd and lvms on rhel6.1

Digimer Fri, 10 Aug 2012 10:17:35 -0700

Could well be. As I mentioned, no fencing == things break.


On 08/10/2012 01:00 PM, Chip Burke wrote:

See my earlier thread; I am having similar issues. I will be testing this
soon, but I "think" the issue in my case is that I set up SCSI fencing
before GFS2. So essentially it has nothing to fence off of, sees that as a
fault, and never recovers. I "think" my fix will be to establish the LVMs,
GFS2, etc. first, and then add the SCSI fence so that it can actually
create the private reservations. Then the fun begins: pulling the plug
randomly to see how it behaves.
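For reference, here is a minimal sketch of SCSI fencing in a RHEL 6 style cluster.conf, assuming the stock fence_scsi agent. The node names and the device name "scsi_fence" are hypothetical. On RHEL 6, fence_scsi is normally paired with an <unfence> block, so that each node registers its reservation key on the shared storage when it joins the cluster. Without that registration there is nothing for the fence action to revoke. If no devices are listed, the agent is expected to work on the devices behind clustered volume groups, which may be why the order of setup matters here. Check the fence_scsi man page for the options your version supports.

```xml
<!-- Sketch only: node names and the device name "scsi_fence" are hypothetical. -->
<!-- These elements go inside <cluster>; the other sections are omitted. -->
<clusternodes>
        <clusternode name="node1.example.com" nodeid="1" votes="1">
                <fence>
                        <method name="scsi">
                                <device name="scsi_fence"/>
                        </method>
                </fence>
                <!-- Unfencing registers this node's key on the shared devices when it starts. -->
                <unfence>
                        <device name="scsi_fence" action="on"/>
                </unfence>
        </clusternode>
        <!-- node2.example.com gets identical fence and unfence blocks. -->
</clusternodes>
<fencedevices>
        <fencedevice agent="fence_scsi" name="scsi_fence"/>
</fencedevices>
```

When a node is fenced, the survivor removes that node's key, which cuts the failed node off from the shared LUNs.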
________________________________________
Chip Burke

On 8/10/12 12:46 PM, "Digimer" wrote:

Not sure if it relates, but I can say that without fencing, things will
break in strange ways. The reason is that if anything triggers a fault,
the cluster blocks by design and stays blocked until a fence call
succeeds (which is impossible without fencing configured in the first
place).

Can you please set up fencing and test that it works (run
'fence_node rhel2.local' from rhel1.local, then the reverse)? Once that
is done, test again for your problem. If it still exists, please paste
the updated cluster.conf. Also, please include the syslog from both
nodes around the time of your LVM tests.
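Here is a minimal sketch of how the empty <fence/> and <fencedevices/> elements in the cluster.conf posted further down in this thread might look once they are populated. It assumes IPMI-based fencing with the fence_ipmilan agent. The agent choice, device names, addresses and credentials are all hypothetical placeholders. Use whatever fence hardware the nodes actually have.

```xml
<!-- Sketch only: fence_ipmilan, device names, IPs and credentials are hypothetical. -->
<!-- Replaces the <clusternodes> and empty <fencedevices/> sections; also increment config_version. -->
<clusternodes>
        <clusternode name="rhel1.local" nodeid="1" votes="1">
                <fence>
                        <method name="ipmi">
                                <device name="ipmi_rhel1"/>
                        </method>
                </fence>
        </clusternode>
        <clusternode name="rhel2.local" nodeid="2" votes="1">
                <fence>
                        <method name="ipmi">
                                <device name="ipmi_rhel2"/>
                        </method>
                </fence>
        </clusternode>
</clusternodes>
<fencedevices>
        <fencedevice agent="fence_ipmilan" name="ipmi_rhel1" ipaddr="192.0.2.11" login="admin" passwd="secret" lanplus="1"/>
        <fencedevice agent="fence_ipmilan" name="ipmi_rhel2" ipaddr="192.0.2.12" login="admin" passwd="secret" lanplus="1"/>
</fencedevices>
```

Once the updated file is on both nodes, the 'fence_node' test from each side should actually reboot the target node. Only after that works is it worth retesting the LVM problem.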

digimer

On 08/10/2012 12:38 PM, Poós Krisztián wrote:

This is the cluster.conf. It comes from a clone of the problematic system
in a test environment, without the Oracle and SAP instances. It focuses
only on this LVM issue, using an LVM resource.

[root@rhel2 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="7" name="teszt">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel1.local" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="rhel2.local" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3"/>
        <fencedevices/>
        <rm>
                <failoverdomains>
                        <failoverdomain name="all" nofailback="1" ordered="1" restricted="0">
                                <failoverdomainnode name="rhel1.local" priority="1"/>
                                <failoverdomainnode name="rhel2.local" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <lvm lv_name="teszt-lv" name="teszt-lv" vg_name="teszt"/>
                        <fs device="/dev/teszt/teszt-lv" fsid="43679" fstype="ext4" mountpoint="/lvm" name="teszt-fs"/>
                </resources>
                <service autostart="1" domain="all" exclusive="0" name="teszt" recovery="disable">
                        <lvm ref="teszt-lv"/>
                        <fs ref="teszt-fs"/>
                </service>
        </rm>
        <quorumd label="qdisk"/>
</cluster>

Here are the relevant parts of the log:
Aug 10 17:21:21 rgmanager I am node #2
Aug 10 17:21:22 rgmanager Resource Group Manager Starting
Aug 10 17:21:22 rgmanager Loading Service Data
Aug 10 17:21:29 rgmanager Initializing Services
Aug 10 17:21:31 rgmanager /dev/dm-2 is not mounted
Aug 10 17:21:31 rgmanager Services Initialized
Aug 10 17:21:31 rgmanager State change: Local UP
Aug 10 17:21:31 rgmanager State change: rhel1.local UP
Aug 10 17:23:23 rgmanager Starting stopped service service:teszt
Aug 10 17:23:25 rgmanager Failed to activate logical volume, teszt/teszt-lv
Aug 10 17:23:25 rgmanager Attempting cleanup of teszt/teszt-lv
Aug 10 17:23:29 rgmanager Failed second attempt to activate teszt/teszt-lv
Aug 10 17:23:29 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
Aug 10 17:23:29 rgmanager #68: Failed to start service:teszt; return value: 1
Aug 10 17:23:29 rgmanager Stopping service service:teszt
Aug 10 17:23:30 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
Aug 10 17:23:30 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))
Aug 10 17:23:31 rgmanager #12: RG service:teszt failed to stop; intervention required
Aug 10 17:23:31 rgmanager Service service:teszt is failed
Aug 10 17:24:09 rgmanager #43: Service service:teszt has failed; can not start.
Aug 10 17:24:09 rgmanager #13: Service service:teszt failed to stop cleanly
Aug 10 17:25:12 rgmanager Starting stopped service service:teszt
Aug 10 17:25:14 rgmanager Failed to activate logical volume, teszt/teszt-lv
Aug 10 17:25:15 rgmanager Attempting cleanup of teszt/teszt-lv
Aug 10 17:25:17 rgmanager Failed second attempt to activate teszt/teszt-lv
Aug 10 17:25:18 rgmanager start on lvm "teszt-lv" returned 1 (generic error)
Aug 10 17:25:18 rgmanager #68: Failed to start service:teszt; return value: 1
Aug 10 17:25:18 rgmanager Stopping service service:teszt
Aug 10 17:25:19 rgmanager stop: Could not match /dev/teszt/teszt-lv with a real device
Aug 10 17:25:19 rgmanager stop on fs "teszt-fs" returned 2 (invalid argument(s))


After I manually activated the LV on node1 and tried to move the service
to node2, node2 was not able to start it.

Regards,
Krisztian


On 08/10/2012 05:15 PM, Digimer wrote:

On 08/10/2012 11:07 AM, Poós Krisztián wrote:

Dear all,

I hope someone has run into this problem before and can help me
resolve it.

This is a two-node RHEL cluster with a quorum disk.
The LVM volume groups are clustered (the -c flag is set).
When I start clvmd, all the clustered LVs become active.

If I then start rgmanager, it deactivates all the volumes and cannot
activate them again: by the time the service starts, the devices no
longer exist, so the service fails.
All LVs remain without the active flag.

I can bring it up manually, but only if I deactivate the LVs by hand
with 'lvchange -an <lv>' after clvmd has started.
After that, when I start rgmanager, it can bring them online without
problems. However, I think rgmanager should do this itself. The logs
are full of the following:
rgmanager Making resilient: lvchange -an ....
rgmanager lv_exec_resilient failed
rgmanager lv_activate_resilient stop failed on ....

Also, the lvs/clvmd commands sometimes hang, and I have to restart
clvmd (sometimes kill it) to make them work again.

Does anyone have an idea what to check?

Thanks and regards,
Krisztian


Please paste your cluster.conf file with minimal edits.



--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
https://www.redhat.com/mailman/listinfo/linux-cluster