Re: [DRBD-user] Secondary node io-error

Velayutham, Prakash Sun, 14 Oct 2012 06:42:00 -0700

On Oct 10, 2012, at 10:01 PM, Velayutham, Prakash wrote:

> On Oct 10, 2012, at 5:01 AM, Lars Ellenberg wrote:
> 
>> On Wed, Oct 10, 2012 at 03:42:02AM +0000, Velayutham, Prakash wrote:
>>> 
>>> On Oct 8, 2012, at 9:19 AM, Velayutham, Prakash wrote:
>>> 
>>>> On Oct 8, 2012, at 4:55 AM, Lars Ellenberg wrote:
>>>> 
>>>>> On Sat, Oct 06, 2012 at 01:08:43PM +0000, Velayutham, Prakash wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I recently got a DRBD (8.4.2-2) cluster up (still testing). It seems to 
>>>>>> work nicely with Pacemaker CRM in several scenarios I have tested. Here 
>>>>>> is my config.
>>>>>> 
>>>>>> global {
>>>>>>             usage-count     yes;
>>>>>> }
>>>>>> 
>>>>>> common {
>>>>>>     handlers {
>>>>>>             outdate-peer    /usr/lib/drbd/crm-fence-peer.sh;
>>>>>>             fence-peer      /usr/lib/drbd/crm-fence-peer.sh;
>>>>>>             after-resync-target     /usr/lib/drbd/crm-unfence-peer.sh;
>>>>>>             local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
>>>>>>             split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>>>>>>     }
>>>>>> 
>>>>>>     startup {
>>>>>>             degr-wfc-timeout        0;
>>>>>>     }
>>>>>> 
>>>>>>     net {
>>>>>>             shared-secret   1QP69G4kWDslx2TMiaEStI6bwaGH5y8d;
>>>>>>             after-sb-0pri discard-zero-changes;
>>>>>>             after-sb-1pri discard-secondary;
>>>>>>             after-sb-2pri disconnect;
>>>>>>     }
>>>>>> 
>>>>>>     disk {
>>>>>>             on-io-error     call-local-io-error;
>>>>>>             fencing resource-and-stonith;
>>>>>>     }
>>>>>> 
>>>>>> }
>>>>>> 
>>>>>> The io-error handler only gets called when the primary node has a disk
>>>>>> issue. I have not seen the secondary node call the "local-io-error"
>>>>>> handler when it had disk access issues. Is this by design?
>>>>> 
>>>>> No.
>>>>> 
>>>>> "Works for me", though.
>>>>> 
>>>>> Can you please double check?
>>>>> And if in fact you can reproduce, tell us how, including logs?
>> 
>>>> If I disable all the FC ports in the fiber switch just for the
>>>> primary node, the node fences, reboots and comes up, as I would
>>>> expect. With the exact same config, if I disable the FC ports just
>>>> for the secondary node, the node just sits there and it even shows
>>>> up as Secondary in /proc/drbd.
>> 
>>>> That sounds odd and sounds like the
>>>> config should be "diskless", but it is "call-local-io-error".
>> 
>> Huh? What has "config" to do with things,
>> and what exactly is "config diskless"?
>> 
>> 
>>>> Which logs are you wanting me to share?
>> 
>> Those that show DRBD detecting an IO error,
>> but not calling the io-error handler.
>> 
>>>> Thanks,
>>>> Prakash
>>> 
>>> Just wanted to add this. I repeated my test again and get the exact
>>> same results again. Here is /proc/drbd of the primary (bmimysqlt3) and
>>> secondary (bmimysqlt4) before the secondary's disk is cut off
>>> (disabling the fiber switch port that the secondary is connected to)
>>> 
>>> [root@bmimysqlt3 ~]# cat /proc/drbd 
>>> version: 8.4.2 (api:1/proto:86-101)
>>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by 
>>> [email protected], 2012-10-02 00:02:32
>>> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>   ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>> 
>>> [root@bmimysqlt4 ~]# cat /proc/drbd 
>>> version: 8.4.2 (api:1/proto:86-101)
>>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by 
>>> [email protected], 2012-10-02 00:02:32
>>> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>>>   ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>> 
>>> Here is /proc/drbd of primary and secondary about 5 minutes after the disk 
>>> is cut off.
>>> 
>>> [root@bmimysqlt3 ~]# cat /proc/drbd 
>>> version: 8.4.2 (api:1/proto:86-101)
>>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by 
>>> [email protected], 2012-10-02 00:02:32
>>> 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
>>>   ns:184 nr:0 dw:160 dr:14317 al:6 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>> 
>> No additional writes.
>> 
>>> [root@bmimysqlt4 ~]# cat /proc/drbd 
>>> version: 8.4.2 (api:1/proto:86-101)
>>> GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by 
>>> [email protected], 2012-10-02 00:02:32
>>> 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
>>>   ns:0 nr:184 dw:184 dr:0 al:0 bm:6 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>> 
>> Nothing transferred, nothing written, nothing changed.
>> 
>>> As you can see, there is absolutely nothing there to suggest that the
>>> secondary even noticed the io-error.
>>> 
>>> I can't understand what is going on.
>> 
>> Do you realize that you need to do IO to get (and then be able to notice) IO 
>> errors?
>> 
>> Cheers,
>> 
>>      Lars
> 
> Wow, feeling like an idiot now. Sorry for the false alarm. I just got 
> confused because the primary node got fenced right away without any sort of 
> manual write operation from me, but the secondary did not exhibit that same 
> behavior.
> 
> Thanks,
> Prakash
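
The two secondary snapshots quoted above show identical counters, which is Lars's point: with no IO in flight there is nothing for DRBD to fail on. Below is a minimal sketch for comparing such snapshots, assuming the /proc/drbd layout shown above. The function name drbd_counters is made up for illustration and is not part of DRBD.

```shell
#!/bin/sh
# drbd_counters: read /proc/drbd-style text on stdin and print the
# network-send (ns), network-receive (nr), disk-write (dw) and disk-read (dr)
# counters, one output line per device.
drbd_counters() {
    awk '/^[ \t]*ns:[0-9]/ {
        out = ""
        for (i = 1; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[1] == "ns" || kv[1] == "nr" || kv[1] == "dw" || kv[1] == "dr")
                out = out (out == "" ? "" : " ") kv[1] "=" kv[2]
        }
        print out
    }'
}

# Usage on a DRBD node (the 300 seconds are arbitrary):
#   before=$(drbd_counters < /proc/drbd); sleep 300
#   after=$(drbd_counters < /proc/drbd)
#   [ "$before" = "$after" ] && echo "no IO since the first snapshot"
```

If the two snapshots are equal, no replicated write reached the secondary's disk in between, so an IO error could not have been noticed there.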


However, I have hit a snag again, in a different scenario.

/etc/drbd.d/global_common.conf:

global {
                usage-count     yes;
}

common {
        startup {
                degr-wfc-timeout        0;
        }

        net {
                cram-hmac-alg   sha1;
                shared-secret   xxxxxx;
        }

        disk {
                on-io-error     call-local-io-error;
        }

}

/etc/drbd.d/mysql1.res:

resource mysql1 {

        handlers {
                fence-peer      /usr/lib/drbd/crm-fence-peer.sh;
                local-io-error  "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                split-brain     "/usr/lib/drbd/notify-split-brain.sh root";
                after-resync-target     /usr/lib/drbd/crm-unfence-peer.sh;
        }

        net {
                after-sb-0pri   discard-zero-changes;
                after-sb-1pri   discard-secondary;
        }

        disk {
                fencing resource-and-stonith;
        }

        on node1 {
                volume 0 {
                        device          /dev/drbd1;
                        disk            /dev/mapper/mysql_data1;
                        flexible-meta-disk      internal;
                }
                address         x.x.x.x:7788;
        }
        on node2 {
                volume 0 {
                        device          /dev/drbd1;
                        disk            /dev/mapper/mysql_data1;
                        flexible-meta-disk      internal;
                }
                address         x.x.x.x:7788;
        }
}


/etc/drbd.d/mysql2.res:

resource mysql2 {

        handlers {
                fence-peer      /usr/lib/drbd/crm-fence-peer.sh;
                local-io-error  "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
                split-brain     "/usr/lib/drbd/notify-split-brain.sh root";
                after-resync-target     /usr/lib/drbd/crm-unfence-peer.sh;
        }

        net {
                after-sb-0pri   discard-zero-changes;
                after-sb-1pri   discard-secondary;
        }

        disk {
                fencing resource-and-stonith;
        }

        on node1 {
                volume 0 {
                        device          /dev/drbd2;
                        disk            /dev/mapper/mysql_data2;
                        flexible-meta-disk      internal;
                }
                address         x.x.x.x:7789;
        }
        on node2 {
                volume 0 {
                        device          /dev/drbd2;
                        disk            /dev/mapper/mysql_data2;
                        flexible-meta-disk      internal;
                }
                address         x.x.x.x:7789;
        }
}

So there are two resources, mysql1 and mysql2:

mysql1 is primary on node1 and secondary on node2. An ext4 file system on this 
DRBD volume is mounted at /fs1.
mysql2 is primary on node2 and secondary on node1. An ext4 file system on this 
DRBD volume is mounted at /fs2.

If I disable the FC ports on node1, it reboots instantaneously. No need to 
create any IO for this to happen. mysql1 gets promoted to primary on node2 and 
all is fine.

But if I disable the FC ports on node2, nothing happens. Even "mkdir 
/fs2/newdir" just hangs. I would expect that command to generate the IO that 
hits the error and reboots the node. Any thoughts?
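
For reference, here is a minimal sketch of a write probe that tells a failed write apart from one that is still blocked. The distinction matters here because only a write that actually returns an error can be detected as an IO error. The helper name probe_write, the probe file name, and the 10-second default are assumptions for illustration, not part of DRBD or of the setup above.

```shell
#!/bin/sh
# probe_write DIR [SECONDS]
# Hypothetical helper: writes one 4 KiB block into DIR, forces it to disk,
# and reports whether the write completed, failed, or is still blocked.
probe_write() {
    dir=$1
    limit=${2:-10}
    probe="$dir/.io-probe.$$"

    # conv=fsync makes dd fsync() the file before exiting, so the data has to
    # reach the block device rather than stay in the page cache.
    dd if=/dev/zero of="$probe" bs=4096 count=1 conv=fsync 2>/dev/null &
    pid=$!

    # Poll instead of waiting: a process stuck in uninterruptible IO cannot be
    # killed, so the probe only reports the hang and returns.
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "blocked: write to $dir not finished after ${limit}s"
            return 2
        fi
        sleep 1
        waited=$((waited + 1))
    done

    if wait "$pid"; then
        rm -f "$probe"
        echo "ok: write to $dir completed"
        return 0
    fi
    echo "error: write to $dir failed"
    return 1
}

# Usage (the mount point is taken from the description above):
#   probe_write /fs2 10
```

A "blocked" result would mean the request is still pending somewhere below the file system and no error has been returned yet. An "error" result would mean the write did fail.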

Thanks,
Prakash
