Hi Luca,

We (Ganeti) use drbdsetup directly because it's the more "programmatic" way
of configuring DRBD. We don't want to manage the configuration file; we
just reconfigure machines as they join a cluster. You could think of
Ganeti as managing the DRBD configuration in a different way.

Anyway, drbdadm is just a wrapper around drbdsetup. In particular, "drbdadm
--allow-two-primaries=yes r0" just calls "drbdsetup net-options
ipv4:<local_ip>:11001 ipv4:<remote_ip>:11001 --allow-two-primaries=yes"
(according to
http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=blob;f=user/drbdadm_main.c;h=8179625f7bef7172c07974dd63bf76ddb10b0d60;hb=HEAD#l1668).
Can adding "--protocol=C" really make a difference, especially if
--allow-two-primaries only works with protocol C anyway?
Additionally, the actual problem occurs as soon as I issue a "drbdsetup
primary 0" (which is what "drbdadm primary r0" calls as well).

As I stated, I tried to call the above command on one and on both sides
simultaneously, with the same outcome.
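
For reference, here is a minimal sketch of the sequence with a wait step
added between the two commands. It is only a way to test whether timing
matters, not a confirmed fix. It assumes that "drbdsetup cstate <minor>"
prints the connection state (e.g. "Connected"), that the address arguments
are passed as in the commands quoted in this thread, and that only the
changed option needs to be given. DRBDSETUP, wait_connected and
enable_dual_primary are hypothetical names introduced for illustration.

```shell
# Sketch only: set allow-two-primaries, wait for the connection to settle,
# then promote. DRBDSETUP is a hypothetical override hook for dry runs.
DRBDSETUP="${DRBDSETUP:-drbdsetup}"

# Poll "drbdsetup cstate <minor>" until it prints "Connected".
# Returns 0 on success, 1 if the state never settles within <tries> polls.
wait_connected() {
    minor="$1"
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        state="$("$DRBDSETUP" cstate "$minor" 2>/dev/null)"
        if [ "$state" = "Connected" ]; then
            return 0
        fi
        tries=$((tries - 1))
        # Pause between polls, but not after the final attempt.
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}

# Change only the one option, then promote once the peer link is stable.
enable_dual_primary() {
    local_addr="$1"
    remote_addr="$2"
    minor="$3"
    "$DRBDSETUP" net-options "$local_addr" "$remote_addr" \
        --allow-two-primaries=yes || return 1
    wait_connected "$minor" || return 1
    "$DRBDSETUP" primary "$minor"
}
```

It would be called as "enable_dual_primary ipv4:<local_ip>:11001
ipv4:<remote_ip>:11001 1". If the connection never reports Connected, the
function returns non-zero without promoting.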

So, am I hitting a DRBD bug here? Or do you have other ideas of what I
could be doing wrong?
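
To make the two kernel logs quoted further down easier to compare, a small
filter like the following could reduce them to the connection-state
transitions of one resource. This is a sketch under assumptions: it reads
unwrapped log lines on stdin (not the line-wrapped form quoted in this
mail), it expects the 15-character syslog timestamp seen in the logs, and
conn_transitions is a hypothetical helper name.

```shell
# Print "<timestamp> <old state> -> <new state>" for every connection-state
# change of one DRBD resource. Reads kernel log lines on stdin; <resource>
# is the name that follows "d-con" in the log (e.g. resource1).
conn_transitions() {
    resource="$1"
    grep "d-con $resource:" |
        sed -n 's/^\(.\{15\}\).*conn( \([A-Za-z]*\) -> \([A-Za-z]*\) ).*$/\1 \2 -> \3/p'
}
```

Running it on both nodes' logs with "conn_transitions resource1" should show
at a glance where the paths diverge: in the logs below the old primary
leaves Connected via BrokenPipe, while the old secondary leaves it via
ProtocolError.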

Cheers,
Thomas


On Fri, Apr 26, 2013 at 5:04 PM, Luca Fornasari <[email protected]> wrote:

> Hi Thomas,
>
> Inline reply below.
>
> On Fri, Apr 26, 2013 at 4:14 PM, Thomas Thrainer <[email protected]> wrote:
>
>> Hi Luca,
>>
>> (CC'd drbd-user, I guess that might be helpful for others as well)
>>
>
> Just reply to the list; I'm subscribed ;)
>
>
>> We're not using drbdadm but drbdsetup directly.
>>
>> I tried `drbdsetup net-options ipv4:<local_ip>:11001
>> ipv4:<remote_ip>:11001 --protocol C --allow-two-primaries=yes`
>> (i.e. I stripped the repeated options), but the result is still the same.
>>
>
> I'm not 100% sure, but I think that repeating ipv4:<local_ip>:local_port
> ipv4:<remote_ip>:remote_port restarts the connection; under low load that
> happens fast enough, while under high load it fails.
> Just try to use "drbdadm --allow-two-primaries=yes r0" on one node only.
> Do you have a good reason to use drbdsetup directly?
>
> Cheers,
> Luca
>
>
>> Note however, that the problem occurs only every now and then, and
>> primarily when there is load on the disk(s).
>>
>> BTW, I actually do set two disks to dual-primary mode at the same time
>> (using different connections/resources though), and one disk normally works
>> while the other fails (it's not deterministic which disk fails).
>>
>> Cheers,
>> Thomas
>>
>>
>> On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <[email protected]> wrote:
>>
>>> Hi Thomas,
>>>
>>> Just execute the following on one node only:
>>>
>>> drbdadm net-options --protocol=C --allow-two-primaries r0
>>>
>>> I guess that the command you are issuing just tries to restart an already
>>> running resource.
>>>
>>> Cheers,
>>> Luca
>>>
>>>
>>> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <[email protected]> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I've encountered a problem with DRBD 8.4.2 when I try to enable
>>>> --allow-two-primaries on the fly and immediately promote the secondary to
>>>> primary afterwards.
>>>> The problem doesn't always occur, and it seems more likely to happen
>>>> when there is more load on the device.
>>>>
>>>> The exact command sequence is as follows:
>>>>
>>>> Executed on primary and secondary node simultaneously (but also happens
>>>> if only executed on secondary):
>>>>
>>>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001
>>>> --protocol C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus
>>>> --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>>>> drbdsetup primary 1
>>>>
>>>> BTW, the only option that differs from the previously issued
>>>> drbdsetup connect command is --allow-two-primaries. The rest (protocol,
>>>> secret, etc.) is just repeated.
>>>>
>>>> The outcome is that both nodes end up in the StandAlone state.
>>>>
>>>> Their respective kernel log messages are:
>>>>
>>>> (Old) primary:
>>>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer(
>>>> Secondary -> Primary )
>>>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer(
>>>> Secondary -> Primary )
>>>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock
>>>> was shut down by peer
>>>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer(
>>>> Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate ->
>>>> DUnknown )
>>>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short
>>>> read (expected size 16)
>>>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new
>>>> current UUID
>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1:
>>>> asender terminated
>>>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1:
>>>> Terminating asender thread
>>>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1:
>>>> Connection closed
>>>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn(
>>>> BrokenPipe -> Unconnected )
>>>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1:
>>>> receiver terminated
>>>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1:
>>>> Restarting receiver thread
>>>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1:
>>>> receiver (re)started
>>>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn(
>>>> Unconnected -> WFConnection )
>>>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1:
>>>> Handshake successful: Agreed network protocol version 101
>>>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer
>>>> authenticated using 16 bytes HMAC
>>>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn(
>>>> WFConnection -> WFReportParams )
>>>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1:
>>>> Starting asender thread (from drbd_r_resource [2039])
>>>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1:
>>>> drbd_sync_handshake:
>>>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self
>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>> bits:3072 flags:0
>>>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer
>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>>> flags:0
>>>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1:
>>>> uuid_compare()=100 by rule 90
>>>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper
>>>> command: /bin/true initial-split-brain minor-1
>>>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn(
>>>> WFReportParams -> NetworkFailure )
>>>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1:
>>>> asender terminated
>>>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1:
>>>> Terminating asender thread
>>>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper
>>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper
>>>> command: /bin/true split-brain minor-1
>>>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper
>>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn(
>>>> NetworkFailure -> Disconnecting )
>>>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1:
>>>> Connection closed
>>>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn(
>>>> Disconnecting -> StandAlone )
>>>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1:
>>>> receiver terminated
>>>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1:
>>>> Terminating receiver thread
>>>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0)
>>>> entering forwarding state
>>>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0)
>>>> entering disabled state
>>>>
>>>> (Old) secondary:
>>>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role(
>>>> Secondary -> Primary )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role(
>>>> Secondary -> Primary )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1:
>>>> peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk(
>>>> UpToDate -> DUnknown )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new
>>>> current UUID
>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1:
>>>> asender terminated
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1:
>>>> Terminating asender thread
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1:
>>>> Connection closed
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1:
>>>> conn( ProtocolError -> Unconnected )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1:
>>>> receiver terminated
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1:
>>>> Restarting receiver thread
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1:
>>>> receiver (re)started
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1:
>>>> conn( Unconnected -> WFConnection )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1:
>>>> Handshake successful: Agreed network protocol version 101
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1:
>>>> Peer authenticated using 16 bytes HMAC
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1:
>>>> conn( WFConnection -> WFReportParams )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1:
>>>> Starting asender thread (from drbd_r_resource [20607])
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1:
>>>> drbd_sync_handshake:
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self
>>>> 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0
>>>> flags:0
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer
>>>> DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>> bits:3072 flags:0
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1:
>>>> uuid_compare()=100 by rule 90
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper
>>>> command: /bin/true initial-split-brain minor-1
>>>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper
>>>> command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper
>>>> command: /bin/true split-brain minor-1
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper
>>>> command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1:
>>>> conn( WFReportParams -> Disconnecting )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1:
>>>> asender terminated
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1:
>>>> Terminating asender thread
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1:
>>>> Connection closed
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1:
>>>> conn( Disconnecting -> StandAlone )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1:
>>>> receiver terminated
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1:
>>>> Terminating receiver thread
>>>>
>>>>
>>>> What am I doing wrong? Is there a requirement to wait for a
>>>> sync/propagation of properties/random amount of time before promoting the
>>>> secondary to primary? Is this a bug?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>> --
>>>> Thomas Thrainer | Software Engineer | [email protected] |
>>>>
>>>>  Google Germany GmbH
>>>> Dienerstr. 12
>>>> 80331 München
>>>>
>>>> Registergericht und -nummer: Hamburg, HRB 86891
>>>> Sitz der Gesellschaft: Hamburg
>>>> Geschäftsführer: Graham Law, Katherine Stephens
>>>>
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> [email protected]
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>
>>>>
>>>
>>
>>
>>
>
>
>
>


