Hi Luca,

we (Ganeti) use drbdsetup directly because it's the more "programmatic" way of configuring DRBD. We don't want to manage the configuration file; we just reconfigure machines as they join a cluster. You could think of Ganeti as managing the DRBD configuration in a different way.

Anyway, drbdadm is just a wrapper around drbdsetup. In particular, "drbdadm --allow-two-primaries=yes r0" just calls "drbdsetup net-options ipv4:<local_ip>:11001 ipv4:<remote_ip>:11001 --allow-two-primaries=yes" (according to http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=blob;f=user/drbdadm_main.c;h=8179625f7bef7172c07974dd63bf76ddb10b0d60;hb=HEAD#l1668). Can adding "--protocol=C" really make a difference, especially if --allow-two-primaries only works with protocol C anyway?

Additionally, the actual problem occurs as soon as I issue a "drbdsetup primary 0" (which is what "drbdadm primary r0" calls as well). As I stated, I tried to call the above command on one side and on both sides simultaneously, with the same outcome.

So, am I hitting a DRBD bug here? Or do you have other ideas about what I could be doing wrong?

Cheers,
Thomas

On Fri, Apr 26, 2013 at 5:04 PM, Luca Fornasari <[email protected]> wrote:

> Hi Thomas,
>
> In line reply below
>
> On Fri, Apr 26, 2013 at 4:14 PM, Thomas Thrainer <[email protected]> wrote:
>
>> Hi Luca,
>>
>> (CC'd drbd-user, I guess that might be helpful for others as well)
>>
>
> Just reply to the list; I'm subscribed ;)
>
>
>> We're not using drbdadm but drbdsetup directly.
>>
>> I tried `drbdsetup net-options ipv4:<local_ip>:11001 ipv4:<remote_ip>:11001 --protocol C --allow-two-primaries=yes` (i.e. I stripped the repeated options), but the result is still the same.
>>
>
> I'm not 100% sure, but I think that repeating ipv4:<local_ip>:local_port ipv4:<remote_ip>:remote_port restarts the connection; under low load that happens fast enough, while under high load it fails.
> Just try to use "drbdadm --allow-two-primaries=yes r0" on one node only.
> Do you have a good reason to use drbdsetup directly?
>
> Cheers,
> Luca
>
>
>> Note however, that the problem occurs only every now and then, and primarily when there is load on the disk(s).
>>
>> BTW, I actually do set two disks to dual-primary mode at the same time (using different connections/resources though), and one disk normally works while the other fails (it's not deterministic which disk fails).
>>
>> Cheers,
>> Thomas
>>
>>
>> On Fri, Apr 26, 2013 at 3:58 PM, Luca Fornasari <[email protected]> wrote:
>>
>>> Hi Thomas,
>>>
>>> Just execute the following on one node only:
>>>
>>> drbdadm net-options --protocol=C --allow-two-primaries r0
>>>
>>> I guess that the command you are issuing just tries to restart an already running resource.
>>>
>>> Cheers,
>>> Luca
>>>
>>>
>>> On Fri, Apr 26, 2013 at 2:27 PM, Thomas Thrainer <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've encountered a problem with DRBD 8.4.2 when I try to enable --allow-two-primaries on the fly and immediately promote the secondary to primary afterwards.
>>>> The problem doesn't always occur, and it seems more likely to happen when there is more load on the device.
>>>>
>>>> The exact command sequence is as follows:
>>>>
>>>> Executed on primary and secondary node simultaneously (but it also happens if only executed on the secondary):
>>>>
>>>> drbdsetup net-options ipv4:<loc_ip>:11001 ipv4:<rem_ip>:11001 --protocol C --after-sb-0pri discard-zero-changes --after-sb-1pri consensus --allow-two-primaries=yes --cram-hmac-alg md5 --shared-secret <secret>
>>>> drbdsetup primary 1
>>>>
>>>> BTW, the only option which differs from the previously issued drbdsetup connect command is --allow-two-primaries. The rest (protocol, secret, etc.) are just repeated.
>>>>
>>>> The outcome is that both nodes end up in the StandAlone state.
>>>>
>>>> Their respective kernel log messages are:
>>>>
>>>> (Old) primary:
>>>> Apr 26 11:19:42 primary kernel: [181721.646750] block drbd0: peer( Secondary -> Primary )
>>>> Apr 26 11:19:42 primary kernel: [181721.669870] block drbd1: peer( Secondary -> Primary )
>>>> Apr 26 11:19:42 primary kernel: [181722.057848] d-con resource1: sock was shut down by peer
>>>> Apr 26 11:19:42 primary kernel: [181722.057872] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
>>>> Apr 26 11:19:42 primary kernel: [181722.057881] d-con resource1: short read (expected size 16)
>>>> Apr 26 11:19:42 primary kernel: [181722.057914] block drbd1: new current UUID DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF
>>>> Apr 26 11:19:42 primary kernel: [181722.057964] d-con resource1: asender terminated
>>>> Apr 26 11:19:42 primary kernel: [181722.057977] d-con resource1: Terminating asender thread
>>>> Apr 26 11:19:42 primary kernel: [181722.058485] d-con resource1: Connection closed
>>>> Apr 26 11:19:42 primary kernel: [181722.067019] d-con resource1: conn( BrokenPipe -> Unconnected )
>>>> Apr 26 11:19:42 primary kernel: [181722.067027] d-con resource1: receiver terminated
>>>> Apr 26 11:19:42 primary kernel: [181722.067032] d-con resource1: Restarting receiver thread
>>>> Apr 26 11:19:42 primary kernel: [181722.067036] d-con resource1: receiver (re)started
>>>> Apr 26 11:19:42 primary kernel: [181722.067045] d-con resource1: conn( Unconnected -> WFConnection )
>>>> Apr 26 11:19:43 primary kernel: [181722.558370] d-con resource1: Handshake successful: Agreed network protocol version 101
>>>> Apr 26 11:19:43 primary kernel: [181722.558702] d-con resource1: Peer authenticated using 16 bytes HMAC
>>>> Apr 26 11:19:43 primary kernel: [181722.558747] d-con resource1: conn( WFConnection -> WFReportParams )
>>>> Apr 26 11:19:43 primary kernel: [181722.558754] d-con resource1: Starting asender thread (from drbd_r_resource [2039])
>>>> Apr 26 11:19:43 primary kernel: [181722.560436] block drbd1: drbd_sync_handshake:
>>>> Apr 26 11:19:43 primary kernel: [181722.560445] block drbd1: self DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
>>>> Apr 26 11:19:43 primary kernel: [181722.560454] block drbd1: peer 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
>>>> Apr 26 11:19:43 primary kernel: [181722.560466] block drbd1: uuid_compare()=100 by rule 90
>>>> Apr 26 11:19:43 primary kernel: [181722.560474] block drbd1: helper command: /bin/true initial-split-brain minor-1
>>>> Apr 26 11:19:43 primary kernel: [181722.565127] d-con resource1: conn( WFReportParams -> NetworkFailure )
>>>> Apr 26 11:19:43 primary kernel: [181722.565134] d-con resource1: asender terminated
>>>> Apr 26 11:19:43 primary kernel: [181722.565138] d-con resource1: Terminating asender thread
>>>> Apr 26 11:19:43 primary kernel: [181722.570459] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 primary kernel: [181722.570488] block drbd1: helper command: /bin/true split-brain minor-1
>>>> Apr 26 11:19:43 primary kernel: [181722.583047] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 primary kernel: [181722.583073] d-con resource1: conn( NetworkFailure -> Disconnecting )
>>>> Apr 26 11:19:43 primary kernel: [181722.583143] d-con resource1: Connection closed
>>>> Apr 26 11:19:43 primary kernel: [181722.586237] d-con resource1: conn( Disconnecting -> StandAlone )
>>>> Apr 26 11:19:43 primary kernel: [181722.586245] d-con resource1: receiver terminated
>>>> Apr 26 11:19:43 primary kernel: [181722.586249] d-con resource1: Terminating receiver thread
>>>> Apr 26 11:19:46 primary kernel: [181726.054479] br974: port 2(vif126.0) entering forwarding state
>>>> Apr 26 11:19:46 primary kernel: [181726.058824] br974: port 2(vif126.0) entering disabled state
>>>>
>>>> (Old) secondary:
>>>> Apr 26 11:19:42 secondary kernel: [1809212.315376] block drbd0: role( Secondary -> Primary )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.338517] block drbd1: role( Secondary -> Primary )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726247] d-con resource1: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726278] block drbd1: new current UUID 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726310] d-con resource1: asender terminated
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726340] d-con resource1: Terminating asender thread
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726719] d-con resource1: Connection closed
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726749] d-con resource1: conn( ProtocolError -> Unconnected )
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726755] d-con resource1: receiver terminated
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726759] d-con resource1: Restarting receiver thread
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726763] d-con resource1: receiver (re)started
>>>> Apr 26 11:19:42 secondary kernel: [1809212.726771] d-con resource1: conn( Unconnected -> WFConnection )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.226864] d-con resource1: Handshake successful: Agreed network protocol version 101
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227199] d-con resource1: Peer authenticated using 16 bytes HMAC
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227238] d-con resource1: conn( WFConnection -> WFReportParams )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.227245] d-con resource1: Starting asender thread (from drbd_r_resource [20607])
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231289] block drbd1: drbd_sync_handshake:
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231297] block drbd1: self 9CE29D13EEB7B4B3:041691050BDB6491:FA1A2A8EC7D3D7CF:FA192A8EC7D3D7CF bits:0 flags:0
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231306] block drbd1: peer DEEF411AB544C5D3:041691050BDB6491:FA1A2A8EC7D3D7CE:FA192A8EC7D3D7CF bits:3072 flags:0
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231315] block drbd1: uuid_compare()=100 by rule 90
>>>> Apr 26 11:19:43 secondary kernel: [1809213.231322] block drbd1: helper command: /bin/true initial-split-brain minor-1
>>>> Apr 26 11:19:43 secondary kernel: [1809213.232460] block drbd1: helper command: /bin/true initial-split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 secondary kernel: [1809213.232494] block drbd1: helper command: /bin/true split-brain minor-1
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233512] block drbd1: helper command: /bin/true split-brain minor-1 exit code 0 (0x0)
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233539] d-con resource1: conn( WFReportParams -> Disconnecting )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233574] d-con resource1: asender terminated
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233579] d-con resource1: Terminating asender thread
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233631] d-con resource1: Connection closed
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233662] d-con resource1: conn( Disconnecting -> StandAlone )
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233667] d-con resource1: receiver terminated
>>>> Apr 26 11:19:43 secondary kernel: [1809213.233672] d-con resource1: Terminating receiver thread
>>>>
>>>>
>>>> What am I doing wrong?
>>>> Is there a requirement to wait for a sync/propagation of properties/random amount of time before promoting the secondary to primary? Is this a bug?
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>> --
>>>> Thomas Thrainer | Software Engineer | [email protected] |
>>>>
>>>> Google Germany GmbH
>>>> Dienerstr. 12
>>>> 80331 München
>>>>
>>>> Registergericht und -nummer: Hamburg, HRB 86891
>>>> Sitz der Gesellschaft: Hamburg
>>>> Geschäftsführer: Graham Law, Katherine Stephens
>>>>
>>>> _______________________________________________
>>>> drbd-user mailing list
>>>> [email protected]
>>>> http://lists.linbit.com/mailman/listinfo/drbd-user
>>>>
>>>>
>>>
>>
>>
>> --
>> Thomas Thrainer | Software Engineer | [email protected] |
>>
>> Google Germany GmbH
>> Dienerstr. 12
>> 80331 München
>>
>> Registergericht und -nummer: Hamburg, HRB 86891
>> Sitz der Gesellschaft: Hamburg
>> Geschäftsführer: Graham Law, Katherine Stephens
>>
>
>
> _______________________________________________
> drbd-user mailing list
> [email protected]
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>

--
Thomas Thrainer | Software Engineer | [email protected] |

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Katherine Stephens
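
For anyone who wants to reproduce this, the two-step sequence discussed in this thread (reconfigure with net-options, then promote) could be scripted roughly as below. This is only a sketch: the helper names (`net_options_argv`, `primary_argv`, `enable_dual_primary`) and the `run` parameter are made up for illustration and are not Ganeti's code, the addresses are placeholders, and the drbdsetup arguments are copied from the commands quoted in this thread rather than checked against the man page.

```python
import subprocess


def net_options_argv(local_ip, remote_ip, port=11001):
    """Build the net-options call as quoted in this thread (hypothetical helper)."""
    return [
        "drbdsetup", "net-options",
        f"ipv4:{local_ip}:{port}",
        f"ipv4:{remote_ip}:{port}",
        "--allow-two-primaries=yes",
    ]


def primary_argv(minor):
    """Build the promotion call for one DRBD minor (hypothetical helper)."""
    return ["drbdsetup", "primary", str(minor)]


def enable_dual_primary(local_ip, remote_ip, minor, run=None):
    """Reconfigure the connection, then promote immediately, with no wait in between."""
    if run is None:
        # Default runner: execute the command and raise on a non-zero exit code.
        run = lambda argv: subprocess.run(argv, check=True)
    run(net_options_argv(local_ip, remote_ip))
    run(primary_argv(minor))


# Dry run: record the commands instead of executing them.
recorded = []
enable_dual_primary("192.0.2.1", "192.0.2.2", 1, run=recorded.append)
for argv in recorded:
    print(" ".join(argv))
```

Passing a recording function as `run` makes it possible to inspect the exact command lines without touching a real DRBD device.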
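
To make the kernel logs quoted in this thread easier to compare, here is a small sketch that pulls the `conn( A -> B )` transitions out of such log text. The function name `transitions` and the regular expression are my own; the sample lines are abridged from the (old) primary's log in this thread, with timestamps stripped. It assumes each log entry is on one unwrapped line.

```python
import re

# Matches DRBD state-change fragments such as "conn( Connected -> BrokenPipe )".
TRANSITION = re.compile(r"\b(role|peer|conn|disk|pdsk)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def transitions(log_text, field="conn"):
    """Return (old, new) state pairs for one field, in the order they were logged."""
    return [
        (old, new)
        for name, old, new in TRANSITION.findall(log_text)
        if name == field
    ]


sample = """\
d-con resource1: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
d-con resource1: conn( BrokenPipe -> Unconnected )
d-con resource1: conn( Unconnected -> WFConnection )
d-con resource1: conn( WFConnection -> WFReportParams )
d-con resource1: conn( WFReportParams -> NetworkFailure )
d-con resource1: conn( NetworkFailure -> Disconnecting )
d-con resource1: conn( Disconnecting -> StandAlone )
"""

result = transitions(sample)
for old, new in result:
    print(f"{old} -> {new}")
```

Run against both nodes' logs, this gives the connection-state path each side took on its way to StandAlone, which is easier to diff than the raw kernel messages.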
