Hello,
I have more than 10 sites in production, all with exactly the same standard M/S
(master/slave) installation:
2 nodes (server1 and server2)
Debian (kernel 3.2.68-1+deb7u2 x86_64 GNU/Linux)
DRBD 8.3.11 (api:88/proto:86-96)
Corosync 1.4.2-3
All sites are exactly identical because we deploy them with an automated
installation DVD built with SimpleCDD.
We have a serious problem on one site: sometimes the MASTER role switches from
server1 to server2 for no apparent reason, and then returns to server1.
Sometimes the system toggles 2 or 3 times before returning to its normal state.
The issue is not periodic. Sometimes it happens after 2 months of stability,
and sometimes only 15 days after the previous occurrence.
This situation is critical because the toggle can corrupt data, which shows up
as MySQL tables marked as crashed (and our software stops).
Could you help me determine the possible root causes of the cluster becoming
unstable?
I first suspected the LAN, but I ran some tests in the lab, and when we inject
errors on the LAN the log shows something like "conn( WFConnection ->
NetworkFailure )". This is not the case on the production site, so the LAN
seems to be OK.
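The check described above can be sketched in Python. This is only an
illustration, not part of the cluster setup: the function names are made up
for this example, and it assumes the kernel log text has the DRBD
"conn( Old -> New )" form shown in this mail, possibly wrapped across lines.

```python
import re

# Matches DRBD connection-state transitions such as
# "conn( WFConnection -> NetworkFailure )".
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(log_text):
    """Return (old_state, new_state) pairs in order of appearance."""
    # Collapse all whitespace first so transitions that were wrapped
    # across two lines by the mail client are still matched.
    flat = " ".join(log_text.split())
    return CONN_RE.findall(flat)


def has_network_failure(log_text):
    """True if any transition ends in the NetworkFailure state."""
    return any(new == "NetworkFailure" for _, new in conn_transitions(log_text))


if __name__ == "__main__":
    sample = (
        "May 1 08:15:14 server1 kernel: [3865117.842504] block drbd0: conn(\n"
        "Disconnecting -> StandAlone )\n"
    )
    print(conn_transitions(sample))     # [('Disconnecting', 'StandAlone')]
    print(has_network_failure(sample))  # False
```

Run against the production logs, this kind of check would report no
NetworkFailure transition, which matches what was observed by eye.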
Here are the production logs for server1 and server2:
Server1:
May 1 08:15:13 server1 kernel: [3865117.570629] block drbd0: role( Primary ->
Secondary )
May 1 08:15:13 server1 kernel: [3865117.570661] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
May 1 08:15:13 server1 kernel: [3865117.570671] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
May 1 08:15:14 server1 kernel: [3865117.842211] block drbd0: peer( Secondary
-> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 1 08:15:14 server1 kernel: [3865117.842384] block drbd0: asender terminated
May 1 08:15:14 server1 kernel: [3865117.842389] block drbd0: Terminating
drbd0_asender
May 1 08:15:14 server1 kernel: [3865117.842486] block drbd0: Connection closed
May 1 08:15:14 server1 kernel: [3865117.842504] block drbd0: conn(
Disconnecting -> StandAlone )
May 1 08:15:14 server1 kernel: [3865117.842587] block drbd0: receiver
terminated
May 1 08:15:14 server1 kernel: [3865117.842591] block drbd0: Terminating
drbd0_receiver
May 1 08:15:14 server1 kernel: [3865117.842595] block drbd0: disk( UpToDate ->
Failed )
May 1 08:15:14 server1 kernel: [3865117.842633] block drbd0: disk( Failed ->
Diskless )
May 1 08:15:14 server1 kernel: [3865117.842665] block drbd0: drbd_bm_resize
called with capacity == 0
May 1 08:15:14 server1 kernel: [3865117.842674] block drbd0: worker terminated
May 1 08:15:14 server1 kernel: [3865117.842678] block drbd0: Terminating
drbd0_worker
May 1 08:17:42 server1 kernel: [3865265.972144] block drbd0: Starting worker
thread (from drbdsetup [57250])
May 1 08:17:42 server1 kernel: [3865265.972269] block drbd0: disk( Diskless ->
Attaching )
May 1 08:17:42 server1 kernel: [3865265.975419] block drbd0: Found 4
transactions (192 active extents) in activity log.
May 1 08:17:42 server1 kernel: [3865265.975426] block drbd0: Method to ensure
write ordering: flush
May 1 08:17:42 server1 kernel: [3865265.975434] block drbd0: drbd_bm_resize
called with capacity == 3948344
May 1 08:17:42 server1 kernel: [3865265.975459] block drbd0: resync bitmap:
bits=493543 words=7712 pages=16
May 1 08:17:42 server1 kernel: [3865265.975464] block drbd0: size = 1928 MB
(1974172 KB)
May 1 08:17:42 server1 kernel: [3865265.975795] block drbd0: bitmap READ of 16
pages took 0 jiffies
May 1 08:17:42 server1 kernel: [3865265.975848] block drbd0: recounting of set
bits took additional 0 jiffies
May 1 08:17:42 server1 kernel: [3865265.975853] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
May 1 08:17:42 server1 kernel: [3865265.975861] block drbd0: disk( Attaching
-> UpToDate )
May 1 08:17:42 server1 kernel: [3865265.975866] block drbd0: attached to UUIDs
089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6
May 1 08:17:42 server1 kernel: [3865265.990736] block drbd0: conn( StandAlone
-> Unconnected )
May 1 08:17:42 server1 kernel: [3865265.990760] block drbd0: Starting receiver
thread (from drbd0_worker [57251])
May 1 08:17:42 server1 kernel: [3865265.990899] block drbd0: receiver
(re)started
May 1 08:17:42 server1 kernel: [3865265.990909] block drbd0: conn( Unconnected
-> WFConnection )
May 1 08:17:42 server1 kernel: [3865266.489465] block drbd0: Handshake
successful: Agreed network protocol version 96
May 1 08:17:42 server1 kernel: [3865266.489795] block drbd0: Peer
authenticated using 20 bytes of 'sha1' HMAC
May 1 08:17:42 server1 kernel: [3865266.489808] block drbd0: conn(
WFConnection -> WFReportParams )
May 1 08:17:42 server1 kernel: [3865266.489921] block drbd0: Starting asender
thread (from drbd0_receiver [57283])
May 1 08:17:42 server1 kernel: [3865266.490137] block drbd0:
data-integrity-alg: <not-used>
May 1 08:17:42 server1 kernel: [3865266.490167] block drbd0:
drbd_sync_handshake:
May 1 08:17:42 server1 kernel: [3865266.490173] block drbd0: self
089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6 bits:0
flags:0
May 1 08:17:42 server1 kernel: [3865266.490178] block drbd0: peer
8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6 bits:297
flags:0
May 1 08:17:42 server1 kernel: [3865266.490183] block drbd0: uuid_compare()=-1
by rule 50
May 1 08:17:42 server1 kernel: [3865266.490193] block drbd0: peer( Unknown ->
Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated )
pdsk( DUnknown -> UpToDate )
May 1 08:17:42 server1 kernel: [3865266.492311] block drbd0: conn( WFBitMapT
-> WFSyncUUID )
May 1 08:17:42 server1 kernel: [3865266.496178] block drbd0: updated sync uuid
089D9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6
May 1 08:17:42 server1 kernel: [3865266.496315] block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0
May 1 08:17:42 server1 kernel: [3865266.499086] block drbd0: helper command:
/sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
May 1 08:17:42 server1 kernel: [3865266.499097] block drbd0: conn( WFSyncUUID
-> SyncTarget ) disk( Outdated -> Inconsistent )
May 1 08:17:42 server1 kernel: [3865266.499110] block drbd0: Began resync as
SyncTarget (will sync 1188 KB [297 bits set]).
May 1 08:17:42 server1 kernel: [3865266.558764] block drbd0: Resync done
(total 1 sec; paused 0 sec; 1188 K/sec)
May 1 08:17:42 server1 kernel: [3865266.558774] block drbd0: updated UUIDs
8FE26C139FB94070:0000000000000000:089D9C45FDE4ABBC:089C9C45FDE4ABBC
May 1 08:17:42 server1 kernel: [3865266.558782] block drbd0: conn( SyncTarget
-> Connected ) disk( Inconsistent -> UpToDate )
May 1 08:17:42 server1 kernel: [3865266.558838] block drbd0: helper command:
/sbin/drbdadm after-resync-target minor-0
May 1 08:17:42 server1 kernel: [3865266.561386] block drbd0: helper command:
/sbin/drbdadm after-resync-target minor-0 exit code 0 (0x0)
May 1 08:17:42 server1 kernel: [3865266.561574] block drbd0: bitmap WRITE of
10 pages took 0 jiffies
May 1 08:17:42 server1 kernel: [3865266.561583] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
May 1 08:17:49 server1 kernel: [3865273.492362] block drbd0: peer( Primary ->
Secondary )
May 1 08:17:50 server1 kernel: [3865273.759294] block drbd0: role( Secondary
-> Primary )
May 1 08:17:50 server1 kernel: [3865273.953401] EXT4-fs (drbd0): mounted
filesystem with ordered data mode. Opts: (null)
Server2:
May 1 06:47:02 server2 lpd[23019]: restarted
May 1 08:08:36 server2 kernel: [3865172.121119] block drbd0: peer( Primary ->
Secondary )
May 1 08:08:36 server2 kernel: [3865172.357331] block drbd0: peer( Secondary
-> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
May 1 08:08:36 server2 kernel: [3865172.357554] block drbd0: asender terminated
May 1 08:08:36 server2 kernel: [3865172.357562] block drbd0: Terminating
drbd0_asender
May 1 08:08:36 server2 kernel: [3865172.357739] block drbd0: Connection closed
May 1 08:08:36 server2 kernel: [3865172.357749] block drbd0: conn( TearDown ->
Unconnected )
May 1 08:08:36 server2 kernel: [3865172.357759] block drbd0: receiver
terminated
May 1 08:08:36 server2 kernel: [3865172.357762] block drbd0: Restarting
drbd0_receiver
May 1 08:08:36 server2 kernel: [3865172.357766] block drbd0: receiver
(re)started
May 1 08:08:36 server2 kernel: [3865172.357771] block drbd0: conn( Unconnected
-> WFConnection )
May 1 08:08:36 server2 kernel: [3865172.601231] block drbd0: role( Secondary
-> Primary )
May 1 08:08:36 server2 kernel: [3865172.601432] block drbd0: new current UUID
8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6
May 1 08:08:36 server2 kernel: [3865172.785933] EXT4-fs (drbd0): mounted
filesystem with ordered data mode. Opts: (null)
May 1 08:11:05 server2 kernel: [3865321.006344] block drbd0: Handshake
successful: Agreed network protocol version 96
May 1 08:11:05 server2 kernel: [3865321.006741] block drbd0: Peer
authenticated using 20 bytes of 'sha1' HMAC
May 1 08:11:05 server2 kernel: [3865321.006754] block drbd0: conn(
WFConnection -> WFReportParams )
May 1 08:11:05 server2 kernel: [3865321.006890] block drbd0: Starting asender
thread (from drbd0_receiver [6537])
May 1 08:11:05 server2 kernel: [3865321.007160] block drbd0:
data-integrity-alg: <not-used>
May 1 08:11:05 server2 kernel: [3865321.007199] block drbd0:
drbd_sync_handshake:
May 1 08:11:05 server2 kernel: [3865321.007205] block drbd0: self
8FE26C139FB94071:089C9C45FDE4ABBC:62F362153A7923B6:62F262153A7923B6 bits:297
flags:0
May 1 08:11:05 server2 kernel: [3865321.007224] block drbd0: peer
089C9C45FDE4ABBC:0000000000000000:62F362153A7923B6:62F262153A7923B6 bits:0
flags:0
May 1 08:11:05 server2 kernel: [3865321.007234] block drbd0: uuid_compare()=1
by rule 70
May 1 08:11:05 server2 kernel: [3865321.007244] block drbd0: peer( Unknown ->
Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
May 1 08:11:05 server2 kernel: [3865321.010000] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0
May 1 08:11:05 server2 kernel: [3865321.012921] block drbd0: helper command:
/sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
May 1 08:11:05 server2 kernel: [3865321.012932] block drbd0: conn( WFBitMapS
-> SyncSource ) pdsk( Consistent -> Inconsistent )
May 1 08:11:05 server2 kernel: [3865321.012942] block drbd0: Began resync as
SyncSource (will sync 1188 KB [297 bits set]).
May 1 08:11:05 server2 kernel: [3865321.012958] block drbd0: updated sync UUID
8FE26C139FB94071:089D9C45FDE4ABBC:089C9C45FDE4ABBC:62F362153A7923B6
May 1 08:11:05 server2 kernel: [3865321.076076] block drbd0: Resync done
(total 1 sec; paused 0 sec; 1188 K/sec)
May 1 08:11:05 server2 kernel: [3865321.076086] block drbd0: updated UUIDs
8FE26C139FB94071:0000000000000000:089D9C45FDE4ABBC:089C9C45FDE4ABBC
May 1 08:11:05 server2 kernel: [3865321.076096] block drbd0: conn( SyncSource
-> Connected ) pdsk( Inconsistent -> UpToDate )
May 1 08:11:05 server2 kernel: [3865321.076342] block drbd0: bitmap WRITE of
10 pages took 0 jiffies
May 1 08:11:05 server2 kernel: [3865321.076351] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
May 1 08:11:12 server2 kernel: [3865328.009119] block drbd0: role( Primary ->
Secondary )
May 1 08:11:12 server2 kernel: [3865328.009190] block drbd0: bitmap WRITE of 0
pages took 0 jiffies
May 1 08:11:12 server2 kernel: [3865328.009202] block drbd0: 0 KB (0 bits)
marked out-of-sync by on disk bit-map.
May 1 08:11:12 server2 kernel: [3865328.276278] block drbd0: peer( Secondary
-> Primary )
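To make the sequence of switches easier to follow, the role changes can be
extracted from the logs above with a small script. This is a rough sketch
under stated assumptions: the helper names are hypothetical, the input is the
syslog text as pasted here (with long lines wrapped by the mail client), and
only the local "role( Old -> New )" transitions are reported.

```python
import re

# A new syslog record starts with "Mon D HH:MM:SS host ...".
RECORD_START = re.compile(r"^([A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")
# Local role transition, e.g. "role( Primary -> Secondary )".
ROLE_RE = re.compile(r"\brole\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def unwrap(lines):
    """Join wrapped continuation lines back onto their syslog record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Anything that does not start with a timestamp is treated
            # as the continuation of the previous record.
            records[-1] += " " + line
    return records


def role_changes(lines):
    """Return (timestamp, host, old_role, new_role) for each role change."""
    changes = []
    for record in unwrap(lines):
        start = RECORD_START.match(record)
        if not start:
            continue
        stamp, host, rest = start.groups()
        role = ROLE_RE.search(rest)
        if role:
            changes.append((stamp, host, role.group(1), role.group(2)))
    return changes


if __name__ == "__main__":
    sample = [
        "May 1 08:15:13 server1 kernel: [3865117.570629] block drbd0: role( Primary ->",
        "Secondary )",
        "May 1 08:15:13 server1 kernel: [3865117.570661] block drbd0: bitmap WRITE of 0",
        "pages took 0 jiffies",
    ]
    for change in role_changes(sample):
        print(change)  # ('May 1 08:15:13', 'server1', 'Primary', 'Secondary')
```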
Thanks for your answers,
Best regards,
Benjamin Linier