Yes, I've seen the posts you suggested.
Have you tested it?
We are using the qib driver for QLogic 7342 adapters; I don't know whether
anyone has ported the fix to it as well.
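
For reference, this is roughly how we check which HCA driver a node is
actually using (standard sysfs path and module names; adjust if your
stack differs):

  # list the InfiniBand devices the kernel knows about
  ls /sys/class/infiniband
  # see which of the candidate HCA modules are loaded
  lsmod | grep -E 'ib_qib|ib_mthca|mlx4_ib'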

Regards,

--
matteo

----- Original Message -----
> From: "Eric Blevins" <[email protected]>
> To: "Matteo Tescione" <[email protected]>
> Cc: [email protected]
> Sent: Friday, 16 January 2015 15:50:17
> Subject: Re: [DRBD-user] Possible IPoIB deadlock with DRBD
> 
> The split brain would only happen on dual primary.
> 
> We have Mellanox MHEA28-XTC using mthca driver.
> 
> The potential IPoIB deadlock is only fixed in the mlx4 driver so far.
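> 
> One mitigation sometimes suggested for this class of allocation
> deadlock (my own guess, not something the LKML threads prescribe for
> DRBD) is to enlarge the kernel's free-memory reserve, so allocations
> made while dirty pages drain over IPoIB are less likely to stall:
> 
>   # reserve ~256 MB of free memory; the value is a guess, tune per host
>   sysctl -w vm.min_free_kbytes=262144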
> 
> 
> 
> common {
>   net {
>     connect-int 20;         # default 10, unit 1 s
>     timeout 180;            # default 60, unit 0.1 s (18 s)
>     ping-int 30;            # default 10, unit 1 s
>     ping-timeout 10;        # default 5, unit 0.1 s (1 s)
>     ko-count 20;            # drop peer after 20 * timeout of stalled writes
>     max-buffers 16000;      # buffers for receiving/writing data
>     max-epoch-size 16000;   # max write requests between two barriers
>     sndbuf-size 0;          # 0 = auto-tune socket send buffer
>     rcvbuf-size 0;          # 0 = auto-tune socket receive buffer
>     unplug-watermark 16001; # > max-buffers, so explicit unplugging never kicks in
>     verify-alg md5;         # checksum used by online verify
>   }
>   disk {
>     c-plan-ahead 10;    # dynamic resync controller plans 1 s ahead (unit 0.1 s)
>     c-min-rate 30M;     # resync floor while application I/O is active
>     c-max-rate 200M;    # resync rate ceiling
>     c-fill-target 20M;  # resync data to keep in flight
>     al-extents 3389;    # activity log size
>     md-flushes no;      # flushes/barriers off: assumes non-volatile write cache
>     disk-barrier no;
>     disk-flushes no;
>   }
> }
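> 
> For context on the numbers above: DRBD expects timeout to stay below
> both connect-int and ping-int, which holds here (18 s < 20 s < 30 s),
> and with ko-count 20 a peer that stops acknowledging writes is dropped
> after roughly
> 
>   ko-count * timeout = 20 * 18 s = 360 s
> 
> of stalled I/O.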
> resource drbd0 {
>   device /dev/drbd0;
>   disk /dev/sdc;
>   meta-disk internal;
>   startup {
>     wfc-timeout  120;
>     degr-wfc-timeout 60;
>     outdated-wfc-timeout 60;
>     become-primary-on both;
>   }
>   disk {
>     c-max-rate 200M;
>     c-min-rate 30M;
>     c-fill-target 20M;
>     c-plan-ahead 10;
>   }
>   net {
>     protocol C;
>     cram-hmac-alg sha1;
>     shared-secret "XXXXXXXXXXXXX";
>     allow-two-primaries;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri disconnect;
>   }
>   on vm1 {
>     address x.x.x.1:7788;
>   }
>   on vm2 {
>     address x.x.x.2:7788;
>   }
> }
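> 
> To double-check which values are actually in effect at runtime (the
> resource-level disk section overrides the common one), something like
> 
>   drbdsetup show drbd0
> 
> or 'drbdadm dump drbd0' should print the live/parsed configuration.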
> 
> On Fri, Jan 16, 2015 at 5:32 AM, Matteo Tescione <[email protected]>
> wrote:
> > Hi Eric,
> >
> > It seems that I'm hitting the same deadlock, but I don't use dual
> > primary and the split brain never occurs.
> >
> > Can you post your DRBD config, along with the InfiniBand HBA model
> > and the version you're using?
> >
> > regards,
> >
> > --
> > matteo
> >
> > ----- Original Message -----
> >> From: "Eric Blevins" <[email protected]>
> >> To: [email protected]
> >> Sent: Thursday, 15 January 2015 17:53:48
> >> Subject: [DRBD-user] Possible IPoIB deadlock with DRBD
> >>
> >> We are using Proxmox with DRBD in dual primary, with IPoIB as the
> >> transport. We recently tested Proxmox's upcoming 3.10 kernel, based
> >> on the kernel from RHEL 7, and started having problems with DRBD.
> >>
> >> That kernel ships with DRBD 8.4.3; I have also compiled and
> >> installed 8.4.5, and both exhibit the same problem.
> >>
> >> Under heavy I/O load (backups) DRBD times out and split-brains; I
> >> have included some logs below.
> >> I stumbled on a couple of LKML threads that discuss a deadlock
> >> between IPoIB and I/O carried over IPoIB, such as iSCSI or NFS:
> >> https://lkml.org/lkml/2014/2/21/655
> >> http://lkml.org/lkml/2014/4/24/543
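> >>
> >> (One knob we have been experimenting with, purely a guess on our
> >> part, is shrinking writeback bursts so that less dirty data has to
> >> drain over IPoIB at once during the backups, e.g.:
> >>
> >>   sysctl -w vm.dirty_background_ratio=5
> >>   sysctl -w vm.dirty_ratio=10
> >>
> >> The values are illustrative, not a recommendation.)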
> >>
> >> Is it likely that DRBD could also trigger the deadlock discussed
> >> on LKML?
> >> If not, do you have any other suggestions on how I can prevent
> >> this timeout?
> >>
> >>
> >> Node A:
> >> Jan  5 03:23:51 vm6 kernel: [2221944.335766] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> >> Jan  5 03:23:51 vm6 kernel: [2221944.335782] drbd drbd0: asender terminated
> >> Jan  5 03:23:51 vm6 kernel: [2221944.335784] drbd drbd0: Terminating drbd_a_drbd0
> >> Jan  5 03:23:51 vm6 kernel: [2221944.335846] block drbd0: new current UUID BD9DB97EC672F5C9:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
> >> Jan  5 03:23:51 vm6 kernel: [2221944.347788] drbd drbd0: Connection closed
> >> Jan  5 03:23:51 vm6 kernel: [2221944.347834] drbd drbd0: conn( Timeout -> Unconnected )
> >> Jan  5 03:23:51 vm6 kernel: [2221944.347836] drbd drbd0: receiver terminated
> >>
> >>
> >> Node B:
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170391] drbd drbd0: sock was shut down by peer
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170409] drbd drbd0: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170412] drbd drbd0: short read (expected size 16)
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170421] drbd drbd0: asender terminated
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170423] drbd drbd0: Terminating drbd_a_drbd0
> >> Jan  5 03:23:51 vm5 kernel: [2223090.170480] block drbd0: new current UUID 2628F73F9DAE5EDF:8F2DD469C771058B:925C07CF6316212D:925B07CF6316212D
> >> Jan  5 03:23:51 vm5 kernel: [2223090.185536] drbd drbd0: Connection closed
> >> Jan  5 03:23:51 vm5 kernel: [2223090.185585] drbd drbd0: conn( BrokenPipe -> Unconnected )
> >> Jan  5 03:23:51 vm5 kernel: [2223090.185587] drbd drbd0: receiver terminated
> >>
> >> Eric
> 
> 
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
