[DRBD-user] Strange behavior with non-default ping timeout since 8.3.11

2012-09-19 Thread Andreas Hofmeister

Hi all,

We use drbd 8.3.11 as a dual-primary in a pacemaker (1.0.x) cluster.

In our setup we need a somewhat larger ping-timeout (2s) because of 
interruptions during a firewall restart. That used to work well with 
8.3.10, but since 8.3.11 it causes crm resource stop/start sequences to fail.
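
For reference, the relevant part of our drbd.conf looks roughly like 
this (resource name made up, everything unrelated omitted; note that 
ping-timeout is given in tenths of a second):

  resource r0 {
    net {
      ping-timeout 20;   # 2 seconds
      ping-int     10;   # seconds between keep-alive packets
    }
    # disks, addresses etc. omitted
  }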


A git bisect showed that this effect occurs since 
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=a0c9e5442e3be2d17772f50e1cf1d714cbddc51d


It seems that PCMK executes the sequence drbdadm up + drbdadm primary 
rather quickly. If the drbdadm primary happens while DRBD is still 
waiting for the connection to be established (WFConnection), the 
resource startup fails: a split-brain is detected, and automatic 
resolution then fails because both sides are already primary at that point.


The above patch prolongs the window during which the problem can occur: 
with the old 100ms connection timeout it was rather unlikely to happen; 
with a 2s timeout it is almost guaranteed.


We were able to reproduce the problem with ping-timeout 20 on a running 
dual-primary with


  drbdadm down res; drbdadm up res; drbdadm primary res

This sequence however works:

  drbdadm down res; drbdadm up res; drbdadm wait-connect res; \
  drbdadm primary res

Our test setup was a 3.0.41 kernel running drbd 8.3.13 under KVM.

Putting this

  test $rc = $OCF_SUCCESS && drbdadm wait-connect $DRBD_RESOURCE

into the drbd_start function of the RA seems to work for us.
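
Roughly like this - heavily simplified, the real drbd_start of the RA 
does more than a plain "drbdadm up", so take it as a sketch of where 
the wait-connect goes, not as a patch:

  drbd_start() {
      # ... bring the resource up as the RA already does ...
      drbdadm up $DRBD_RESOURCE
      rc=$?
      # do not return before the connection attempt has settled,
      # otherwise a quick "primary" races with WFConnection
      test $rc = $OCF_SUCCESS && drbdadm wait-connect $DRBD_RESOURCE
      return $rc
  }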

Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] drbdsetup occasionally hangs

2012-09-19 Thread Andreas Hofmeister

Hi,

We have drbd 8.3.11 or 8.3.13 dual-primary on a pacemaker cluster 
running on kernel 3.0.41.


The cluster just does its work, nothing is stopped or started, and then, 
after a week or so, drbdsetup locks up (see the kernel trace below) when 
we want to administer a resource.


Usually only one of several resources is affected, sometimes two.


We have seen several such traces, with different drbdsetup sub-commands, 
all ending at the same place.


Could this be the problem addressed by

http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=c586d79e49135831dbe0629e2d9a7b3739c615ef
Fix comparison of is_valid_transition()'s return code

in 8.4 ?

We fiddled that patch into an 8.3.13 build, which is currently running 
on a test machine, but since the problem only appears now and then it is 
hard to say whether it is gone.


Does anyone have an idea how we get into this state?

TIA
  Andi

---8---
032012 Sep 10 17:17:01 cnode1 [609601.848157] INFO: task
drbdsetup:5670 blocked for more than 120 seconds.
032012 Sep 10 17:17:01 cnode1 [609601.848160] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
062012 Sep 10 17:17:01 cnode1 [609601.848162] drbdsetup   D
 0  5670  1 0x0004
042012 Sep 10 17:17:01 cnode1 [609601.848166]  88000f423968
0082 88003ffd7c00 88000f423fd8
042012 Sep 10 17:17:01 cnode1 [609601.848170]  88000f423838
00012340 00012340 00012340
042012 Sep 10 17:17:01 cnode1 [609601.848173]  00012340
00012340 88000ee045c0 00012340
042012 Sep 10 17:17:01 cnode1 [609601.848177] Call Trace:
042012 Sep 10 17:17:01 cnode1 [609601.852026]  [8103960c] ?
spin_unlock_irqrestore+0x9/0xb
042012 Sep 10 17:17:01 cnode1 [609601.880322]  [810416d6] ?
__wake_up+0x43/0x50
042012 Sep 10 17:17:01 cnode1 [609601.884293]  [a03a745f] ?
put_ldev+0x85/0x8a [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.916943]  [a03a7be5] ?
is_valid_state+0x73/0x1e3 [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.916953]  [a03a698f] ?
spin_unlock_irqrestore+0x9/0xb [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.916969]  [a03a7e22] ?
_req_st_cond+0xcd/0xdf [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.919191]  [815ad428]
schedule+0x44/0x46
042012 Sep 10 17:17:01 cnode1 [609601.919208]  [a03aadb2]
drbd_req_state+0x1b6/0x2df [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.919224]  [8105f3cc] ?
wake_up_bit+0x23/0x23
042012 Sep 10 17:17:01 cnode1 [609601.919241]  [a03aaefd]
_drbd_request_state+0x22/0xb2 [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.919252]  [810bbcb6] ?
zone_statistics+0x77/0x7e
042012 Sep 10 17:17:01 cnode1 [609601.920356]  [810ab9da] ?
set_page_refcounted+0xd/0x1a
042012 Sep 10 17:17:01 cnode1 [609601.920401]  [810ade41] ?
get_page_from_freelist+0x58b/0x64d
042012 Sep 10 17:17:01 cnode1 [609601.920446]  [a03b1895]
drbd_nl_invalidate+0xa1/0x133 [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.920462]  [a03b1c1d]
drbd_connector_callback+0x104/0x195 [drbd]
042012 Sep 10 17:17:01 cnode1 [609601.924378]  [a026446a]
cn_rx_skb+0xb0/0xd2 [cn]
042012 Sep 10 17:17:01 cnode1 [609601.936338]  [81514227]
netlink_unicast+0xe2/0x14b
042012 Sep 10 17:17:01 cnode1 [609601.963889]  [814f1ea6] ?
memcpy_fromiovec+0x42/0x73
042012 Sep 10 17:17:01 cnode1 [609601.963897]  [8151545c]
netlink_sendmsg+0x230/0x250
042012 Sep 10 17:17:01 cnode1 [609601.963909]  [814e71c1]
__sock_sendmsg_nosec+0x55/0x62
042012 Sep 10 17:17:01 cnode1 [609601.963913]  [814e8456]
__sock_sendmsg+0x39/0x42
042012 Sep 10 17:17:01 cnode1 [609601.963917]  [814e8c2e]
sock_sendmsg+0xa3/0xbc
042012 Sep 10 17:17:01 cnode1 [609601.963920]  [810c1137] ?
handle_pte_fault+0x2ef/0x843
042012 Sep 10 17:17:01 cnode1 [609601.963924]  [810c1e32] ?
handle_mm_fault+0x19c/0x1b3
042012 Sep 10 17:17:01 cnode1 [609601.963936]  [810eedbe] ?
fget_light+0x2f/0x7c
042012 Sep 10 17:17:01 cnode1 [609601.963939]  [814e8c71] ?
sockfd_lookup_light+0x1b/0x53
042012 Sep 10 17:17:01 cnode1 [609601.963943]  [814e91b6]
sys_sendto+0xfa/0x11f
042012 Sep 10 17:17:01 cnode1 [609601.963946]  [8151355b] ?
netlink_table_ungrab+0x2e/0x30
042012 Sep 10 17:17:01 cnode1 [609601.963949]  [81515609] ?
netlink_bind+0x106/0x11c
042012 Sep 10 17:17:01 cnode1 [609601.963952]  [814e9c33] ?
sys_bind+0x7d/0x91
042012 Sep 10 17:17:01 cnode1 [609601.963955]  [810ebd14] ?
spin_lock+0x9/0xb
042012 Sep 10 17:17:01 cnode1 [609601.963960]  [815b3a92]
system_call_fastpath+0x16/0x1b
---8---


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Cluster filesystem question

2011-11-29 Thread Andreas Hofmeister

On 29.11.2011 21:17, Florian Haas wrote:

On Mon, Nov 28, 2011 at 9:26 PM, Lars Ellenberg



As this sort of issue currently pops up on IRC every other day, I've
just posted this rant:

http://fghaas.wordpress.com/2011/11/29/dual-primary-drbd-iscsi-and-multipath-dont-do-that/


Yes Florian, I did get that from Lars' response.

I still want to understand what the actual problems are. And no, google 
does not actually help in this regard.


If I knew what questions to ask and where to look, I could beg or bribe 
the right people or even find the right code to patch.


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Cluster filesystem question

2011-11-29 Thread Andreas Hofmeister

On 30.11.2011 00:15, John Lauro wrote:


The problem is that part of iSCSI is a client saying, I want exclusive
access to a certain block(s), and then no other client can access that
block.


Yes, that reservation stuff is something a cluster aware iSCSI target 
would have to synchronize among the nodes.


However, the original claim I was wondering about was that multipath 
would not work with current iSCSI targets on Linux.


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Cluster filesystem question

2011-11-29 Thread Andreas Hofmeister

On 30.11.2011 02:08, John Lauro wrote:



Andreas, what happens if you block your two nodes from talking directly
to each other, but allow the client to talk to both?


Mhh, it basically depends on what exactly happens when the split occurs.

If the writes on both nodes succeed, the client sees an acknowledgement 
from both nodes and will never retry. And since it is impossible to 
merge the changes when recovering from the split, the writes to one of 
the nodes would be lost.


If the write operations block until the split is resolved (by shooting 
one node), the client should eventually detect a failure and deal with 
it, i.e. retry the request against the good target.


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Cluster filesystem question

2011-11-28 Thread Andreas Hofmeister

On 28.11.2011 21:26, Lars Ellenberg wrote:

On Fri, Nov 25, 2011 at 09:11:39PM +0100, Andreas Hofmeister wrote:

On 25.11.2011 17:08, John Lauro wrote:




@list: did anybody try such a thing ?


dual-primary iscsi targets for multipath: does not work.

iSCSI is a stateful protocol, there is more to it than just reads and 
writes.
To run multipath (or multi-connections per session)
against *DISTINCT* targets [*] on separate nodes

** you'd need to have cluster aware iSCSI targets **

which coordinate with each other in some fashion.


Mhh, I am just wondering what exactly could explode.

Surely one would (have to) synchronize the more static configuration 
aspects - ACLs, authentication etc. - in some way.


Things like iSCSI reservations or limits on concurrent access to a 
single target are likely something that would require support from the 
target infrastructure. And indeed, it seems that none of the iSCSI 
targets on Linux support that.


But for the pure read/write aspects, multipathing requires support from 
the initiator anyway, no?


TCP, for example, cannot guarantee the ordering of packets across 
different TCP streams in the first place, so the order of simultaneous 
writes to different portals (or of a write on one path and a flush on 
the other) is pretty much undefined anyway, no?





To my knowledge, this does not exist (not for linux, anyways).

[*] which happen to live on top of data that, due to replication,
happens to be the same, most of the time, unless the replication link
was lost for whatever reason; in which case you absolutely want to make
sure that at least one box reboots hard before it even thinks about
completing or even submitting an other IO request...


... which is just another reason to have proper I/O fencing or STONITH 
in place, no?
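
For the archives, with Pacemaker that would be something along these 
lines in drbd.conf (resource name made up, everything else omitted; the 
fence/unfence scripts ship with DRBD):

  resource r0 {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
  }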


Ciao
  Andi


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] DRBD inside KVM virtual machine

2011-10-21 Thread Andreas Hofmeister

On 21.10.2011 12:00, Nick Morrison wrote:

Am I mad?  Should it work?

It does. We run DRBD + Pacemaker + OCFS2 for testing. I would not use 
such a setup in production, though.



   Will performance suck compared with running DRBD directly on the physical 
machines?  I understand I will probably have high CPU usage during DRBD 
syncing, as QEMU's IO (even virtio) will probably load up the CPU, but perhaps 
this will be minimal,


You will see noticeably higher latencies for both disk and network. 
macvtap may help a bit with the latter; without it, your network 
throughput would be limited, too.
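
With libvirt, macvtap is just an interface of type 'direct', e.g. (host 
NIC name made up):

  <interface type='direct'>
    <source dev='eth1' mode='bridge'/>
    <model type='virtio'/>
  </interface>

Keep in mind that in bridge mode the guest cannot talk to its own host 
over that interface.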



  or perhaps I can configure QEMU to let the VM guest talk very directly to the 
physical host's block device..


PCI passthrough is somewhat problematic. Some 
Chipset/Board/BIOS/Kernel/PCIe-Card/Driver combinations may even work, 
but there is a good chance that you will see poor performance and/or 
inexplicable crashes. Unless you have a LOT of time to track down such 
problems, do not use PCI passthrough.


Ciao
  Andi


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Using names instead of IPs in resource 'address' entries

2011-09-06 Thread Andreas Hofmeister

On 06.09.2011 19:07, Lars Ellenberg wrote:

The basic problem: if the link goes away because of one node changing its
IP, we cannot possibly do a dns lookup from kernel space...
Well, we could. But that would be a good reason to kick us out of the
kernel and never talk to us again ;)

Yes, if you did that in kernel space, I guess.

But apparently there is an interface to let user space handle name 
lookups, see Documentation/networking/dns_resolver.txt in recent kernel 
sources. It seems to be used by CIFS and NFS.


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Directly connected GigE ports bonded together no switch

2011-08-10 Thread Andreas Hofmeister

On 10.08.2011 19:04, Herman wrote:

If this is the case, maybe arp monitoring is more reliable for direct
connections since NIC failure (which may fail but still have link up) is
more likely than cable failure?  Maybe I don't have a good understanding
of this.


With switches in between, ARP monitoring is a bit dangerous because you 
either need a switch that answers the ARP queries (i.e. it must be 
manageable) or another machine that answers them. But then, what happens 
when that other machine is down ...


With direct connections between hosts this does not matter though - 
whether the other side fails to answer because of a broken cable, an 
exploded NIC or just a plain reboot is all the same.
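
For a back-to-back link, the bonding setup would look something like 
this (addresses made up; the point is arp_ip_target pointing at the peer 
instead of using miimon):

  # e.g. /etc/modprobe.d/bonding.conf
  options bonding mode=active-backup arp_interval=1000 arp_ip_target=192.168.2.2
  # arp_interval is in milliseconds, arp_ip_target is the peer's address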



In addition, I tried to use scp to test the throughput through the
bonded link, but I actually got almost the same results via
active-backup as with balance-rr.  Am I doing something wrong?



With SSH you will basically see the rate at which the hosts can 
en-/decrypt packets. Better try something like iperf, or a file server 
protocol that does not encrypt traffic, NFS or FTP for example.
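
E.g. with iperf (peer address made up):

  iperf -s                    # on one node
  iperf -c 192.168.2.2 -t 30  # on the other, a single TCP stream

The single-stream number is the interesting one here, since only 
round-robin can spread one TCP connection across both links.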


Ciao
  Andi


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Directly connected GigE ports bonded together no switch

2011-08-09 Thread Andreas Hofmeister

On 09.08.2011 16:46, Herman wrote:


Also, right now I'm using mode=active-backup.  Would one of the 
other modes allow higher throughput and still allow automatic failover 
and transparency to DRBD?


Try round-robin (balance-rr) in your situation; it is the only bonding 
mode that gives higher throughput for a single connection.
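
Something like this should do (miimon value to taste):

  # e.g. /etc/modprobe.d/bonding.conf
  options bonding mode=balance-rr miimon=100

Be prepared for some packet reordering with balance-rr; TCP copes, but 
it can cost some throughput.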


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] 8.3.11 on RHEL 5.7 w/ MTU of 9000 fails

2011-07-23 Thread Andreas Hofmeister

On 23.07.2011 06:30, Digimer wrote:

   Any way to debug this? If it looks like a DRBD bug,


I don't think so. I have DRBD running with jumbo frames on several 
machines and it just works (albeit with 8.3.10).


Check your networking.

Try "ping -c1 -Mdo -s 8192 <other node>"; that should tell you whether 
you get jumbo frames to the other side or whether there is a problem on 
the IP layer.


If you get something like "From <other node> icmp_seq=1 Frag needed and 
DF set (mtu = <actual MTU>)", check your devices' MTUs and check your 
routing, as it is actually possible to define an MTU per route.
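
To check both (interface and address made up):

  ip link show eth1          # the device MTU
  ip route get 192.168.2.77  # the route actually used; a per-route MTU shows up here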


If you just get no answer, or the response is too short, check the 
specs of your network hardware. The supported size for jumbo frames 
varies widely, not just between vendors but also between NIC chips from 
the same vendor. In some cases, even different chips supported by the 
same driver differ in the maximum frame size.


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] 8.3.11 on RHEL 5.7 w/ MTU of 9000 fails

2011-07-23 Thread Andreas Hofmeister

On 23.07.2011 21:54, Digimer wrote:


Thanks for the reply Andy.

The hardware (Intel Pro1000/CT NICs[1] and a D-Link DGS-3100-24[2]) 
both support 9kb+ frames (and JFs are enabled in the switch). With the 
MTU on either node's DRBD interface (eth1) set to 9000, I confirmed 
that JFs worked using a method very similar to what you suggested:


 --- an-node07.sn ping statistics ---
 1 packets transmitted, 1 received, 0% packet loss, time 0ms
 rtt min/avg/max/mdev = 0.686/0.686/0.686/0.000 ms

 node2:
 [root@an-node07 ~]# ifconfig eth1 mtu 9000
 [root@an-node07 ~]# ifconfig eth1
 eth1      Link encap:Ethernet  HWaddr 00:1B:21:72:96:F2
           inet addr:192.168.2.77  Bcast:192.168.2.255  Mask:255.255.255.0
           inet6 addr: fe80::21b:21ff:fe72:96f2/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
           RX packets:30593 errors:0 dropped:0 overruns:0 frame:0
           TX packets:26756 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:31161397 (29.7 MiB)  TX bytes:30838239 (29.4 MiB)
           Interrupt:17 Memory:feae-feb0

 [root@an-node07 ~]# clear; tcpdump -i eth1 icmp
 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
 listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
 15:45:10.474134 IP an-node06.sn > an-node07.sn: ICMP echo request, id 11554, seq 1, length 8908
 15:45:10.475236 IP an-node07.sn > an-node06.sn: ICMP echo reply, id 11554, seq 1, length 8908




Ah. The '-Mdo' argument to ping just spares you the 'tcpdump'.

1. 
http://www.intel.com/products/desktop/adapters/gigabit-ct/gigabit-ct-overview.htm

2. http://dlink.ca/products/?pid=DGS-3100-24


Mhh, I have a pair of nodes - actually with DRBD 8.3.11, not 8.3.10 as 
I thought - and Intel 82571EB NICs working rather well, but then these 
are connected back-to-back.


We had some bad experience with a D-Link switch in the past. I don't 
remember the model, but we eventually scrapped it because of the trouble.


If you use pacemaker+corosync, you may want to check the netmtu 
setting in corosync.conf. That should not affect DRBD though.
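
I.e. something like this in the totem section (the value is just an 
example for a 9000-byte network, leaving room for the headers):

  totem {
      version: 2
      netmtu: 8982
      # ...
  }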


Ciao
  Andi
___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Poor DRBD performance, HELP!

2011-06-26 Thread Andreas Hofmeister

Hi,

On 21.06.2011 17:36, Felix Frank wrote:

On 06/21/2011 05:15 PM, Noah Mehl wrote:

The results are the same :(

Sorry to hear it.


'dd' is probably not a good benchmark when throughput approaches the 
1GByte/s range.


With 'dd' on a system with similar performance (8x 15k SAS disks, LSI 
controller, 10GbE), I get 1 GByte/s into the local block device but only 
600 MByte/s into the DRBD device. Using another benchmark (fio with 
async I/O and larger values for the I/O depth), I can get up to about 
1.2 GByte/s in both cases.
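
The fio run was roughly like this (device name and sizes made up, the 
point is libaio plus a deep queue):

  fio --name=seqwrite --filename=/dev/drbd0 --rw=write --bs=1M \
      --ioengine=libaio --direct=1 --iodepth=32 --size=10G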


'dd' uses a simple read-write loop which is prone to throughput 
degradation due to latencies. Even though DRBD only adds the network 
latency, the amount of added latency becomes significant compared to the 
latency from the actual disk writes.


Also, Noah probably used far too big buffers for dd. It seems there is 
some significant overhead when the kernel has to move 1 GByte chunks in 
and out of user space:


---8---
# dd if=/dev/zero of=/dev/null bs=1M count=10240
...
10737418240 bytes (11 GB) copied, 1.34667 s, 8.0 GB/s
# dd if=/dev/zero of=/dev/null bs=1G count=10
...
10737418240 bytes (11 GB) copied, 2.32565 s, 4.6 GB/s
---8---


___
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user