[DRBD-user] Strange behavior with non-default ping timeout since 8.3.11
Hi all,

We use DRBD 8.3.11 as a dual-primary in a Pacemaker (1.0.x) cluster. In our setup we need a somewhat larger ping-timeout (2s) because of interruptions during a firewall restart. That used to work well with 8.3.10, but since 8.3.11 it causes crm resource stop/start sequences to fail. A git bisect showed that this effect appears with

  http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=a0c9e5442e3be2d17772f50e1cf1d714cbddc51d

It seems that Pacemaker executes the sequence "drbdadm up" + "drbdadm primary" rather quickly. If the "drbdadm primary" happens while DRBD is still waiting for the connection to be established (WFConnection), the resource startup fails: a split-brain is detected, and automatic resolution then fails because by that time both sides are already primary. The patch above prolongs the window during which the problem may occur: with the old 100ms connection timeout it was rather unlikely to happen, with a 2s timeout it is almost guaranteed.

We were able to reproduce the problem with ping-timeout 20 on a running dual-primary with

  drbdadm down res; drbdadm up res; drbdadm primary res

This sequence, however, works:

  drbdadm down res; drbdadm up res; drbdadm wait-connect res; drbdadm primary res

Our test setup was a 3.0.41 kernel running DRBD 8.3.13 under KVM.

Putting this test

  [ $rc = $OCF_SUCCESS ] && drbdadm wait-connect $DRBD_RESOURCE

into the drbd_start function of the RA seems to work for us.

Ciao Andi

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
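For reference, the ping-timeout discussed in the post above lives in the resource's net section of drbd.conf; a minimal sketch, assuming a resource named r0 (the name is illustrative, only the timeout value is taken from the post; the unit is tenths of a second):

```
resource r0 {
  net {
    ping-timeout 20;   # 2.0 seconds, in units of 0.1s
  }
  # ... disk, on <host> sections etc. omitted ...
}
```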
[DRBD-user] drbdsetup occasionally hangs
Hi,

We have DRBD 8.3.11 or 8.3.13 dual-primary on a Pacemaker cluster running on kernel 3.0.41. The cluster just does its work, nothing is stopped or started, and then, after a week or so, we get a drbdsetup lock-up (associated with the kernel trace below) when we want to administer a resource. Usually only one of several resources is affected, sometimes even two. We have seen several such traces, with different drbdsetup sub-commands, all ending at the same place.

Could this be the problem addressed by

  http://git.drbd.org/gitweb.cgi?p=drbd-8.4.git;a=commit;h=c586d79e49135831dbe0629e2d9a7b3739c615ef
  "Fix comparison of is_valid_transition()'s return code"

in 8.4? We fiddled that patch into an 8.3.13, which is currently running on a test machine, but since the problem only appears now and then it is hard to say whether the problem is gone.

Does anyone have an idea how to get into this state?

TIA Andi

---8<---
2012 Sep 10 17:17:01 cnode1 [609601.848157] INFO: task drbdsetup:5670 blocked for more than 120 seconds.
2012 Sep 10 17:17:01 cnode1 [609601.848160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2012 Sep 10 17:17:01 cnode1 [609601.848162] drbdsetup D 0 5670 1 0x0004
2012 Sep 10 17:17:01 cnode1 [609601.848166] 88000f423968 0082 88003ffd7c00 88000f423fd8
2012 Sep 10 17:17:01 cnode1 [609601.848170] 88000f423838 00012340 00012340 00012340
2012 Sep 10 17:17:01 cnode1 [609601.848173] 00012340 00012340 88000ee045c0 00012340
2012 Sep 10 17:17:01 cnode1 [609601.848177] Call Trace:
2012 Sep 10 17:17:01 cnode1 [609601.852026] [8103960c] ? spin_unlock_irqrestore+0x9/0xb
2012 Sep 10 17:17:01 cnode1 [609601.880322] [810416d6] ? __wake_up+0x43/0x50
2012 Sep 10 17:17:01 cnode1 [609601.884293] [a03a745f] ? put_ldev+0x85/0x8a [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.916943] [a03a7be5] ? is_valid_state+0x73/0x1e3 [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.916953] [a03a698f] ? spin_unlock_irqrestore+0x9/0xb [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.916969] [a03a7e22] ? _req_st_cond+0xcd/0xdf [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.919191] [815ad428] schedule+0x44/0x46
2012 Sep 10 17:17:01 cnode1 [609601.919208] [a03aadb2] drbd_req_state+0x1b6/0x2df [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.919224] [8105f3cc] ? wake_up_bit+0x23/0x23
2012 Sep 10 17:17:01 cnode1 [609601.919241] [a03aaefd] _drbd_request_state+0x22/0xb2 [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.919252] [810bbcb6] ? zone_statistics+0x77/0x7e
2012 Sep 10 17:17:01 cnode1 [609601.920356] [810ab9da] ? set_page_refcounted+0xd/0x1a
2012 Sep 10 17:17:01 cnode1 [609601.920401] [810ade41] ? get_page_from_freelist+0x58b/0x64d
2012 Sep 10 17:17:01 cnode1 [609601.920446] [a03b1895] drbd_nl_invalidate+0xa1/0x133 [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.920462] [a03b1c1d] drbd_connector_callback+0x104/0x195 [drbd]
2012 Sep 10 17:17:01 cnode1 [609601.924378] [a026446a] cn_rx_skb+0xb0/0xd2 [cn]
2012 Sep 10 17:17:01 cnode1 [609601.936338] [81514227] netlink_unicast+0xe2/0x14b
2012 Sep 10 17:17:01 cnode1 [609601.963889] [814f1ea6] ? memcpy_fromiovec+0x42/0x73
2012 Sep 10 17:17:01 cnode1 [609601.963897] [8151545c] netlink_sendmsg+0x230/0x250
2012 Sep 10 17:17:01 cnode1 [609601.963909] [814e71c1] __sock_sendmsg_nosec+0x55/0x62
2012 Sep 10 17:17:01 cnode1 [609601.963913] [814e8456] __sock_sendmsg+0x39/0x42
2012 Sep 10 17:17:01 cnode1 [609601.963917] [814e8c2e] sock_sendmsg+0xa3/0xbc
2012 Sep 10 17:17:01 cnode1 [609601.963920] [810c1137] ? handle_pte_fault+0x2ef/0x843
2012 Sep 10 17:17:01 cnode1 [609601.963924] [810c1e32] ? handle_mm_fault+0x19c/0x1b3
2012 Sep 10 17:17:01 cnode1 [609601.963936] [810eedbe] ? fget_light+0x2f/0x7c
2012 Sep 10 17:17:01 cnode1 [609601.963939] [814e8c71] ? sockfd_lookup_light+0x1b/0x53
2012 Sep 10 17:17:01 cnode1 [609601.963943] [814e91b6] sys_sendto+0xfa/0x11f
2012 Sep 10 17:17:01 cnode1 [609601.963946] [8151355b] ? netlink_table_ungrab+0x2e/0x30
2012 Sep 10 17:17:01 cnode1 [609601.963949] [81515609] ? netlink_bind+0x106/0x11c
2012 Sep 10 17:17:01 cnode1 [609601.963952] [814e9c33] ? sys_bind+0x7d/0x91
2012 Sep 10 17:17:01 cnode1 [609601.963955] [810ebd14] ? spin_lock+0x9/0xb
2012 Sep 10 17:17:01 cnode1 [609601.963960] [815b3a92] system_call_fastpath+0x16/0x1b
---8<---
Re: [DRBD-user] Cluster filesystem question
On 29.11.2011 21:17, Florian Haas wrote:
> On Mon, Nov 28, 2011 at 9:26 PM, Lars Ellenberg
> As this sort of issue currently pops up on IRC every other day, I've just posted this rant:
> http://fghaas.wordpress.com/2011/11/29/dual-primary-drbd-iscsi-and-multipath-dont-do-that/

Yes Florian, I did get that from Lars' response. I still want to understand what the actual problems are. And no, Google does not actually help in this regard. If I knew what questions to ask and where to look, I could beg or bribe the right people, or even find the right code to patch.

Ciao Andi
Re: [DRBD-user] Cluster filesystem question
On 30.11.2011 00:15, John Lauro wrote:
> The problem is that part of iSCSI is a client saying, "I want exclusive access to certain block(s)", and then no other client can access those blocks.

Yes, that reservation stuff is something a cluster-aware iSCSI target would have to synchronize among the nodes. However, the original claim I was wondering about was that multipath would not work with current iSCSI targets on Linux.

Ciao Andi
Re: [DRBD-user] Cluster filesystem question
On 30.11.2011 02:08, John Lauro wrote:
> Andreas, what happens if you block your two nodes from talking directly to each other, but allow the client to talk to both?

Mhh, it basically depends on what exactly happens when the split occurs.

If the writes on both nodes succeed, the client sees an acknowledgement from both nodes and will never retry. And since it is impossible to merge the changes when recovering from the split, the writes to one of the nodes would be lost.

If the write operations block until the split is resolved (by shooting one node), the client should eventually detect a failure and deal with it, i.e. retry the request with the good target.

Ciao Andi
Re: [DRBD-user] Cluster filesystem question
On 28.11.2011 21:26, Lars Ellenberg wrote:
> On Fri, Nov 25, 2011 at 09:11:39PM +0100, Andreas Hofmeister wrote:
>> On 25.11.2011 17:08, John Lauro wrote:
>> @list: did anybody try such a thing?
>
> dual-primary iscsi targets for multipath: does not work. iSCSI is a stateful protocol, there is more to it than just reads and writes. To run multipath (or multiple connections per session) against *DISTINCT* targets [*] on separate nodes ** you'd need to have cluster aware iSCSI targets ** which coordinate with each other in some fashion.

Mhh, I am just wondering what exactly could explode.

Surely one would have to synchronize the more static configuration aspects (ACLs, authentication etc.) in some way. Things like iSCSI reservations or limits on concurrent access to a single target are likely something that would require support from the target infrastructure. And indeed, it seems that none of the iSCSI targets on Linux supports that.

But for the pure read/write aspects, multipathing requires support from the initiator anyway, no? TCP, for example, cannot guarantee the ordering of packets across different TCP streams in the first place, so the order of simultaneous writes to different portals is pretty much undefined anyway (or a write on one path and a flush on the other)?

> To my knowledge, this does not exist (not for Linux, anyways).
>
> [*] which happen to live on top of data that, due to replication, happens to be the same, most of the time, unless the replication link was lost for whatever reason; in which case you absolutely want to make sure that at least one box reboots hard before it even thinks about completing or even submitting another IO request...

... which is just another reason to have proper I/O fencing or STONITH in place, no?

Ciao Andi
Re: [DRBD-user] DRBD inside KVM virtual machine
On 21.10.2011 12:00, Nick Morrison wrote:
> Am I mad? Should it work?

It does. We run DRBD + Pacemaker + OCFS2 inside KVM for testing. I would not use such a setup in production, though.

> Will performance suck compared with running DRBD directly on the physical machines? I understand I will probably have high CPU usage during DRBD syncing, as QEMU's IO (even virtio) will probably load up the CPU, but perhaps this will be minimal,

You will see noticeably higher latencies for both disk and network. macvtap may help a bit with the latter; without it, your network throughput would be limited too.

> or perhaps I can configure QEMU to let the VM guest talk very directly to the physical host's block device..

PCI passthrough is somewhat problematic. Some chipset/board/BIOS/kernel/PCIe-card/driver combinations may even work, but there is a good chance that you will see poor performance and/or unexplainable crashes. Unless you have a LOT of time to track down these problems, do not use PCI passthrough.

Ciao Andi
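For the network side mentioned above, a macvtap interface is declared in the libvirt domain XML as a "direct" interface; a minimal sketch, assuming the host NIC is eth1 (device name and mode are illustrative):

```xml
<interface type='direct'>
  <!-- attach the guest directly to the host NIC via macvtap -->
  <source dev='eth1' mode='bridge'/>
  <model type='virtio'/>
</interface>
```

Note that in 'bridge' mode the guest can talk to the outside network but not to its own host over this interface, which is usually fine for a DRBD replication link between two guests on different hosts.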
Re: [DRBD-user] Using names instead of IPs in resource 'address' entries
On 06.09.2011 19:07, Lars Ellenberg wrote:
> The basic problem: if the link goes away because of one node changing its IP, we cannot possibly do a dns lookup from kernel space... Well, we could. But that would be a good reason to kick us out of the kernel and never talk to us again ;)

Yes, if you did that in kernel space, I guess. But apparently there is an interface to let user space handle name lookups; see Documentation/networking/dns_resolver.txt in recent kernel sources. It seems to be used by CIFS and NFS.

Ciao Andi
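Per that kernel document, the lookup is delegated to user space through the request-key mechanism; roughly, a line like this wires up the upcall (the path of key.dns_resolver varies by distribution):

```
# /etc/request-key.conf
# type          description  callout-info  program
create dns_resolver  *  *  /sbin/key.dns_resolver %k
```

The kernel then calls request_key() with key type "dns_resolver", and the helper performs the actual DNS query in user space.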
Re: [DRBD-user] Directly connected GigE ports bonded together no switch
On 10.08.2011 19:04, Herman wrote:
> If this is the case, maybe arp monitoring is more reliable for direct connections since NIC failure (which may fail but still have link up) is more likely than cable failure? Maybe I don't have a good understanding of this.

With switches in between, ARP monitoring is a bit dangerous, because you either need a switch that answers the ARP queries (i.e. it must be manageable) or you need another machine that answers. But then, what happens when that other machine is down...

With direct connections between hosts this does not matter though: whether the other side fails to answer due to a broken cable, an exploded NIC or just a plain reboot is all the same.

> In addition, I tried to use scp to test the throughput through the bonded link, but I actually got almost the same results via active-backup as with balance-rr. Am I doing something wrong?

With SSH you basically measure the rate at which the hosts can en-/decrypt packets. Better try something like iperf, or some file server that does not encrypt traffic, NFS or FTP for example.

Ciao Andi
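A sketch of ARP monitoring for such a back-to-back link, as bonding module options (the peer address and interval are illustrative; arp_ip_target is the peer's IP on the direct link, arp_interval is in milliseconds):

```
# /etc/modprobe.d/bonding.conf -- sketch
options bonding mode=active-backup arp_interval=1000 arp_ip_target=192.168.10.2
```

For the throughput test, running `iperf -s` on one node and `iperf -c <peer>` on the other avoids the SSH encryption bottleneck.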
Re: [DRBD-user] Directly connected GigE ports bonded together no switch
On 09.08.2011 16:46, Herman wrote:
> Also, right now I'm using mode=active-backup. Would one of the other modes allow higher throughput and still allow automatic failover and transparency to DRBD?

Try round-robin in your situation; it is the only bonding mode that gives higher throughput for a single connection.

Ciao Andi
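Round-robin is selected with mode=balance-rr; a minimal sketch as bonding module options (miimon value is illustrative):

```
# /etc/modprobe.d/bonding.conf -- sketch
options bonding mode=balance-rr miimon=100
```

Be aware that balance-rr can deliver packets out of order, so a single TCP stream will usually not reach the full aggregate bandwidth of the slaves.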
Re: [DRBD-user] 8.3.11 on RHEL 5.7 w/ MTU of 9000 fails
On 23.07.2011 06:30, Digimer wrote:
> Any way to debug this? If it looks like a DRBD bug,

I don't think so. I have DRBD running with jumbo frames on several machines and it just works (albeit with 0.8.10). Check your networking.

Try "ping -c1 -M do -s 8192 <other node>"; that should tell you whether jumbo frames reach the other side or whether there is a problem on the IP layer.

If you get something like "From <other node> icmp_seq=1 Frag needed and DF set (mtu = <actual MTU>)", check your devices' MTU and check your routing, as it is actually possible to define an MTU per route. If you just get no answer, or the response is too short, check the specs of your network hardware. The supported size for jumbo frames varies widely, not just between vendors but also between NIC chips from the same vendor. In some cases, even different chips supported by the same driver differ in the maximum frame size.

Ciao Andi
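On RHEL, the MTU can be made persistent in the interface config rather than set by hand with ifconfig; a sketch (device name and address are illustrative):

```
# /etc/sysconfig/network-scripts/ifcfg-eth1 -- sketch
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.2.10
NETMASK=255.255.255.0
MTU=9000
ONBOOT=yes
```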
Re: [DRBD-user] 8.3.11 on RHEL 5.7 w/ MTU of 9000 fails
On 23.07.2011 21:54, Digimer wrote:
> Thanks for the reply Andy. The hardware (Intel Pro/1000 CT NICs [1] and a D-Link DGS-3100-24 [2]) both support 9kb+ frames (and JFs are enabled in the switch). With the MTU on either node's DRBD interface (eth1) set to 9000, I confirmed that JFs worked using a method very similar to what you suggested:
>
> --- an-node07.sn ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.686/0.686/0.686/0.000 ms
>
> node2:
> [root@an-node07 ~]# ifconfig eth1 mtu 9000
> [root@an-node07 ~]# ifconfig eth1
> eth1  Link encap:Ethernet  HWaddr 00:1B:21:72:96:F2
>       inet addr:192.168.2.77  Bcast:192.168.2.255  Mask:255.255.255.0
>       inet6 addr: fe80::21b:21ff:fe72:96f2/64 Scope:Link
>       UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>       RX packets:30593 errors:0 dropped:0 overruns:0 frame:0
>       TX packets:26756 errors:0 dropped:0 overruns:0 carrier:0
>       collisions:0 txqueuelen:1000
>       RX bytes:31161397 (29.7 MiB)  TX bytes:30838239 (29.4 MiB)
>       Interrupt:17 Memory:feae-feb0
> [root@an-node07 ~]# clear; tcpdump -i eth1 icmp
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth1, link-type EN10MB (Ethernet), capture size 96 bytes
> 15:45:10.474134 IP an-node06.sn > an-node07.sn: ICMP echo request, id 11554, seq 1, length 8908
> 15:45:10.475236 IP an-node07.sn > an-node06.sn: ICMP echo reply, id 11554, seq 1, length 8908

Ah. The '-M do' argument to ping just spares you the tcpdump.

> 1. http://www.intel.com/products/desktop/adapters/gigabit-ct/gigabit-ct-overview.htm
> 2. http://dlink.ca/products/?pid=DGS-3100-24

Mhh, I have a pair of nodes (actually with DRBD 0.8.11, not 0.8.10 as I thought) and Intel 82571EB NICs working rather well, but then these are connected back-to-back. We had some bad experience with a D-Link switch in the past. I don't remember the model, but we eventually scrapped it because of the troubles.

If you use Pacemaker + Corosync, you may want to check the netmtu setting in corosync.conf. That should not affect DRBD, though.

Ciao Andi
Re: [DRBD-user] Poor DRBD performance, HELP!
Hi,

On 21.06.2011 17:36, Felix Frank wrote:
> On 06/21/2011 05:15 PM, Noah Mehl wrote:
>> The results are the same :(
> Sorry to hear it.

'dd' is probably not a good benchmark when throughput approaches the 1 GByte/s range. With 'dd' on a system with similar performance (8x 15k SAS disks, LSI, 10GbE), I get 1 GByte/s onto the local block device but only 600 MByte/s onto the DRBD device. Using another benchmark (fio with async I/O and larger values for I/O depth), I can get up to about 1.2 GByte/s in both cases.

'dd' uses a simple read-write loop, which makes its throughput degrade with latency. Even though DRBD only adds the network latency, that added latency becomes significant compared to the latency of the actual disk writes.

Also, Noah probably used far too big buffers for dd. There seems to be significant overhead when the kernel has to move 1 GByte chunks in and out of user space:

---8<---
# dd if=/dev/zero of=/dev/null bs=1M count=10240
...
10737418240 bytes (11 GB) copied, 1.34667 s, 8.0 GB/s
# dd if=/dev/zero of=/dev/null bs=1G count=10
...
10737418240 bytes (11 GB) copied, 2.32565 s, 4.6 GB/s
---8<---
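The fio benchmark described above (async I/O with a deeper queue) can be sketched as a job file; device path, size and queue depth are illustrative, and writing to the device destroys its data:

```ini
; sketch of an async sequential-write job, run with: fio seq-write.fio
[drbd-seq-write]
filename=/dev/drbd0   ; WARNING: overwrites the device
rw=write
bs=1M
direct=1              ; bypass the page cache
ioengine=libaio       ; async I/O, so many requests are in flight at once
iodepth=32            ; queue depth hides the added DRBD network latency
size=4g
```

Keeping many requests in flight is what hides the per-request latency that throttles dd's one-at-a-time loop.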