Just a quick guess - is it possible you ran out of file descriptors/connections on the nodes or on a firewall on the way? I’ve seen this behaviour the other way around - when too many RBD devices were connected to one client. It would explain why it seems to work but hangs when the device is used.
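If you want to rule that out, a quick sketch of what to look at on a client or OSD node (standard Linux paths; the `ceph-osd` process name is an assumption, substitute whatever daemon you suspect):

```shell
# Per-process soft limit on open files for the current shell
ulimit -n

# System-wide file handle usage: allocated, unused, maximum
cat /proc/sys/fs/file-nr

# Open descriptors held by a running ceph-osd (needs a live process,
# so left commented out here):
# ls /proc/$(pidof -s ceph-osd)/fd | wc -l

# Rough count of established TCP connections, if ss is available
command -v ss >/dev/null && ss -t state established | wc -l || true
```

If the per-process count is close to the `ulimit -n` value, or `file-nr` is near its maximum, descriptor exhaustion is a plausible culprit. A stateful firewall in the path has its own connection table to check as well.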
Jan

> On 24 Jun 2015, at 19:32, Robert LeBlanc <[email protected]> wrote:
>
> I would make sure that your CRUSH rules are designed for such a failure.
> We currently have two racks and can suffer a one-rack loss without
> blocking I/O. Here is what we do:
>
>     rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step choose firstn 2 type rack
>         step chooseleaf firstn 2 type host
>         step emit
>     }
>
> All pools are size=4 and min_size=2.
>
> This puts only two copies in each rack so that only half of the objects
> can be taken down by a rack loss. We also configure Ceph with
> "mon_osd_downout_subtree_limit = host" so that it won't automatically
> mark a whole rack out (not that it would do a whole lot in our current
> two-rack config).
>
> Our network failure domain (dual Ethernet switches) is two racks, so our
> next failure domain is what we call a PUD, i.e. 2 racks. The 3-4 rack
> configuration is similar to the above with the choose step changed to
> pud. Once we get to our 5th rack of storage, our config changes to:
>
>     rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type pud
>         step emit
>     }
>
> All pools are size=3 and min_size=2.
>
> In this configuration, only one copy is kept per PUD and we can lose two
> racks in a PUD without blocking I/O in our cluster.
>
> Under the default CRUSH rules, it is possible to get two objects in one
> rack. What does `ceph osd crush rule dump` show?
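On a live cluster you would run `ceph osd crush rule dump` directly; the JSON below is a hypothetical, abridged stand-in with roughly the shape that command emits, mirroring the rack-aware rule quoted above, so the check itself can be illustrated:

```shell
# Hypothetical excerpt of `ceph osd crush rule dump` output for a
# rack-aware rule; on a real cluster, replace the cat with the command.
cat <<'EOF' > /tmp/rule_dump.json
[
  {
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "min_size": 1,
    "max_size": 10,
    "steps": [
      { "op": "take", "item_name": "default" },
      { "op": "choose_firstn", "num": 2, "type": "rack" },
      { "op": "chooseleaf_firstn", "num": 2, "type": "host" },
      { "op": "emit" }
    ]
  }
]
EOF

# A rule that fans out across racks before choosing hosts shows a choose
# step with "type": "rack"; a default rule typically has only a single
# chooseleaf step on "host", which is how two copies can land in one rack.
grep -c '"type": "rack"' /tmp/rule_dump.json
# → 1
```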
>
> ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton
> <[email protected]> wrote:
> On 06/24/15 14:44, Romero Junior wrote:
>> Hi,
>>
>> We are setting up a test environment using Ceph as the main storage
>> solution for my QEMU-KVM virtualization platform, and everything works
>> fine except for the following:
>>
>> When I simulate a failure by powering off the switches on one of our
>> three racks, my virtual machines get into a weird state; the
>> illustration might help you to fully understand what is going on:
>> http://i.imgur.com/clBApzK.jpg
>>
>> The PGs are distributed based on racks; these are not the default
>> CRUSH rules.
>
> What is `ceph -s` reporting while you are in this state?
>
> 16000 PGs might be a problem: when your rack goes down, if your CRUSH
> rules distribute PGs based on rack, with size = 2 approximately 2/3 of
> your PGs should be in a degraded state. This means that ~10666 PGs will
> have to copy data to get back to an active+clean state. Your two other
> racks will then be really busy. You can probably tune the recovery
> processes to avoid too much interference with your normal VM I/O.
>
> You didn't tell us where the monitors are placed (and there are only 2
> on your illustration, which means any one of them being unreachable
> will bring down your cluster).
>
> That said, I'm not sure that having a failure domain at the rack level
> when you only have 3 racks is a good idea. What you end up with when a
> switch fails is a reconfiguration of two thirds of your cluster, which
> is not desirable in any case. If possible, either distribute the
> hardware in more racks (4 racks: only 1/2 of your data will be
> affected; 5 racks: only 2/5; ...) or make the switches redundant (each
> server with OSDs connected to 2 switches, ...).
>
> Note that with 33 servers per rack, 3 OSDs per server and 3 racks you
> have approximately 300 disks. With so many disks, size=2 is probably
> too low to get to a negligible probability of losing data (even if the
> failure case is 2 amongst 100 and not 300). With only ~20 disks we
> already got near a 2-simultaneous-failure once (admittedly it was the
> combination of hardware and human error in the earlier days of our
> cluster). We currently have one failed disk and one giving signs
> (erratic performance) of hardware problems in a span of a few weeks.
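The recovery tuning Lionel mentions usually comes down to a few standard OSD options; a minimal ceph.conf sketch (the values here are deliberately conservative examples, not recommendations for any specific cluster):

```ini
[osd]
; Limit concurrent backfill operations per OSD
osd max backfills = 1
; Limit simultaneous active recovery ops per OSD
osd recovery max active = 1
; Deprioritize recovery traffic relative to client I/O
osd recovery op priority = 1
```

The same options can be changed at runtime with something like `ceph tell osd.* injectargs '--osd-max-backfills 1'`, which avoids restarting daemons while a recovery storm is in progress.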
>
> Best regards,
>
> Lionel
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
