Just a quick guess - is it possible you ran out of file descriptors/connections 
on the nodes or on a firewall along the way? I’ve seen this behaviour the other 
way around - when too many RBD devices were connected to one client. It would 
explain why the device seems to attach but hangs when it is used.
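If you want to rule that out, this is roughly what I'd check first (a sketch; here `$$`, the current shell's PID, stands in for the PID of the daemon or client you suspect):

```shell
# Per-process limit on open file descriptors for the current shell
ulimit -n

# System-wide file handle usage: allocated, free, maximum
cat /proc/sys/fs/file-nr

# Descriptors actually open by a given process
# ($$ is just the current shell as a stand-in; use the OSD/client PID)
ls /proc/$$/fd | wc -l
```

Comparing the per-process count against the limit tells you quickly whether you are anywhere near exhaustion.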

Jan

> On 24 Jun 2015, at 19:32, Robert LeBlanc <[email protected]> wrote:
> 
> 
> I would make sure that your CRUSH rules are designed for such a failure. We 
> currently have two racks and can suffer a one rack loss without blocking I/O. 
> Here is what we do:
> 
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step choose firstn 2 type rack
>         step chooseleaf firstn 2 type host
>         step emit
> }
> All pools are size=4 and min_size=2
> 
> This puts only two copies in each rack so that a rack loss can take down 
> only half of the copies of any object. We also configure Ceph with 
> "mon_osd_downout_subtree_limit = host" so that it won't automatically mark a 
> whole rack out (not that it would do a whole lot in our current 2-rack 
> config).
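In ceph.conf terms that is just the following (a minimal sketch; the option sits fine in [global], and "host" means nothing larger than a single host gets auto-marked out):

```ini
[global]
# never automatically mark out anything larger than a single host,
# e.g. when a whole rack drops off the network
mon_osd_downout_subtree_limit = host
```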
> 
> Our network failure domain (dual Ethernet switches) spans two racks, so our 
> next failure domain is what we call a PUD, i.e. 2 racks. The 3-4 rack 
> configuration is similar to the above with the choose changed to pud. Once 
> we get to our 5th rack of storage, our config changes to:
> 
> rule replicated_ruleset {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type pud
>         step emit
> }
> All pools are size=3 and min_size=2
> 
> In this configuration, only one copy is kept per PUD and we can lose two 
> racks in a PUD without blocking I/O in our cluster.
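The pool-level settings are applied per pool, along these lines (a sketch; "rbd" is a placeholder pool name, and crush_ruleset selects rule 0 above - that setting name is from the Hammer-era CLI):

```shell
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
ceph osd pool set rbd crush_ruleset 0
```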
> 
> Under the default CRUSH rules, it is possible for two copies of an object to 
> end up in the same rack. What does `ceph osd crush rule dump` show?
> 
> 
> ----------------
> Robert LeBlanc
> GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Wed, Jun 24, 2015 at 7:44 AM, Lionel Bouton 
> <[email protected]> wrote:
> On 06/24/15 14:44, Romero Junior wrote:
>> Hi,
>> 
>> We are setting up a test environment using Ceph as the main storage 
>> solution for our QEMU-KVM virtualization platform, and everything works 
>> fine except for the following:
>> 
>> When I simulate a failure by powering off the switches on one of our three 
>> racks, my virtual machines get into a weird state; the illustration might 
>> help you fully understand what is going on: 
>> http://i.imgur.com/clBApzK.jpg
>> 
>> The PGs are distributed based on racks; these are not the default CRUSH 
>> rules.
>> 
> 
> What does `ceph -s` report while you are in this state?
> 
> 16000 PGs might be a problem: when your rack goes down, if your CRUSH rules 
> distribute PGs based on rack, with size = 2 approximately 2/3 of your PGs 
> should be in a degraded state. This means that ~10666 PGs will have to 
> copy data to get back to an active+clean state. Your two other racks will 
> then be very busy. You can probably tune the recovery processes to avoid 
> too much interference with your normal VM I/O.
> You didn't say where the monitors are placed (and there are only 2 in your 
> illustration, which means any one of them becoming unreachable will bring 
> down your cluster).
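For the recovery tuning, these are the usual knobs (values here are illustrative, not recommendations; option names as of the Hammer era, changed at runtime with injectargs):

```shell
# throttle backfill/recovery so client I/O keeps some headroom
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
```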
> 
> That said, I'm not sure that having a failure domain at the rack level when 
> you only have 3 racks is a good idea. What you end up with when a switch 
> fails is a reconfiguration of two thirds of your cluster, which is not 
> desirable in any case. If possible, either distribute the hardware across 
> more racks (4 racks: only 1/2 of your data will be affected; 5 racks: only 
> 2/5; ...) or make the switches redundant (each server with OSDs connected 
> to 2 switches, ...).
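The 2/3, 1/2 and 2/5 fractions fall out of a simple count. A sketch (assuming each PG's copies land in distinct racks, which is what a rack-level rule enforces):

```python
from math import comb

def affected_fraction(n_racks: int, size: int) -> float:
    """Fraction of PGs with at least one copy in a given failed rack,
    assuming each PG places its copies in `size` distinct racks."""
    # A PG avoids the failed rack only if all `size` racks it uses
    # are drawn from the remaining n_racks - 1.
    return 1 - comb(n_racks - 1, size) / comb(n_racks, size)

# size=2: 3 racks -> 2/3 affected, 4 racks -> 1/2, 5 racks -> 2/5
for n in (3, 4, 5):
    print(n, affected_fraction(n, 2))
```

With 16000 PGs and 3 racks this gives roughly 10666 PGs needing recovery, matching the estimate above.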
> 
> Note that with 33 servers per rack, 3 OSDs per server and 3 racks you have 
> approximately 300 disks. With so many disks, size=2 is probably too low to 
> get a negligible probability of losing data (even if the failure case is 2 
> amongst 100 and not 300). With only ~20 disks we already came close to 2 
> simultaneous failures once (admittedly it was a combination of hardware and 
> human error in the early days of our cluster). We currently have one failed 
> disk and one showing signs of hardware problems (erratic performance) in a 
> span of a few weeks.
> 
> Best regards,
> 
> Lionel
> 
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
