If you had set min_size to 1, you would not have seen the writes pause. A min_size of 1 is dangerous, though, because it means you are one hard disk failure away from losing the objects within that placement group entirely. A min_size of 2 is generally considered the minimum you want; many people ignore that advice, and some wish they hadn't.
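
If you want to double-check or change the setting once the cluster has recovered, the usual commands are as follows (a sketch assuming the pool is named rbd, as in the output quoted below):

ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 2

The first prints the current value; the second raises it back to 2 if it has been lowered.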
On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <[email protected]> wrote:
> Thanks everyone for the replies. Very informative. However, should I
> have expected writes to pause if I'd had min_size set to 1 instead of 2?
>
> And yes, I was under the false impression that my RBD device was a
> single object. That explains what all those other things are on a test
> cluster where I only created a single object!
>
> --
> Adam Carheden
>
> On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > This is because of the min_size specification. I would bet you have it
> > set at 2 (which is good).
> >
> > ceph osd pool get rbd min_size
> >
> > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> > from each host) results in some of the objects having only 1 replica.
> > min_size dictates that I/O freezes for those objects until min_size is
> > achieved.
> > http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> >
> > I can't tell if you're under the impression that your RBD device is a
> > single object. It is not. It is chunked up into many objects and spread
> > throughout the cluster, as Kjetil mentioned earlier.
> >
> > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <[email protected]> wrote:
> >
> >     Hi,
> >
> >     rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> >     will get you a "prefix", which then gets you on to
> >     rbd_header.<prefix>. rbd_header.<prefix> contains block size,
> >     striping, etc. The actual data-bearing objects will be named
> >     something like rbd_data.<prefix>.%016x.
> >
> >     Example: vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> >     <block size> of that image will be named
> >     rbd_data.86ce2ae8944a.000000000000, the second <block size> will be
> >     rbd_data.86ce2ae8944a.000000000001, and so on. Chances are that one
> >     of these objects is mapped to a PG which has both host3 and host4
> >     among its replicas.
> >
> >     An RBD image will end up scattered across most/all OSDs of the pool
> >     it's in.
> >
> >     Cheers,
> >     -KJ
> >
> >     On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <[email protected]> wrote:
> >
> >         I have a 4-node cluster, shown by `ceph osd tree` below. Monitors are
> >         running on hosts 1, 2 and 3. It has a single replicated pool of size
> >         3. I have a VM with its hard drive replicated to OSDs 11 (host3),
> >         5 (host1) and 3 (host2).
> >
> >         I can 'fail' any one host by disabling the SAN network interface, and
> >         the VM keeps running with a simple slowdown in I/O performance, just
> >         as expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on
> >         the VM (i.e. `df` never completes, etc.). The monitors on hosts 1
> >         and 2 still have quorum, so that shouldn't be an issue. The placement
> >         group still has 2 of its 3 replicas online.
> >
> >         Why does I/O hang even though host4 isn't running a monitor and
> >         doesn't have anything to do with my VM's hard drive?
> >
> >
> >         Size?
> >         # ceph osd pool get rbd size
> >         size: 3
> >
> >         Where's rbd_id.vm-100-disk-1?
> >         # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
> >         got osdmap epoch 1043
> >         osdmaptool: osdmap file '/tmp/map'
> >         object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> >
> >         # ceph osd tree
> >         ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >         -1 8.06160 root default
> >         -7 5.50308     room A
> >         -3 1.88754         host host1
> >          4 0.40369             osd.4        up  1.00000          1.00000
> >          5 0.40369             osd.5        up  1.00000          1.00000
> >          6 0.54008             osd.6        up  1.00000          1.00000
> >          7 0.54008             osd.7        up  1.00000          1.00000
> >         -2 3.61554         host host2
> >          0 0.90388             osd.0        up  1.00000          1.00000
> >          1 0.90388             osd.1        up  1.00000          1.00000
> >          2 0.90388             osd.2        up  1.00000          1.00000
> >          3 0.90388             osd.3        up  1.00000          1.00000
> >         -6 2.55852     room B
> >         -4 1.75114         host host3
> >          8 0.40369             osd.8        up  1.00000          1.00000
> >          9 0.40369             osd.9        up  1.00000          1.00000
> >         10 0.40369             osd.10       up  1.00000          1.00000
> >         11 0.54008             osd.11       up  1.00000          1.00000
> >         -5 0.80737         host host4
> >         12 0.40369             osd.12       up  1.00000          1.00000
> >         13 0.40369             osd.13       up  1.00000          1.00000
> >
> >
> >         --
> >         Adam Carheden
> >         _______________________________________________
> >         ceph-users mailing list
> >         [email protected]
> >         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> >     --
> >     Kjetil Joergensen <[email protected]>
> >     SRE, Medallia Inc
> >     Phone: +1 (650) 739-6580
> >
> >     _______________________________________________
> >     ceph-users mailing list
> >     [email protected]
> >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Respectfully,
> >
> > Wes Dillingham
> > [email protected]
> > Research Computing | Infrastructure Engineer
> > Harvard University | 38 Oxford Street, Cambridge, MA 02138 | Room 210
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Respectfully,

Wes Dillingham
[email protected]
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, MA 02138 | Room 210
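
P.S. For anyone finding this thread later: the same mapping exercise can be done for the actual data objects rather than the rbd_id.* meta object. A sketch, reusing the pool/image names and the 86ce2ae8944a prefix from Kjetil's example above:

rbd info rbd/vm-100-disk-1 | grep block_name_prefix
ceph osd map rbd rbd_data.86ce2ae8944a.000000000000
osdmaptool --pool 0 --test-map-object rbd_data.86ce2ae8944a.000000000000 /tmp/map

The first command prints the data-object prefix for the image, the second maps one data object to its PG and OSDs on the live cluster, and the third does the same offline against the saved osdmap, exactly as was done for rbd_id.vm-100-disk-1 above.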
