Thanks Christian,

I'm using a pool with size 3, min_size 1.

I can see the cluster serving I/O in a degraded state after the OSD is marked down, but the problem we have is in the interval between the OSD failure and the moment when that OSD is actually marked down.

In that interval (which can take up to 10 minutes), all I/O operations directed to that OSD are blocked, so every virtual machine using the RBDs provided by the cluster hangs until the failed OSD is finally marked down.

Is this the expected behavior of the cluster during a failure?

Is it possible to shorten that interval so I/O operations don't stay blocked for so long?
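For what it's worth, the options that seem to govern how quickly a dead OSD gets reported and marked down are the heartbeat settings. This is only a sketch of what I am considering trying; the values and stated defaults are my assumptions, not tested advice:

```ini
[osd]
# How often OSDs ping their peers, in seconds (default is 6, as far as I know)
osd heartbeat interval = 3
# How long a peer may stay silent before being reported down (default 20, I believe)
osd heartbeat grace = 10

[mon]
# How many distinct OSDs must report a peer down before the monitors mark it down
mon osd min down reporters = 1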

Thanks,

On 11/04/2016 07:25 PM, Christian Wuerdig wrote:
What are your pool size and min_size settings? An object with fewer than min_size replicas will not receive I/O (http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas). So if size=2 and min_size=2, then an OSD failure means blocked I/O to all objects located on the failed OSD until they have been replicated again.
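You can inspect and adjust these per pool with the `ceph osd pool` commands; a quick sketch (the pool name "rbd" here is just an assumption, substitute your own):

```shell
# Show the replication settings of a pool
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# With size=3, setting min_size=1 keeps I/O flowing as long as one replica is up
# (at the cost of a wider window for data loss if another OSD fails)
ceph osd pool set rbd min_size 1
```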

On Sat, Nov 5, 2016 at 9:04 AM, fcid <[email protected]> wrote:

    Dear ceph community,

    I'm working on a small ceph deployment for testing purposes, in
    which I want to test the high availability features of Ceph and
    how clients are affected during outages in the cluster.

    This small cluster is deployed on 3 servers, each running 2 OSDs
    and 1 monitor, and we are using it to serve RADOS block devices to
    KVM hypervisors on other hosts. The ceph software was installed
    using ceph-deploy.

    For HA testing we are simulating disk failures by physically
    detaching OSD disks from the servers and also by cutting power to
    the servers we want to fail.

    I have some doubts regarding the behavior during OSD and disk
    failures under light workloads.

    During disk failures, the cluster takes a long time to promote a
    secondary OSD to primary, thus blocking all disk operations of the
    virtual machines using RBD until the cluster map is updated with
    the failed OSD (which can take up to 10 minutes in our cluster).
    Is this the expected behavior of the OSD cluster, or should disk
    failures be transparent to clients?

    Thanks in advance, kind regards.

    Configuration and version of our ceph cluster:

    root@ceph00:~# cat /etc/ceph/ceph.conf
    [global]
    fsid = 440fce60-3097-4f1c-a489-c170e65d8e09
    mon_initial_members = ceph00
    mon_host = 192.168.x1.x1
    auth_cluster_required = cephx
    auth_service_required = cephx
    auth_client_required = cephx
    public network = 192.168.x.x/x
    cluster network = y.y.y.y/y
    [osd]
    osd mkfs options = -f -i size=2048 -n size=64k
    osd mount options xfs = inode64,noatime,logbsize=256k
    osd journal size = 20480
    filestore merge threshold = 40
    filestore split multiple = 8
    filestore xattr use omap = true

    root@ceph00:~# ceph -v
    ceph version 10.2.3

-- Fernando Cid O.
    Ingeniero de Operaciones
    AltaVoz S.A.
    http://www.altavoz.net
    Viña del Mar, Valparaiso:
     2 Poniente 355 of 53
    +56 32 276 8060
    Santiago:
     San Pío X 2460, oficina 304, Providencia
    +56 2 2585 4264

    _______________________________________________
    ceph-users mailing list
    [email protected]
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



