Thanks a lot for the reply. To eliminate the issues of the missing root
and the duplicate entries in the CRUSH map, I have updated my CRUSH map.
It now has a default root, and the hierarchy no longer contains duplicate
entries.
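
For reference, the hierarchy and the per-host rules were rebuilt with the
standard CRUSH CLI, roughly along the lines below (illustrative rather than
my exact command history; shown for one host and repeated for the other
two):

# ceph osd crush add-bucket ip-10-0-9-233-rack rack
# ceph osd crush move ip-10-0-9-233-rack root=default
# ceph osd crush move ip-10-0-9-233 rack=ip-10-0-9-233-rack
# ceph osd crush rule create-simple ip-10-0-9-233_ruleset ip-10-0-9-233-rack host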

I have now created one pool local to host "ip-10-0-9-233" and another pool
local to host "ip-10-0-9-126", using the respective CRUSH rules pasted
below. After host "ip-10-0-9-233" became full, requests to write new keys
to the pool local to host "ip-10-0-9-126" timed out. From the "ceph pg dump"
output I can see that the PGs of each pool are stored only on their
respective hosts, so PG interference across pools does not seem to be the
issue, at least as far as I can tell (see the commands sketched just below).
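
For completeness, the two pools were created against those rules with the
usual pool-create syntax, roughly as follows (illustrative; pg counts as in
the osd dump below):

# ceph osd pool create ip-10-0-9-233-pool 128 128 replicated ip-10-0-9-233_ruleset
# ceph osd pool create ip-10-0-9-126-pool 128 128 replicated ip-10-0-9-126_ruleset

and the placement of an individual object can be cross-checked with, e.g.:

# ceph osd map ip-10-0-9-126-pool hello1

which should point only at osd.1 on host "ip-10-0-9-126" and never at the
full osd.0.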

The purpose of keeping one pool local to a host is not data locality. The
use case is a single solution for both local and replicated data, so that
clients only need to know the pool name during read/write operations.

I am not sure whether this use case fits Ceph. So I am trying to determine
whether there is any option that makes Ceph understand that only one host
is full, so that it can keep serving new write requests as long as they do
not touch the full OSD.
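
The only related knobs I have found so far are the cluster-wide
full/nearfull ratios (mon_osd_full_ratio, or at runtime on Jewel something
like the command below), but as far as I understand they only move the
threshold for the whole cluster and do not scope the "full" condition to
the affected pool or OSD:

# ceph pg set_full_ratio 0.97

So that would merely postpone the problem rather than let writes continue
against the OSDs that still have free space.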


Test output:


# ceph osd dump

epoch 93
fsid 7a238d99-67ed-4610-540a-449043b3c24e
created 2017-08-16 09:34:15.580112
modified 2017-08-16 11:55:40.676234
flags sortbitwise,require_jewel_osds
pool 7 'ip-10-0-9-233-pool' replicated size 1 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 87 flags hashpspool stripe_width 0
pool 8 'ip-10-0-9-126-pool' replicated size 1 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 92 flags hashpspool stripe_width 0
max_osd 3


# ceph -s

    cluster 7a238d99-67ed-4610-540a-449043b3c24e
     health HEALTH_OK
     monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}
            election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,ip-10-0-9-250
     osdmap e93: 3 osds: 3 up, 3 in
            flags sortbitwise,require_jewel_osds
      pgmap v679: 256 pgs, 2 pools, 0 bytes data, 0 objects
            106 MB used, 134 GB / 134 GB avail
                 256 active+clean


# ceph osd tree

ID WEIGHT  TYPE NAME                    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.13197 root default
-5 0.04399     rack ip-10-0-9-233-rack
-3 0.04399         host ip-10-0-9-233
 0 0.04399             osd.0                 up  1.00000          1.00000
-7 0.04399     rack ip-10-0-9-126-rack
-6 0.04399         host ip-10-0-9-126
 1 0.04399             osd.1                 up  1.00000          1.00000
-9 0.04399     rack ip-10-0-9-250-rack
-8 0.04399         host ip-10-0-9-250
 2 0.04399             osd.2                 up  1.00000          1.00000


# ceph osd crush rule list

[
    "ip-10-0-9-233_ruleset",
    "ip-10-0-9-126_ruleset",
    "ip-10-0-9-250_ruleset",
    "replicated_ruleset"
]


# ceph osd crush rule dump ip-10-0-9-233_ruleset

{
    "rule_id": 0,
    "rule_name": "ip-10-0-9-233_ruleset",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "ip-10-0-9-233-rack"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}



# ceph osd crush rule dump ip-10-0-9-126_ruleset

{
    "rule_id": 1,
    "rule_name": "ip-10-0-9-126_ruleset",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -7,
            "item_name": "ip-10-0-9-126-rack"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}


# ceph osd crush rule dump replicated_ruleset

{
    "rule_id": 4,
    "rule_name": "replicated_ruleset",
    "ruleset": 4,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

# ceph -s

    cluster 7a238d99-67ed-4610-540a-449043b3c24e
     health HEALTH_ERR
            1 full osd(s)
            full,sortbitwise,require_jewel_osds flag(s) set
     monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}
            election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,ip-10-0-9-250
     osdmap e99: 3 osds: 3 up, 3 in
            flags full,sortbitwise,require_jewel_osds
      pgmap v920: 256 pgs, 2 pools, 42696 MB data, 2 objects
            44844 MB used, 93324 MB / 134 GB avail
                 256 active+clean


# ceph osd df

ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
 0 0.04399  1.00000 46056M 43801M  2255M 95.10 3.00 128
 1 0.04399  1.00000 46056M 36708k 46020M  0.08 0.00 128
 2 0.04399  1.00000 46056M 34472k 46022M  0.07 0.00   0
              TOTAL   134G 43870M 94298M 31.75
MIN/MAX VAR: 0.00/3.00  STDDEV: 44.80


# ceph df

GLOBAL:
    SIZE AVAIL  RAW USED %RAW USED
    134G 94298M   43870M     31.75
POOLS:
    NAME               ID USED   %USED MAX AVAIL OBJECTS
    ip-10-0-9-233-pool 7  43760M 95.10     2255M       1
    ip-10-0-9-126-pool 8      12     0    46020M       2


# rados -p ip-10-0-9-126-pool put hello1 world.txt

2017-08-16 14:34:02.740500 7f1ad820fa40  0 client.5008.objecter  FULL, paused modify 0x7f1ad87df7a0 tid 1
error putting ip-10-0-9-126-pool/hello1: (110) Connection timed out


Excerpt from "ceph pg dump":

up   up_primary   acting  acting_primary
0'0 102:8   [1] 1    [1] 1
0'0 102:13  [0] 0    [0] 0
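
The same check can also be done per pool, e.g. (assuming the pg ls-by-pool
subcommand available in Jewel):

# ceph pg ls-by-pool ip-10-0-9-126-pool

which lists every PG of that pool with its up/acting OSD sets and should
show only osd.1 for this pool.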




On Wed, Aug 16, 2017 at 1:54 PM, Luis Periquito <periqu...@gmail.com> wrote:

> Not going through the obvious of that crush map is just not looking
> correct or even sane... or that the policy itself doesn't sound very
> sane - but I'm sure you'll understand the caveats and issues it may
> present...
>
> what's most probably happening is that a (or several) pool is using
> those same OSDs and the requests to those PGs are also getting blocked
> because of the disk full. This turns that some (or all) of the
> remaining OSDs are waiting for that one to complete some IO, and
> whilst those OSDs have IOs waiting to complete it also stops
> responding to the IO that was only local.
>
> Adding more insanity to your architecture what should (the keyword
> here is should as I never tested, saw or even thought of such
> scenario) work would be OSDs to have local storage and OSDs to have
> distributed storage.
>
> As for the architecture itself, and not knowing much of your use-case,
> it may make sense to have local storage in something else than Ceph -
> you're not using any of the facilities it provides you, and having
> some overheads - or using a different strategy for it. IIRC there was
> a way to hint data locality to Ceph...
>
>
> On Wed, Aug 16, 2017 at 8:39 AM, Mandar Naik <mandar.p...@gmail.com>
> wrote:
> > Hi,
> > I just wanted to give a friendly reminder for this issue. I would
> appreciate
> > if someone
> > can help me out here. Also, please do let me know in case some more
> > information is
> > required here.
> >
> > On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik <mandar.p...@gmail.com>
> wrote:
> >>
> >> Hi Peter,
> >> Thanks a lot for the reply. Please find 'ceph osd df' output here -
> >>
> >> # ceph osd df
> >> ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
> >>  2 0.04399  1.00000 46056M 35576k 46021M  0.08 0.00   0
> >>  1 0.04399  1.00000 46056M 40148k 46017M  0.09 0.00 384
> >>  0 0.04399  1.00000 46056M 43851M  2205M 95.21 2.99 192
> >>  0 0.04399  1.00000 46056M 43851M  2205M 95.21 2.99 192
> >>  1 0.04399  1.00000 46056M 40148k 46017M  0.09 0.00 384
> >>  2 0.04399  1.00000 46056M 35576k 46021M  0.08 0.00   0
> >>               TOTAL   134G 43925M 94244M 31.79
> >> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
> >>
> >> I setup this cluster by manipulating CRUSH map using CLI. I had a
> default
> >> root
> >> before but it gave me an impression that since every rack is under a
> >> single
> >> root bucket its marking entire cluster down in case one of the osd is
> 95%
> >> full. So I
> >> removed root bucket but that still did not help me. No crush rule is
> >> referring
> >> to root bucket in the above mentioned case.
> >>
> >> Yes, I added one osd under two racks by linking host bucket from one
> rack
> >> to another
> >> using following command -
> >>
> >> "osd crush link <name> <args> [<args>...] :  link existing entry for
> >> <name> under location <args>"
> >>
> >>
> >> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney
> >> <peter.malo...@brockmann-consult.de> wrote:
> >>>
> >>> I think a `ceph osd df` would be useful.
> >>>
> >>> And how did you set up such a cluster? I don't see a root, and you have
> >>> each osd in there more than once...is that even possible?
> >>>
> >>>
> >>>
> >>> On 08/10/17 08:46, Mandar Naik wrote:
> >>>
> >>> Hi,
> >>>
> >>> I am evaluating ceph cluster for a solution where ceph could be used
> for
> >>> provisioning
> >>>
> >>> pools which could be either stored local to a node or replicated
> across a
> >>> cluster.  This
> >>>
> >>> way ceph could be used as single point of solution for writing both
> local
> >>> as well as replicated
> >>>
> >>> data. Local storage helps avoid possible storage cost that comes with
> >>> replication factor of more
> >>>
> >>> than one and also provide availability as long as the data host is
> alive.
> >>>
> >>>
> >>> So I tried an experiment with Ceph cluster where there is one crush
> rule
> >>> which replicates data across
> >>>
> >>> nodes and other one only points to a crush bucket that has local ceph
> >>> osd. Cluster configuration
> >>>
> >>> is pasted below.
> >>>
> >>>
> >>> Here I observed that if one of the disk is full (95%) entire cluster
> goes
> >>> into error state and stops
> >>>
> >>> accepting new writes from/to other nodes. So ceph cluster became
> unusable
> >>> even though it’s only
> >>>
> >>> 32% full. The writes are blocked even for pools which are not touching
> >>> the full osd.
> >>>
> >>>
> >>> I have tried playing around crush hierarchy but it did not help. So is
> it
> >>> possible to store data in the above
> >>>
> >>> manner with Ceph ? If yes could we get cluster state in usable state
> >>> after one of the node is full ?
> >>>
> >>>
> >>>
> >>> # ceph df
> >>>
> >>>
> >>> GLOBAL:
> >>>
> >>>    SIZE     AVAIL      RAW USED     %RAW USED
> >>>
> >>>    134G     94247M       43922M         31.79
> >>>
> >>>
> >>> # ceph –s
> >>>
> >>>
> >>>    cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
> >>>
> >>>     health HEALTH_ERR
> >>>
> >>>            1 full osd(s)
> >>>
> >>>            full,sortbitwise,require_jewel_osds flag(s) set
> >>>
> >>>     monmap e3: 3 mons at
> >>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:
> 6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
> >>>
> >>>            election epoch 14, quorum 0,1,2
> >>> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
> >>>
> >>>     osdmap e93: 3 osds: 3 up, 3 in
> >>>
> >>>            flags full,sortbitwise,require_jewel_osds
> >>>
> >>>      pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects
> >>>
> >>>            43922 MB used, 94247 MB / 134 GB avail
> >>>
> >>>                 384 active+clean
> >>>
> >>>
> >>> # ceph osd tree
> >>>
> >>>
> >>> ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >>>
> >>> -9 0.04399 rack ip-10-0-9-146-rack
> >>>
> >>> -8 0.04399     host ip-10-0-9-146
> >>>
> >>> 2 0.04399         osd.2                up  1.00000          1.00000
> >>>
> >>> -7 0.04399 rack ip-10-0-9-210-rack
> >>>
> >>> -6 0.04399     host ip-10-0-9-210
> >>>
> >>> 1 0.04399         osd.1                up  1.00000          1.00000
> >>>
> >>> -5 0.04399 rack ip-10-0-9-122-rack
> >>>
> >>> -3 0.04399     host ip-10-0-9-122
> >>>
> >>> 0 0.04399         osd.0                up  1.00000          1.00000
> >>>
> >>> -4 0.13197 rack rep-rack
> >>>
> >>> -3 0.04399     host ip-10-0-9-122
> >>>
> >>> 0 0.04399         osd.0                up  1.00000          1.00000
> >>>
> >>> -6 0.04399     host ip-10-0-9-210
> >>>
> >>> 1 0.04399         osd.1                up  1.00000          1.00000
> >>>
> >>> -8 0.04399     host ip-10-0-9-146
> >>>
> >>> 2 0.04399         osd.2                up  1.00000          1.00000
> >>>
> >>>
> >>> # ceph osd crush rule list
> >>>
> >>> [
> >>>
> >>>    "rep_ruleset",
> >>>
> >>>    "ip-10-0-9-122_ruleset",
> >>>
> >>>    "ip-10-0-9-210_ruleset",
> >>>
> >>>    "ip-10-0-9-146_ruleset"
> >>>
> >>> ]
> >>>
> >>>
> >>> # ceph osd crush rule dump rep_ruleset
> >>>
> >>> {
> >>>
> >>>    "rule_id": 0,
> >>>
> >>>    "rule_name": "rep_ruleset",
> >>>
> >>>    "ruleset": 0,
> >>>
> >>>    "type": 1,
> >>>
> >>>    "min_size": 1,
> >>>
> >>>    "max_size": 10,
> >>>
> >>>    "steps": [
> >>>
> >>>        {
> >>>
> >>>            "op": "take",
> >>>
> >>>            "item": -4,
> >>>
> >>>            "item_name": "rep-rack"
> >>>
> >>>        },
> >>>
> >>>        {
> >>>
> >>>            "op": "chooseleaf_firstn",
> >>>
> >>>            "num": 0,
> >>>
> >>>            "type": "host"
> >>>
> >>>        },
> >>>
> >>>        {
> >>>
> >>>            "op": "emit"
> >>>
> >>>        }
> >>>
> >>>    ]
> >>>
> >>> }
> >>>
> >>>
> >>> # ceph osd crush rule dump ip-10-0-9-122_ruleset
> >>>
> >>> {
> >>>
> >>>    "rule_id": 1,
> >>>
> >>>    "rule_name": "ip-10-0-9-122_ruleset",
> >>>
> >>>    "ruleset": 1,
> >>>
> >>>    "type": 1,
> >>>
> >>>    "min_size": 1,
> >>>
> >>>    "max_size": 10,
> >>>
> >>>    "steps": [
> >>>
> >>>        {
> >>>
> >>>            "op": "take",
> >>>
> >>>            "item": -5,
> >>>
> >>>            "item_name": "ip-10-0-9-122-rack"
> >>>
> >>>        },
> >>>
> >>>        {
> >>>
> >>>            "op": "chooseleaf_firstn",
> >>>
> >>>            "num": 0,
> >>>
> >>>            "type": "host"
> >>>
> >>>        },
> >>>
> >>>        {
> >>>
> >>>            "op": "emit"
> >>>
> >>>        }
> >>>
> >>>    ]
> >>>
> >>> }
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Mandar Naik.
> >>>
> >>>
> >>> _______________________________________________
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>> --
> >>>
> >>> --------------------------------------------
> >>> Peter Maloney
> >>> Brockmann Consult
> >>> Max-Planck-Str. 2
> >>> 21502 Geesthacht
> >>> Germany
> >>> Tel: +49 4152 889 300
> >>> Fax: +49 4152 889 333
> >>> E-mail: peter.malo...@brockmann-consult.de
> >>> Internet: http://www.brockmann-consult.de
> >>> --------------------------------------------
> >>
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Mandar Naik.
> >
> >
> >
> >
> > --
> > Thanks,
> > Mandar Naik.
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
Thanks,
Mandar Naik.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
