Hi,

Thank you for your input. I had a look at a couple of OSDs; the settings for
these values are the following:
# ceph daemon osd.8 config get bluestore_min_alloc_size
{
    "bluestore_min_alloc_size": "0"
}
# ceph daemon osd.8 config get  bluestore_min_alloc_size_ssd
{
    "bluestore_min_alloc_size_ssd": "4096"
}
# ceph daemon osd.8 config get  bluestore_min_alloc_size_hdd
{
    "bluestore_min_alloc_size_hdd": "65536"
}

I'm not sure how the NVMe is handled, by the way, but in the ceph-volume output
the crush device class is None; could that be an issue, i.e. could the OSD be
picking up the 0?
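
Though if I understand the fallback correctly, bluestore_min_alloc_size = 0 just
means "use the _hdd or _ssd variant depending on the device's rotational flag"
rather than an actual 0. I guess something like this would show which variant
applies (just a sketch, assuming the rotational flag propagates through
LVM/dm-crypt):

# cat /sys/block/sdb/queue/rotational
# ceph osd metadata 8 | grep rotational

If rotational is 0 the 4096 value should apply, if 1 the 65536 one, and as far
as I know the value is baked in at mkfs time anyway, so what counts is what it
was when the OSD was created.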

====== osd.8 =======

  [block]       
/dev/ceph-block-8a6b3305-e4ce-4383-b7ec-db755711adf7/osd-block-051d5b20-43b8-4e40-b153-cac6ef927c50

      block device              
/dev/ceph-block-8a6b3305-e4ce-4383-b7ec-db755711adf7/osd-block-051d5b20-43b8-4e40-b153-cac6ef927c50
      block uuid                9ENS35-vBBu-UYrf-MpyQ-KVNP-X2qs-BfkD9S
      cephx lockbox secret      AQBdWa5f2icyFBAAKtHPILCYT3BMBuTkPTyz1w==
      cluster fsid              5a07ec50-4eee-4336-aa11-46ca76edcc24
      cluster name              ceph
      crush device class        None
      db device                 
/dev/ceph-block-dbs-2e368ab4-8b28-4adb-82f8-41205365f630/osd-block-db-03327da9-0fcf-4d27-8c3c-48ce3e9badd1
      db uuid                   2Z5B5e-Hsgx-Y3qZ-QsuQ-skVd-Qtku-DWSMev
      encrypted                 1
      osd fsid                  cfe7634d-858f-408f-81ba-1c80fa43038d
      osd id                    8
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdb

  [db]          
/dev/ceph-block-dbs-2e368ab4-8b28-4adb-82f8-41205365f630/osd-block-db-03327da9-0fcf-4d27-8c3c-48ce3e9badd1

      block device              
/dev/ceph-block-8a6b3305-e4ce-4383-b7ec-db755711adf7/osd-block-051d5b20-43b8-4e40-b153-cac6ef927c50
      block uuid                9ENS35-vBBu-UYrf-MpyQ-KVNP-X2qs-BfkD9S
      cephx lockbox secret      AQBdWa5f2icyFBAAKtHPILCYT3BMBuTkPTyz1w==
      cluster fsid              5a07ec50-4eee-4336-aa11-46ca76edcc24
      cluster name              ceph
      crush device class        None
      db device                 
/dev/ceph-block-dbs-2e368ab4-8b28-4adb-82f8-41205365f630/osd-block-db-03327da9-0fcf-4d27-8c3c-48ce3e9badd1
      db uuid                   2Z5B5e-Hsgx-Y3qZ-QsuQ-skVd-Qtku-DWSMev
      encrypted                 1
      osd fsid                  cfe7634d-858f-408f-81ba-1c80fa43038d
      osd id                    8
      osdspec affinity
      type                      db
      vdo                       0
      devices                   /dev/nvme0n1



Regarding the memory target, I'll remove it from the config so it will pick up
the 4 GB default; let's see.
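
I'll just drop it from the [osd] section, restart the OSDs one by one and then
double check on the daemon, something like this (assuming the 4 GiB default):

# ceph daemon osd.8 config get osd_memory_target
{
    "osd_memory_target": "4294967296"
}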

With bluefs_buffered_io I'll be a bit careful, because in the past I messed
around with it and in the end had to clean up all the OSDs manually. But I will
update the cluster to 15.2.14 once I can get it out of ERROR, because that
release has bluefs_buffered_io enabled by default.
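
If I want to test it before the upgrade, I suppose I could flip it cluster-wide
with something like the following (as far as I know it only takes effect after
an OSD restart):

# ceph config set osd bluefs_buffered_io true
# ceph daemon osd.8 config get bluefs_buffered_io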

ty

From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
Sent: Thursday, September 30, 2021 4:43 PM
To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>; Christian Wuerdig 
<christian.wuer...@gmail.com>
Cc: Ceph Users <ceph-users@ceph.io>
Subject: Re: [ceph-users] Re: osd_memory_target=level0 ?

Hi,

As Christian said, osd_memory_target has nothing to do with RocksDB levels and
will certainly not decide when overspilling occurs. That said, I doubt any of
us here ever gave 32GB of RAM to any OSD, so unless you're sure that OSDs can
handle that much memory correctly, I would advise you to lower this value to
something more conservative like 4GB or 8GB of RAM. Just make sure your system
doesn't make use of swap. Also, since your clients do a lot of reads, check the
value of bluefs_buffered_io. Its default value changed a few times in the past
and recently went back to true. It might really help to have it set to true.
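
For example, something along these lines (just an illustration, pick the value
that suits you):

# ceph config set osd osd_memory_target 4294967296
# swapon --show      (should print nothing if no swap is in use)
# ceph daemon osd.0 config get bluefs_buffered_io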

Regarding overspilling, unless you tuned bluestore_rocksdb_options with custom 
max_bytes_for_level_base and max_bytes_for_level_multiplier, I think levels 
should still be roughly 3GB, 30GB and 300GB. I suppose you gave 600GB+ NVMe 
block.db partitions to each one of the 6 SSDs so you'd be good with that for 
most workloads I guess.
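
If I remember the RocksDB defaults correctly (max_bytes_for_level_base = 256 MB,
max_bytes_for_level_multiplier = 10), the level targets work out to roughly:

L1 = 256 MB
L2 = 256 MB x 10   = ~2.5 GB
L3 = 256 MB x 100  = ~25 GB
L4 = 256 MB x 1000 = ~256 GB

which is where the usual ~3/30/300 GB block.db sizing guidance comes from once
you add some headroom for compaction.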

Have you checked bluestore_min_alloc_size, bluestore_min_alloc_size_hdd, 
bluestore_min_alloc_size_ssd of your OSDs ? If I'm not mistaken, the default 
32k value has now changed to 4k. If your OSDs were created with 32k alloc size 
then it might explain the unexpected overspilling with a lot of objects in the 
cluster.
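
Keep in mind that min_alloc_size is fixed when the OSD is created, so changing
the config only affects newly deployed OSDs. If you needed to move existing
OSDs to 4k you would have to set it first, for example:

# ceph config set osd bluestore_min_alloc_size_hdd 4096

and then destroy and recreate the OSDs one by one (just a sketch, the exact
option to set depends on how your devices are detected).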

Hope that helps,

Regards,

Frédéric.

--

Best regards,



Frédéric Nass

Direction du Numérique

Sous-direction Infrastructures et Services



Tel: 03.72.74.11.35
On 30/09/2021 at 10:02, Szabo, Istvan (Agoda) wrote:

Hi Christian,



Yes, I know very well what spillover is; I've read that GitHub levelled-compaction
document multiple times a day over the last couple of days. (Answers to your
questions are after the cluster background information.)



About the cluster:

- users are continuously doing PUT/HEAD/DELETE operations

- cluster IOPS: 10-50k read, 5,000 write

- throughput: 142 MiB/s write and 662 MiB/s read

- not a containerized deployment, 3 clusters in multisite

- 3x mon/mgr/rgw (5 RGWs on each mon node, 15 altogether behind an HAProxy VIP)



7 nodes, each with the following config:

- 1x 1.92 TB NVMe for the index pool

- 6x 15.3 TB SAS SSD OSDs (HPE VO015360JWZJN read-intensive SSD, SKU P19911-B21
in this document: https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)

- 2x 1.92 TB NVMe block.db for the 6 SSDs (model: HPE KCD6XLUL1T92, SKU
P20131-B21 in this document:
https://h20195.www2.hpe.com/v2/getpdf.aspx/a00001288enw.pdf)

- OSDs deployed with dmcrypt

- data pool is on EC 4:2, other pools are on the SSDs with 3 replicas



The config file we have on all nodes (the mon nodes also have the RGW
definitions):

[global]

cluster network = 192.168.199.0/24

fsid = 5a07ec50-4eee-4336-aa11-46ca76edcc24

mon host = 
[v2:10.118.199.1:3300,v1:10.118.199.1:6789],[v2:10.118.199.2:3300,v1:10.118.199.2:6789],[v2:10.118.199.3:3300,v1:10.118.199.3:6789]

mon initial members = mon-2s01,mon-2s02,mon-2s03

osd pool default crush rule = -1

public network = 10.118.199.0/24

rgw_relaxed_s3_bucket_names = true

rgw_dynamic_resharding = false

rgw_enable_apis = s3, s3website, swift, swift_auth, admin, sts, iam, pubsub, 
notifications

#rgw_bucket_default_quota_max_objects = 1126400



[mon]

mon_allow_pool_delete = true

mon_pg_warn_max_object_skew = 0

mon_osd_nearfull_ratio = 70



[osd]

osd_max_backfills = 1

osd_recovery_max_active = 1

osd_recovery_op_priority = 1

osd_memory_target = 31490694621

# due to osd reboots, the configs below have been added to survive the suicide timeout

osd_scrub_during_recovery = true

osd_op_thread_suicide_timeout=3000

osd_op_thread_timeout=120



The stability issues I mean:

- The PG increase from 32 to 128 on the erasure-coded data pool is still in
progress (103 currently). The degraded objects always get stuck when it is
almost finished, then an OSD dies and the recovery process starts again.

- Compaction is happening all the time, so all the NVMe drives generate iowait
continuously because they are 100% utilized (iowait is around 1-3). If I try to
compact with ceph tell osd.x compact it never finishes; I can only stop it with
Ctrl+C.

- At the beginning, when we didn't have so many spilled-over disks, I didn't
mind it; actually I was happy about the spillover because the underlying SSDs
can take some load off the NVMe. But then the OSDs started to reboot and, I'd
say, collapse one by one, and whenever I checked which OSDs were collapsing, it
was always the ones that had spilled over. The op thread and suicide timeout
settings can keep the OSDs up a bit longer.

- Now ALL RGWs die as soon as one specific OSD goes down, and this causes a
total outage. There is nothing about it in the logs, neither in messages nor in
the RGW log; it just looks like the connections time out. From the users'
perspective it is unacceptable that they have to wait 1.5 hours until my manual
compaction finishes and I can start the OSD again.



Current cluster state ceph -s:

health: HEALTH_ERR

            12 OSD(s) experiencing BlueFS spillover

            4/1055038256 objects unfound (0.000%)

            noout flag(s) set

            Possible data damage: 2 pgs recovery_unfound

            Degraded data redundancy: 12341016/6328900227 objects degraded (0.195%), 16 pgs degraded, 21 pgs undersized

            4 pgs not deep-scrubbed in time



  services:

    mon: 3 daemons, quorum mon-2s01,mon-2s02,mon-2s03 (age 2M)

    mgr: mon-2s01(active, since 2M), standbys: mon-2s03, mon-2s02

    osd: 49 osds: 49 up (since 101m), 49 in (since 4d); 23 remapped pgs

         flags noout

    rgw: 15 daemons active (mon-2s01.rgw0, mon-2s01.rgw1, mon-2s01.rgw2, 
mon-2s01.rgw3, mon-2s01.rgw4, mon-2s02.rgw0, mon-2s02.rgw1, mon-2s02.rgw2, 
mon-2s02.rgw3, mon-2s02.rgw4, mon-2s03.rgw0, mon-2s03.rgw1, mon-2s03.rgw2, 
mon-2s03.rgw3, mon-2s03.rgw4)



  task status:



  data:

    pools:   9 pools, 425 pgs

    objects: 1.06G objects, 67 TiB

    usage:   159 TiB used, 465 TiB / 623 TiB avail

    pgs:     12032346/6328762449 objects degraded (0.190%)

             68127707/6328762449 objects misplaced (1.076%)

             4/1055015441 objects unfound (0.000%)

             397 active+clean

             13  active+undersized+degraded+remapped+backfill_wait

             4   active+undersized+remapped+backfill_wait

             4   active+clean+scrubbing+deep

             2   active+recovery_unfound+undersized+degraded+remapped

             2   active+remapped+backfill_wait

             1   active+clean+scrubbing

             1   active+undersized+remapped+backfilling

             1   active+undersized+degraded+remapped+backfilling



  io:

    client:   256 MiB/s rd, 94 MiB/s wr, 17.70k op/s rd, 2.75k op/s wr

    recovery: 16 MiB/s, 223 objects/s



Ty



-----Original Message-----

From: Christian Wuerdig <christian.wuer...@gmail.com>

Sent: Thursday, September 30, 2021 1:01 PM

To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>

Cc: Ceph Users <ceph-users@ceph.io>

Subject: Re: [ceph-users] osd_memory_target=level0 ?






Bluestore memory targets have nothing to do with spillover. It's already been 
said several times: The spillover warning is simply telling you that instead of 
writing data to your supposedly fast wal/blockdb device it's now hitting your 
slow device.
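
If you want to quantify it, the bluefs counters on each OSD show how much data
sits on the DB device vs the slow device, something like this (counter names
from memory):

# ceph daemon osd.8 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'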



You've stated previously that your fast device is nvme and your slow device is 
SSD. So the spill-over is probably less of a problem than you think. It's 
currently unclear what your actual problem is and why you think it's to do with 
spill-over.



What model are your NVMEs and SSDs - what IOPS can each sustain (4k random 
write direct IO), what's their current load? What are the actual problems that 
you are observing, i.e. what does "stability problems" actually mean?
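
A quick way to get a baseline is a short fio run directly against a spare/empty
device (careful, this is destructive on a device that holds data; the
parameters below are only an example):

fio --name=4krandwrite --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting \
    --filename=/dev/nvmeXnY

Then compare the sustained IOPS with what iostat shows under your normal load.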



On Thu, 30 Sept 2021 at 18:33, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:



Hi,



I am still suffering from the spilled-over disks and stability issues in 3 of
my clusters after uploading 6-900 million objects to the cluster (Octopus
15.2.10).



I've set the memory target to around 31-32 GB, so could the spillover issue be
coming from there?

With a mem target of 31 GB, the next level would be 310 GB and the one after
that would go to the underlying SSD disk, so the 4th level doesn't have space
on the NVMe.



Say I set it to the default 4 GB: levels 0-3 would add up to 444 GB, so it
should fit on the 600 GB LVM assigned on the NVMe for DB with WAL.



This is how it looks; e.g. osd.27 is still spilled over even after 2 manual
compactions :(



osd.1 spilled over 198 GiB metadata from 'db' device (303 GiB used of 596 GiB) 
to slow device

     osd.5 spilled over 251 GiB metadata from 'db' device (163 GiB used of 596 
GiB) to slow device

     osd.8 spilled over 61 GiB metadata from 'db' device (264 GiB used of 596 
GiB) to slow device

     osd.11 spilled over 260 GiB metadata from 'db' device (242 GiB used of 596 
GiB) to slow device

     osd.12 spilled over 149 GiB metadata from 'db' device (238 GiB used of 596 
GiB) to slow device

     osd.15 spilled over 259 GiB metadata from 'db' device (195 GiB used of 596 
GiB) to slow device

     osd.17 spilled over 10 GiB metadata from 'db' device (314 GiB used of 596 
GiB) to slow device

     osd.21 spilled over 324 MiB metadata from 'db' device (346 GiB used of 596 
GiB) to slow device

     osd.27 spilled over 12 GiB metadata from 'db' device (486 GiB used of 596 
GiB) to slow device

     osd.29 spilled over 61 GiB metadata from 'db' device (306 GiB used of 596 
GiB) to slow device

     osd.31 spilled over 59 GiB metadata from 'db' device (308 GiB used of 596 
GiB) to slow device

     osd.46 spilled over 69 GiB metadata from 'db' device (308 GiB used of 596 GiB) to slow device



Also, is there a way to speed up compaction? It takes 1-1.5 hours per OSD to compact.



Thank you
