[ceph-users] Re: Mounting A RBD Via Kernel Modules
Hey Mathew, One more thing out of curiosity can you send the output of blockdev --getbsz on the rbd dev and rbd info? I'm using 16TB rbd images without issue, but I haven't updated to reef .2 yet. Cheers, Curt On Sun, 24 Mar 2024, 11:12 duluxoz, wrote: > Hi Curt, > > Nope, no dropped packets or errors - sorry, wrong tree :-) > > Thanks for chiming in. > > On 24/03/2024 20:01, Curt wrote: > > I may be barking up the wrong tree, but if you run ip -s link show > > yourNicID on this server or your OSDs do you see any > > errors/dropped/missed? > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
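For reference, the two outputs Curt asks for can be gathered like this (pool and image names here are placeholders, substitute your own):

```shell
# Kernel-reported block size of the mapped RBD device (assumed /dev/rbd0)
blockdev --getbsz /dev/rbd0

# Image metadata: size, object size, feature flags, striping
rbd info my_pool/my_image
```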
[ceph-users] Re: Mounting A RBD Via Kernel Modules
I may be barking up the wrong tree, but if you run ip -s link show yourNicID on this server or your OSDs do you see any errors/dropped/missed? On Sun, 24 Mar 2024, 09:20 duluxoz, wrote: > Hi, > > Yeah, I've been testing various configurations since I sent my last > email - all to no avail. > > So I'm back to the start with a brand new 4T image which is rbdmapped to > /dev/rbd0. > > It's not formatted (yet) and so not mounted. > > Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs > /dev/rbd/my_pool/my_image) I get the errors I previously mentioned and the > resulting image then becomes unusable (in every sense of the word). > > If I run a fdisk -l (before trying the mkfs.xfs) the rbd image shows up > in the list - no, I don't actually do a full fdisk on the image. > > An rbd info my_pool:my_image shows the same expected values on both the > host and ceph cluster. > > I've tried this with a whole bunch of different sized images from 100G > to 4T and all fail in exactly the same way. (My previous successful 100G > test I haven't been able to reproduce). > > I've also tried all of the above using an "admin" CephX(sp?) account - I > can always connect via rbdmap, but as soon as I try an mkfs.xfs it > fails. This failure also occurs with mkfs.ext4 as well (all image sizes). > > The Ceph Cluster is good (self reported and there are other hosts > happily connected via CephFS) and this host also has a CephFS mapping > which is working. > > Between running experiments I've gone over the Ceph Doco (again) and I > can't work out what's going wrong. 
> > There's also nothing obvious/helpful jumping out at me from the > logs/journal (sample below): > > ~~~ > > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno > 524773 0~65536 result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno > 524772 65536~4128768 result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119 > callbacks suppressed > Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector > 4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2 > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno > 524774 0~65536 result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno > 524773 65536~4128768 result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1 > Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector > 4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2 > ~~~ > > Any ideas what I should be looking at? > > And thank you for the help :-) > > On 24/03/2024 17:50, Alexander E. Patrakov wrote: > > Hi, > > > > Please test again, it must have been some network issue. A 10 TB RBD > > image is used here without any problems. > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
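Since every write returns -1 regardless of the filesystem used, one way to narrow this down is to bypass mkfs entirely and test the block layer directly. A diagnostic sketch (device and image names assumed from the thread):

```shell
# In a second terminal, watch for rbd error lines while testing:
#   dmesg -w

# Bypass the filesystem: a single 4 MiB direct write to the mapped device.
# If this also fails, the problem is below the filesystem layer.
dd if=/dev/zero of=/dev/rbd0 bs=4M count=1 oflag=direct

# Check the image's feature flags (krbd rejects some features) and
# whether the client has been blocklisted by the cluster
rbd info my_pool/my_image
ceph osd blocklist ls
```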
[ceph-users] Re: PG stuck at recovery
EC 2+2 & 4+2 HDD only. On Tue, 20 Feb 2024, 00:25 Anthony D'Atri, wrote: > After wrangling with this myself, both with 17.2.7 and to an extent with > 17.2.5, I'd like to follow up here and ask: > > Those who have experienced this, were the affected PGs > > * Part of an EC pool? > * Part of an HDD pool? > * Both? > > > > > > You don't say anything about the Ceph version you are running. > > I had a similar issue with 17.2.7, and it seems to be an issue with > mclock, > > when I switched to wpq everything worked again. > > > > You can read more about it here > > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG > > > > - Kai Stian Olstad > > > > > > On Tue, Feb 06, 2024 at 06:35:26AM -, LeonGao wrote: > >> Hi community > >> > >> We have a new Ceph cluster deployment with 100 nodes. When we are > draining an OSD host from the cluster, we see a small amount of PGs that > cannot make any progress to the end. From the logs and metrics, it seems > like the recovery progress is stuck (0 recovery ops for several days). > Would like to get some ideas on this. Re-peering and OSD restarts do > mitigate the issue but we want to get to the root cause of it as > draining and recovery happen frequently. > >> > >> I have put some debugging information below. Any help is appreciated, > thanks! 
> >> > >> ceph -s > >> pgs: 4210926/7380034104 objects misplaced (0.057%) > >>41198 active+clean > >>71active+remapped+backfilling > >>12active+recovering > >> > >> One of the stuck PG: > >> 6.38f1 active+remapped+backfilling [313,643,727] > 313 [313,643,717] 313 > >> > >> PG query result: > >> > >> ceph pg 6.38f1 query > >> { > >> "snap_trimq": "[]", > >> "snap_trimq_len": 0, > >> "state": "active+remapped+backfilling", > >> "epoch": 246856, > >> "up": [ > >> 313, > >> 643, > >> 727 > >> ], > >> "acting": [ > >> 313, > >> 643, > >> 717 > >> ], > >> "backfill_targets": [ > >> "727" > >> ], > >> "acting_recovery_backfill": [ > >> "313", > >> "643", > >> "717", > >> "727" > >> ], > >> "info": { > >> "pgid": "6.38f1", > >> "last_update": "212333'38916", > >> "last_complete": "212333'38916", > >> "log_tail": "80608'37589", > >> "last_user_version": 38833, > >> "last_backfill": "MAX", > >> "purged_snaps": [], > >> "history": { > >> "epoch_created": 3726, > >> "epoch_pool_created": 3279, > >> "last_epoch_started": 243987, > >> "last_interval_started": 243986, > >> "last_epoch_clean": 220174, > >> "last_interval_clean": 220173, > >> "last_epoch_split": 3726, > >> "last_epoch_marked_full": 0, > >> "same_up_since": 238347, > >> "same_interval_since": 243986, > >> "same_primary_since": 3728, > >> "last_scrub": "212333'38916", > >> "last_scrub_stamp": "2024-01-29T13:43:10.654709+", > >> "last_deep_scrub": "212333'38916", > >> "last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+", > >> "last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+", > >> "prior_readable_until_ub": 0 > >> }, > >> "stats": { > >> "version": "212333'38916", > >> "reported_seq": 413425, > >> "reported_epoch": 246856, > >> "state": "active+remapped+backfilling", > >> "last_fresh": "2024-02-05T21:14:40.838785+", > >> "last_change": "2024-02-03T22:33:43.052272+", > >> "last_active": "2024-02-05T21:14:40.838785+", > >> "last_peered": "2024-02-05T21:14:40.838785+", > >> "last_clean": 
"2024-02-03T04:26:35.168232+", > >> "last_became_active": "2024-02-03T22:31:16.037823+", > >> "last_became_peered": "2024-02-03T22:31:16.037823+", > >> "last_unstale": "2024-02-05T21:14:40.838785+", > >> "last_undegraded": "2024-02-05T21:14:40.838785+", > >> "last_fullsized": "2024-02-05T21:14:40.838785+", > >> "mapping_epoch": 243986, > >> "log_start": "80608'37589", > >> "ondisk_log_start": "80608'37589", > >> "created": 3726, > >> "last_epoch_clean": 220174, > >> "parent": "0.0", > >> "parent_split_bits": 14, > >> "last_scrub": "212333'38916", > >> "last_scrub_stamp": "2024-01-29T13:43:10.654709+", > >> "last_deep_scrub": "212333'38916", > >> "last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+", > >> "last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+", > >> "objects_scrubbed": 17743, > >> "log_size": 1327, > >> "log_dups_size": 3000, > >>
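The workaround Kai Stian Olstad describes (switching the OSD op queue from mclock back to wpq) can be applied roughly like this; the scheduler is only read at OSD start, so OSDs must be restarted afterwards. A sketch, assuming a cephadm-managed cluster:

```shell
# Switch the op queue scheduler cluster-wide
ceph config set osd osd_op_queue wpq

# Restart OSDs so the new scheduler takes effect,
# e.g. one daemon at a time (replace <id>):
#   ceph orch daemon restart osd.<id>

# Verify the setting
ceph config get osd osd_op_queue
```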
[ceph-users] Re: Problems adding a new host via orchestration.
I don't use rocky, so stab in the dark and probably not the issue, but could selinux be blocking the process? Really long shot, but python3 is in the standard location? So if you run python3 --version as your ceph user it returns? Probably not much help, but figured I'd throw it out there. On Mon, 5 Feb 2024, 16:54 Gary Molenkamp, wrote: > I have verified the server's expected hostname (with `hostname`) matches > the hostname I am trying to use. > Just to be sure, I also ran: > cephadm check-host --expect-hostname > and it returns: > Hostname "" matches what is expected. > > On the current admin server where I am trying to add the host, the host > is reachable, the shortname even matches proper IP with dns search order. > Likewise, on the server where the mgr is running, I am able to confirm > reachability and DNS resolution for the new server as well. > > I thought this may be a DNS/name resolution issue as well, but I don't > see any errors in my setup wrt to host naming. > > Thanks > Gary > > > On 2024-02-03 06:46, Eugen Block wrote: > > Hi, > > > > I found this blog post [1] which reports the same error message. It > > seems a bit misleading because it appears to be about DNS. Can you check > > > > cephadm check-host --expect-hostname > > > > Or is that what you already tried? It's not entirely clear how you > > checked the hostname. > > > > Regards, > > Eugen > > > > [1] > > > https://blog.mousetech.com/ceph-distributed-file-system-for-the-enterprise/ceph-bogus-error-cannot-allocate-memory/ > > > > Zitat von Gary Molenkamp : > > > >> Happy Friday all. I was hoping someone could point me in the right > >> direction or clarify any limitations that could be impacting an issue > >> I am having. > >> > >> I'm struggling to add a new set of hosts to my ceph cluster using > >> cephadm and orchestration. 
When trying to add a host: > >> "ceph orch host add 172.31.102.41 --labels _admin" > >> returns: > >> "Error EINVAL: Can't communicate with remote host > >> `172.31.102.41`, possibly because python3 is not installed there: > >> [Errno 12] Cannot allocate memory" > >> > >> I've verified that the ceph ssh key works to the remote host, host's > >> name matches that returned from `hostname`, python3 is installed, and > >> "/usr/sbin/cephadm prepare-host" on the new hosts returns "host is > >> ok".In addition, the cluster ssh key works between hosts and the > >> existing hosts are able to ssh in using the ceph key. > >> > >> The existing ceph cluster is Pacific release using docker based > >> containerization on RockyLinux8 base OS. The new hosts are > >> RockyLinux9 based, with the cephadm being installed from Quincy release: > >> ./cephadm add-repo --release quincy > >> ./cephadm install > >> I did try installing cephadm from the Pacific release by changing the > >> repo to el8, but that did not work either. > >> > >> Is there a limitation is mixing RL8 and RL9 container hosts under > >> Pacific? Does this same limitation exist under Quincy? Is there a > >> python version dependency? > >> The reason for RL9 on the new hosts is to stage upgrading the OS's > >> for the cluster. I did this under Octopus for moving from Centos7 to > >> RL8. > >> > >> Thanks and I appreciate any feedback/pointers. 
> >> Gary > >> > >> > >> I've added the log trace here in case that helps (from `ceph log last > >> cephadm`) > >> > >> > >> > >> 2024-02-02T14:22:32.610048+ mgr.storage01.oonvfl (mgr.441023307) > >> 4957871 : cephadm [ERR] Can't communicate with remote host > >> `172.31.102.41`, possibly because python3 is not installed there: > >> [Errno 12] Cannot allocate memory > >> Traceback (most recent call last): > >> File "/usr/share/ceph/mgr/cephadm/serve.py", line 1524, in > >> _remote_connection > >> conn, connr = self.mgr._get_connection(addr) > >> File "/usr/share/ceph/mgr/cephadm/module.py", line 1370, in > >> _get_connection > >> sudo=True if self.ssh_user != 'root' else False) > >> File "/lib/python3.6/site-packages/remoto/backends/__init__.py", > >> line 35, in __init__ > >> self.gateway = self._make_gateway(hostname) > >> File "/lib/python3.6/site-packages/remoto/backends/__init__.py", > >> line 46, in _make_gateway > >> self._make_connection_string(hostname) > >> File "/lib/python3.6/site-packages/execnet/multi.py", line 133, in > >> makegateway > >> io = gateway_io.create_io(spec, execmodel=self.execmodel) > >> File "/lib/python3.6/site-packages/execnet/gateway_io.py", line > >> 121, in create_io > >> io = Popen2IOMaster(args, execmodel) > >> File "/lib/python3.6/site-packages/execnet/gateway_io.py", line 21, > >> in __init__ > >> self.popen = p = execmodel.PopenPiped(args) > >> File "/lib/python3.6/site-packages/execnet/gateway_base.py", line > >> 184, in PopenPiped > >> return self.subprocess.Popen(args, stdout=PIPE, stdin=PIPE) > >> File "/lib64/python3.6/subprocess.py", line 729, in __init__ > >>
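One thing worth trying alongside the checks above: cephadm lets you pass the address explicitly when adding a host, which takes DNS resolution inside the mgr container out of the equation. A sketch (hostname is a placeholder, IP taken from the message):

```shell
# On the new host: confirm cephadm's view of the prerequisites
cephadm check-host --expect-hostname <shortname>

# From an admin node: add the host with both its name and address,
# so the mgr connects by IP instead of resolving the name itself
ceph orch host add <shortname> 172.31.102.41 --labels _admin
```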
[ceph-users] Re: RBD Image Returning 'Unknown Filesystem LVM2_member' On Mount - Help Please
Out of curiosity, how are you mapping the rbd? Have you tried using guestmount? I'm just spitballing, I have no experience with your issue, so probably not much help. On Mon, 5 Feb 2024, 10:05 duluxoz, wrote: > ~~~ > Hello, > I think that /dev/rbd* devices are filtered "out" or not filtered "in" by > the filter > option in the devices section of /etc/lvm/lvm.conf. > So pvscan (pvs, vgs and lvs) don't look at your device. > ~~~ > > Hi Gilles, > > So the lvm filter from the lvm.conf file is set to the default of `filter > = [ "a|.*|" ]`, which accepts every block device, so no luck there :-( > > > ~~~ > For Ceph based LVM volumes, you would do this to import: > Map every one of the RBDs to the host > Include this in /etc/lvm/lvm.conf: > types = [ "rbd", 1024 ] > pvscan > vgscan > pvs > vgs > If you see the VG: > vgimportclone -n /dev/rbd0 /dev/rbd1 ... --import > Now you should be able to vgchange -a y and see the LVs > ~~~ > > Hi Alex, > > Did the above as you suggested - the rbd devices (3 of them, none of which > were originally part of an lvm on the ceph servers - at least, not set up > manually by me) still do not show up using pvscan, etc. > > So I still can't mount any of them (not without re-creating a fs, anyway, > and thus losing the data I'm trying to read/import) - they all return the > same error message (see original post). > > Anyone got any other ideas? :-) > > Cheers > > Dulux-Oz
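For anyone following along, Alex's import procedure can be laid out end to end like this (a sketch; the VG name and device paths are assumed, and `types = [ "rbd", 1024 ]` must already be in the devices section of /etc/lvm/lvm.conf):

```shell
# Rebuild LVM's device cache so the rbd devices are scanned
pvscan --cache
pvs && vgs && lvs          # the PVs on /dev/rbd* should now appear

# If the VG name collides with one already active on this host,
# clone it under a new name first:
vgimportclone -n rbd_vg /dev/rbd0 /dev/rbd1 --import

# Activate and mount read-only first, to protect the data being recovered
vgchange -ay rbd_vg
mount -o ro /dev/rbd_vg/<lv_name> /mnt
```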
[ceph-users] physical vs osd performance
Hello all, Looking at Grafana reports, can anyone point me to documentation that outlines physical vs OSD metrics? https://docs.ceph.com/en/latest/monitoring/ gives some basic info, but I'm trying to get a better understanding. For instance, say physical latency is 20ms and OSD latency is 200ms (made-up numbers for this example): why the huge difference? The same question applies to bytes or IOPS; I'm just using latency as an example. Thanks, Curt
[ceph-users] Re: EC Profiles & DR
Hi Patrick, Yes K and M are chunks, but the default crush map is a chunk per host, which is probably the best way to do it, but I'm no expert. I'm not sure why you would want to do a crush map with 2 chunks per host and min size 4, as it's just asking for trouble at some point, in my opinion. Anyway, take a look at this post if you're interested in doing 2 chunks per host; it will give you an idea of crushmap setup: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NB3M22GNAC7VNWW7YBVYTH6TBZOYLTWA/ . Regards, Curt On Wed, Dec 6, 2023 at 6:26 PM Patrick Begou < patrick.be...@univ-grenoble-alpes.fr> wrote: > Le 06/12/2023 à 00:11, Rich Freeman a écrit : > > On Tue, Dec 5, 2023 at 6:35 AM Patrick Begou > > wrote: > >> Ok, so I've misunderstood the meaning of failure domain. If there is no > >> way to request using 2 osd/node and node as failure domain, with 5 nodes > >> k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a > >> raid1 setup. A little bit better than replication in the point of view > >> of global storage capacity. > >> > > I'm not sure what you mean by requesting 2osd/node. If the failure > > domain is set to the host, then by default k/m refer to hosts, and the > > PGs will be spread across all OSDs on all hosts, but with any > > particular PG only being present on one OSD on each host. You can get > > fancy with device classes and crush rules and such and be more > > specific with how they're allocated, but that would be the typical > > behavior. > > > > Since k/m refer to hosts, then k+m must be less than or equal to the > > number of hosts or you'll have a degraded pool because there won't be > > enough hosts to allocate them all. It won't ever stack them across > > multiple OSDs on the same host with that configuration. 
> > > > k=2,m=2 with min=3 would require at least 4 hosts (k+m), and would > > allow you to operate degraded with a single host down, and the PGs > > would become inactive but would still be recoverable with two hosts > > down. While strictly speaking only 4 hosts are required, you'd do > > better to have more than that since then the cluster can immediately > > recover from a loss, assuming you have sufficient space. As you say > > it is no more space-efficient than RAID1 or size=2, and it suffers > > write amplification for modifications, but it does allow recovery > > after the loss of up to two hosts, and you can operate degraded with > > one host down which allows for somewhat high availability. > > > Hi Rich, > > My understood was that k and m were for EC chunks not hosts. Of > course if k and m are hosts the best choice would be k=2 and m=2. > > When Christian wrote: > /For example if you run an EC=4+2 profile on 3 hosts you can structure > your crushmap so that you have 2 chunks per host. This means even if one > host is down you are still guaranteed to have 4 chunks available./ > > This is that I had thought before (and using 5 nodes instead of 3 as the > Christian's example). But it does not match what you explain if k and m > are nodes. > > I'm a little bit confused with crushmap settings. > > Patrick > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
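As a concrete sketch of Rich's k=2, m=2 scenario (profile and pool names are placeholders): with crush-failure-domain=host, each of the four chunks lands on a different host, so at least four hosts are needed, and min_size=3 keeps the pool writable with one host down.

```shell
# EC profile: 2 data + 2 coding chunks, one chunk per host
ceph osd erasure-code-profile set ec22 \
    k=2 m=2 crush-failure-domain=host

# Create an EC pool using that profile
ceph osd pool create ecpool 64 64 erasure ec22

# min_size 3: I/O continues with a single host down
ceph osd pool set ecpool min_size 3
```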
[ceph-users] shrink db size
Hello, As far as I can tell there is no way to shrink a db/wal after creation. I recently added a new server to my cluster with SSDs for the wal/db and just used the ceph dashboard for deployment. I did not specify a db size, which is my mistake; it seems by default it uses "block.db has no size configuration, will fallback to using as much as possible". So now my issue is I added 2 more drives, but with no space left on the SSDs I get "2 fast devices were passed, but none are available". I had seven 4TB HDDs and two 2TB SSDs and added 2 more 4TB HDDs. Distribution is 4 and 3 between the 2 SSDs. I just want to confirm the best option is to set bluestore_block_db_size/wal_size, zap, and let them be recreated with a size of 300 and 2. I chose those just because I have space. I'm not going to do them all at the same time. Cheers, Curt
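A sketch of the plan described above, assuming the "300 and 2" refer to GiB (byte values shown in the comments); the OSD id is a placeholder:

```shell
# 300 GiB block.db, 2 GiB block.wal
ceph config set osd bluestore_block_db_size  322122547200  # 300 * 2^30
ceph config set osd bluestore_block_wal_size 2147483648    #   2 * 2^30

# Then remove and zap one OSD at a time, letting the orchestrator
# redeploy it with the new DB/WAL sizes:
ceph orch osd rm <osd-id> --zap
```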
[ceph-users] Questions since updating to 18.0.2
Hello, We recently upgraded our cluster to version 18 and I've noticed some things that I'd like feedback on before I go down a rabbit hole for non-issues. cephadm was used for the upgrade and there were no issues. Cluster is 56 OSDs, all spinners, for right now only used for RBD images. I've noticed an increase in active scrubs/deep scrubs. I don't remember seeing a large amount before, usually around 20-30 scrubs and 15 deep I think; now I will have 70 scrubs and 70 deep scrubs happening. I thought these were limited to 1 per OSD, or am I misunderstanding osd_max_scrubs? Everything on the cluster is currently at default values. The other thing I've noticed is since the upgrade it seems like any time backfill happens the client IO drops, but neither is high to begin with: 30MiB/s read/write client IO drops to 10-15 with 200MiB/s backfill. Before upgrading, backfill would be hitting 500-600 with 30 client IO. I realize lots of things could affect this and it could be separate from the cluster, I'm still investigating, but wanted to mention it in case someone could recommend a check or some change to Reef that could cause this. mclock profile is client_io. Thanks, Curt
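A quick way to check both suspicions from the message above (if I recall correctly, the osd_max_scrubs default was raised from 1 to 3 in recent releases, which would explain seeing more concurrent scrubs after the upgrade):

```shell
# What the cluster is actually using for concurrent scrubs per OSD
ceph config get osd osd_max_scrubs
# Restore the old one-per-OSD behaviour if desired:
#   ceph config set osd osd_max_scrubs 1

# Which mclock profile governs the client-vs-recovery IO split
ceph config get osd osd_mclock_profile
```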
[ceph-users] Re: Blank dashboard
Hello, Never mind, sorry to disturb everyone. I just disabled it and re-enabled it and it now works; the console errors are gone. This is version 17.2.6 btw. If anyone has any insight on what might have caused this, it would be interesting to know. Thanks, Curt On Mon, Jul 31, 2023 at 8:04 PM Curt wrote: > Hello, > > This is a strange one for me. My ceph dashboard just stopped loading, > nothing but a white page. I don't see anything in the logs and on browser > side the only error I see is Failed to load resource: > net::ERR_CONTENT_LENGTH_MISMATCH in Chrome and Uncaught SyntaxError: expected > expression, got end of script for Firefox on the > file main.ddd4de0999172734.js. > > Nothing in the log files, put mgr up to 20 on the log level. Any > suggestions? > > Thanks, > Curt
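For anyone who hits the same blank page, the disable/re-enable cycle described above is just:

```shell
ceph mgr module disable dashboard
ceph mgr module enable dashboard

# If the page still fails to load, failing over to a standby mgr
# forces a clean restart of the module:
#   ceph mgr fail
```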
[ceph-users] Blank dashboard
Hello, This is a strange one for me. My ceph dashboard just stopped loading, nothing but a white page. I don't see anything in the logs, and on the browser side the only error I see is Failed to load resource: net::ERR_CONTENT_LENGTH_MISMATCH in Chrome and Uncaught SyntaxError: expected expression, got end of script for Firefox on the file main.ddd4de0999172734.js. Nothing in the log files, even with the mgr log level at 20. Any suggestions? Thanks, Curt
[ceph-users] Re: OSD stuck down
Hello, Have you increased the osd debug level to get more output? Does dmesg on the host machine report anything? Are there any SMART errors on the drive? Regards, Curt On Thu, Jun 15, 2023, 13:30 Nicola Mori wrote: > Hi Dario, > > I think the connectivity is ok. My cluster has just a public interface, > and all of the other services on the same machine (osds and mgr) work > flawlessly so I guess the connectivity is ok. Or in other words, I don't > know what to look for in the network since all the other services work, > do you have any suggestion? > > Nicola
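The three checks suggested above could look something like this (osd id and backing device are placeholders):

```shell
# Raise OSD debug logging temporarily on the stuck daemon
ceph tell osd.<id> config set debug_osd 20
# ... reproduce the problem, then:

# Kernel messages on the host (timestamps, last entries)
dmesg -T | tail -n 50

# SMART health of the OSD's backing disk
smartctl -a /dev/sdX

# Drop logging back to the default when done
ceph tell osd.<id> config set debug_osd 1/5
```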
[ceph-users] Re: Help needed to configure erasure coding LRC plugin
Hi, I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of knowledge on some of the step rules. Thanks Curt On Tue, May 16, 2023 at 11:09 AM Michel Jouvin < michel.jou...@ijclab.in2p3.fr> wrote: > Hi Eugen, > > Yes, sure, no problem to share it. I attach it to this email (as it may > clutter the discussion if inline). > > If somebody on the list has some clue on the LRC plugin, I'm still > interested by understand what I'm doing wrong! > > Cheers, > > Michel > > Le 04/05/2023 à 15:07, Eugen Block a écrit : > > Hi, > > > > I don't think you've shared your osd tree yet, could you do that? > > Apparently nobody else but us reads this thread or nobody reading this > > uses the LRC plugin. ;-) > > > > Thanks, > > Eugen > > > > Zitat von Michel Jouvin : > > > >> Hi, > >> > >> I had to restart one of my OSD server today and the problem showed up > >> again. This time I managed to capture "ceph health detail" output > >> showing the problem with the 2 PGs: > >> > >> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 > >> pgs down > >> pg 56.1 is down, acting > >> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE] > >> pg 56.12 is down, acting > >> > [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128] > >> > >> I still doesn't understand why, if I am supposed to survive to a > >> datacenter failure, I cannot survive to 3 OSDs down on the same host, > >> hosting shards for the PG. 
In the second case it is only 2 OSDs down > >> but I'm surprised they don't seem in the same "group" of OSD (I'd > >> expected all the the OSDs of one datacenter to be in the same groupe > >> of 5 if the order given really reflects the allocation done... > >> > >> Still interested by some explanation on what I'm doing wrong! Best > >> regards, > >> > >> Michel > >> > >> Le 03/05/2023 à 10:21, Eugen Block a écrit : > >>> I think I got it wrong with the locality setting, I'm still limited > >>> by the number of hosts I have available in my test cluster, but as > >>> far as I got with failure-domain=osd I believe k=6, m=3, l=3 with > >>> locality=datacenter could fit your requirement, at least with > >>> regards to the recovery bandwidth usage between DCs, but the > >>> resiliency would not match your requirement (one DC failure). That > >>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one > >>> parity chunk) across three DCs, in total 12 chunks. The min_size=7 > >>> would not allow an entire DC to go down, I'm afraid, you'd have to > >>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm > >>> still not sure if I got it right this time, but maybe you're better > >>> off without the LRC plugin with the limited number of hosts. Instead > >>> you could use the jerasure plugin with a profile like k=4 m=5 > >>> allowing an entire DC to fail without losing data access (we have > >>> one customer using that). > >>> > >>> Zitat von Eugen Block : > >>> > >>>> Hi, > >>>> > >>>> disclaimer: I haven't used LRC in a real setup yet, so there might > >>>> be some misunderstandings on my side. But I tried to play around > >>>> with one of my test clusters (Nautilus). Because I'm limited in the > >>>> number of hosts (6 across 3 virtual DCs) I tried two different > >>>> profiles with lower numbers to get a feeling for how that works. 
> >>>> > >>>> # first attempt > >>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc > >>>> k=4 m=2 l=3 crush-failure-domain=host > >>>> > >>>> For every third OSD one parity chunk is added, so 2 more chunks to > >>>> store ==> 8 chunks in total. Since my failure-domain is host and I > >>>> only have 6 I get incomplete PGs. > >
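Eugen's second attempt (k=6, m=3, l=3 with datacenter locality) would look roughly like this as commands; the profile and pool names are placeholders, and the min_size choice follows his caveat that 7 would block I/O with a whole DC down:

```shell
# LRC profile: 6 data + 3 coding chunks, one local parity per group of 3,
# locality groups pinned to datacenters, chunks spread across OSDs
ceph osd erasure-code-profile set lrc633 plugin=lrc \
    k=6 m=3 l=3 \
    crush-failure-domain=osd crush-locality=datacenter

ceph osd pool create lrcpool 64 64 erasure lrc633
ceph osd pool set lrcpool min_size 6   # allow I/O with one DC down
```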
[ceph-users] Re: Help needed to configure erasure coding LRC plugin
Hello, What is your current setup, 1 server pet data center with 12 osd each? What is your current crush rule and LRC crush rule? On Fri, Apr 28, 2023, 12:29 Michel Jouvin wrote: > Hi, > > I think I found a possible cause of my PG down but still understand why. > As explained in a previous mail, I setup a 15-chunk/OSD EC pool (k=9, > m=6) but I have only 12 OSD servers in the cluster. To workaround the > problem I defined the failure domain as 'osd' with the reasoning that as > I was using the LRC plugin, I had the warranty that I could loose a site > without impact, thus the possibility to loose 1 OSD server. Am I wrong? > > Best regards, > > Michel > > Le 24/04/2023 à 13:24, Michel Jouvin a écrit : > > Hi, > > > > I'm still interesting by getting feedback from those using the LRC > > plugin about the right way to configure it... Last week I upgraded > > from Pacific to Quincy (17.2.6) with cephadm which is doing the > > upgrade host by host, checking if an OSD is ok to stop before actually > > upgrading it. I had the surprise to see 1 or 2 PGs down at some points > > in the upgrade (happened not for all OSDs but for every > > site/datacenter). Looking at the details with "ceph health detail", I > > saw that for these PGs there was 3 OSDs down but I was expecting the > > pool to be resilient to 6 OSDs down (5 for R/W access) so I'm > > wondering if there is something wrong in our pool configuration (k=9, > > m=6, l=5). > > > > Cheers, > > > > Michel > > > > Le 06/04/2023 à 08:51, Michel Jouvin a écrit : > >> Hi, > >> > >> Is somebody using LRC plugin ? > >> > >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as > >> jerasure k=9, m=6 in terms of protection against failures and that I > >> should use k=9, m=6, l=5 to get a level of resilience >= jerasure > >> k=9, m=6. 
The example in the documentation (k=4, m=2, l=3) suggests > >> that this LRC configuration gives something better than jerasure k=4, > >> m=2 as it is resilient to 3 drive failures (but not 4 if I understood > >> properly). So how many drives can fail in the k=9, m=6, l=5 > >> configuration first without loosing RW access and second without > >> loosing data? > >> > >> Another thing that I don't quite understand is that a pool created > >> with this configuration (and failure domain=osd, locality=datacenter) > >> has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd > >> expected something ~10 (depending on answer to the previous question)... > >> > >> Thanks in advance if somebody could provide some sort of > >> authoritative answer on these 2 questions. Best regards, > >> > >> Michel > >> > >> Le 04/04/2023 à 15:53, Michel Jouvin a écrit : > >>> Answering to myself, I found the reason for 2147483647: it's > >>> documented as a failure to find enough OSD (missing OSDs). And it is > >>> normal as I selected different hosts for the 15 OSDs but I have only > >>> 12 hosts! > >>> > >>> I'm still interested by an "expert" to confirm that LRC k=9, m=3, > >>> l=4 configuration is equivalent, in terms of redundancy, to a > >>> jerasure configuration with k=9, m=6. > >>> > >>> Michel > >>> > >>> Le 04/04/2023 à 15:26, Michel Jouvin a écrit : > Hi, > > As discussed in another thread (Crushmap rule for multi-datacenter > erasure coding), I'm trying to create an EC pool spanning 3 > datacenters (datacenters are present in the crushmap), with the > objective to be resilient to 1 DC down, at least keeping the > readonly access to the pool and if possible the read-write access, > and have a storage efficiency better than 3 replica (let say a > storage overhead <= 2). > > In the discussion, somebody mentioned LRC plugin as a possible > jerasure alternative to implement this without tweaking the > crushmap rule to implement the 2-step OSD allocation. 
I looked at > the documentation > (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) > but I have some questions if someone has experience/expertise with > this LRC plugin. > > I tried to create a rule for using 5 OSDs per datacenter (15 in > total), with 3 (9 in total) being data chunks and others being > coding chunks. For this, based of my understanding of examples, I > used k=9, m=3, l=4. Is it right? Is this configuration equivalent, > in terms of redundancy, to a jerasure configuration with k=9, m=6? > > The resulting rule, which looks correct to me, is: > > > > { > "rule_id": 6, > "rule_name": "test_lrc_2", > "ruleset": 6, > "type": 3, > "min_size": 3, > "max_size": 15, > "steps": [ > { > "op": "set_chooseleaf_tries", > "num": 5 > }, > { > "op": "set_choose_tries", > "num": 100 > }, > { >
[ceph-users] Re: Very slow backfilling
p_num 32 autoscale_mode on last_change > 14979 flags hashpspool stripe_width 0 application rgw > pool 11 'ncy.rgw.control' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change > 14981 flags hashpspool stripe_width 0 application rgw > pool 12 'ncy.rgw.meta' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 15105 > lfor 0/15105/15103 flags hashpspool stripe_width 0 pg_autoscale_bias 4 > pg_num_min 8 application rgw > pool 13 'ncy.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 15236 > lfor 0/15236/15234 flags hashpspool stripe_width 0 pg_autoscale_bias 4 > pg_num_min 8 application rgw > pool 14 'ncy.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 > object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change > 15241 flags hashpspool stripe_width 0 application rgw > > (EC32 is a erasure coding with 3 datas and 2 codes) > > No output with "ceph osd pool autoscale-status" > > Le jeu. 2 mars 2023 à 15:02, Curt a écrit : > >> Forgot to do a reply all. >> >> What does >> >> ceph osd df >> ceph osd dump | grep pool return? >> >> Are you using auto scaling? 289pg with 272tb of data and 60 osds, that >> seems like 3-4 pg per osd at almost 1TB each. Unless I'm thinking of this >> wrong. >> >> On Thu, Mar 2, 2023, 17:37 Joffrey wrote: >> >>> My Ceph Version is 17.2.5 and all configuration about osd_scrub* are >>> defaults. I tried some updates on osd-max-backfills but no change. >>> I have many HDD with NVME for db and all are connected in a 25G network. >>> >>> Yes, it's the same PG since 4 days. >>> >>> I got a failure on a HDD and get many days of recovery+backfilling last >>> 2 >>> weeks. Perhaps the 'not in time' warning is related to this. >>> >>> 'Jof >>> >>> Le jeu. 
2 mars 2023 à 14:25, Anthony D'Atri a >>> écrit : >>> >>> > Run `ceph health detail`. >>> > >>> > Is it the same PG backfilling for a long time, or a different one over >>> > time? >>> > >>> > That it’s remapped makes me think that what you’re seeing is the >>> balancer >>> > doing its job. >>> > >>> > As far as the scrubbing, do you limit the times when scrubbing can >>> happen? >>> > Are these HDDs? EC? >>> > >>> > > On Mar 2, 2023, at 07:20, Joffrey wrote: >>> > > >>> > > Hi, >>> > > >>> > > I have many 'not {deep-}scrubbed in time' and a1 PG >>> remapped+backfilling >>> > > and I don't understand why this backfilling is taking so long. >>> > > >>> > > root@hbgt-ceph1-mon3:/# ceph -s >>> > > cluster: >>> > >id: c300532c-51fa-11ec-9a41-0050569c3b55 >>> > >health: HEALTH_WARN >>> > >15 pgs not deep-scrubbed in time >>> > >13 pgs not scrubbed in time >>> > > >>> > > services: >>> > >mon: 3 daemons, quorum >>> hbgt-ceph1-mon1,hbgt-ceph1-mon2,hbgt-ceph1-mon3 >>> > > (age 36h) >>> > >mgr: hbgt-ceph1-mon2.nteihj(active, since 2d), standbys: >>> > > hbgt-ceph1-mon1.thrnnu, hbgt-ceph1-mon3.gmfzqm >>> > >osd: 60 osds: 60 up (since 13h), 60 in (since 13h); 1 remapped pgs >>> > >rgw: 3 daemons active (3 hosts, 2 zones) >>> > > >>> > > data: >>> > >pools: 13 pools, 289 pgs >>> > >objects: 67.74M objects, 127 TiB >>> > >usage: 272 TiB used, 769 TiB / 1.0 PiB avail >>> > >pgs: 288 active+clean >>> > > 1 active+remapped+backfilling >>> > > >>> > > io: >>> > >client: 3.3 KiB/s rd, 1.5 MiB/s wr, 3 op/s rd, 8 op/s wr >>> > >recovery: 790 KiB/s, 0 objects/s >>> > > >>> > > >>> > > What can I do to understand this slow recovery (is it the backfill >>> > action ?) 
>>> > > >>> > > Thank you >>> > > 'Jof >>> > > ___ >>> > > ceph-users mailing list -- ceph-users@ceph.io >>> > > To unsubscribe send an email to ceph-users-le...@ceph.io >>> > >>> >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >>> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
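Some back-of-envelope arithmetic on why this recovery crawls, using the numbers from the ceph -s output above. This is a rough sketch, not a diagnosis; the commented-out config options at the end are standard OSD settings worth checking, but verify them against your release:

```shell
# Numbers from the ceph -s output in this thread.
used_tib=272
pgs=289

# With so few PGs, each one holds a huge slice of the data.
tib_per_pg=$(awk -v u="$used_tib" -v p="$pgs" 'BEGIN { printf "%.2f", u/p }')
echo "$tib_per_pg TiB per PG"

# One PG backfilling at the observed 790 KiB/s takes on the order of:
days=$(awk -v t="$tib_per_pg" \
    'BEGIN { printf "%.0f", t * 1024 * 1024 * 1024 / 790 / 86400 }')
echo "~$days days for a single PG"

# Knobs to inspect on a live cluster (commented out here):
# ceph config get osd osd_max_backfills
# ceph config set osd osd_max_backfills 2
```

With roughly a terabyte per PG and only one PG moving, a two-week backfill is about what the arithmetic predicts; more PGs per pool would spread the same data over many smaller backfill units.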
[ceph-users] Re: Very slow backfilling
Forgot to do a reply-all. What do ceph osd df and ceph osd dump | grep pool return? Are you using autoscaling? 289 PGs with 272 TB of data and 60 OSDs, that seems like 3-4 PGs per OSD at almost 1 TB each. Unless I'm thinking of this wrong. On Thu, Mar 2, 2023, 17:37 Joffrey wrote: > My Ceph version is 17.2.5 and all osd_scrub* settings are at their > defaults. I tried some updates on osd-max-backfills but no change. > I have many HDDs with NVMe for db and all are connected on a 25G network. > > Yes, it's the same PG for 4 days now. > > I had a failure on an HDD and got many days of recovery+backfilling over the last 2 > weeks. Perhaps the 'not in time' warning is related to this. > > 'Jof > > On Thu, 2 Mar 2023 at 14:25, Anthony D'Atri wrote: > > > Run `ceph health detail`. > > > > Is it the same PG backfilling for a long time, or a different one over > > time? > > > > That it's remapped makes me think that what you're seeing is the balancer > > doing its job. > > > > As far as the scrubbing, do you limit the times when scrubbing can > happen? > > Are these HDDs? EC? > > > > > On Mar 2, 2023, at 07:20, Joffrey wrote: > > > > > > Hi, > > > > > > I have many 'not {deep-}scrubbed in time' and 1 PG > remapped+backfilling > > > and I don't understand why this backfilling is taking so long.
> > > > > > root@hbgt-ceph1-mon3:/# ceph -s > > > cluster: > > >id: c300532c-51fa-11ec-9a41-0050569c3b55 > > >health: HEALTH_WARN > > >15 pgs not deep-scrubbed in time > > >13 pgs not scrubbed in time > > > > > > services: > > >mon: 3 daemons, quorum > hbgt-ceph1-mon1,hbgt-ceph1-mon2,hbgt-ceph1-mon3 > > > (age 36h) > > >mgr: hbgt-ceph1-mon2.nteihj(active, since 2d), standbys: > > > hbgt-ceph1-mon1.thrnnu, hbgt-ceph1-mon3.gmfzqm > > >osd: 60 osds: 60 up (since 13h), 60 in (since 13h); 1 remapped pgs > > >rgw: 3 daemons active (3 hosts, 2 zones) > > > > > > data: > > >pools: 13 pools, 289 pgs > > >objects: 67.74M objects, 127 TiB > > >usage: 272 TiB used, 769 TiB / 1.0 PiB avail > > >pgs: 288 active+clean > > > 1 active+remapped+backfilling > > > > > > io: > > >client: 3.3 KiB/s rd, 1.5 MiB/s wr, 3 op/s rd, 8 op/s wr > > >recovery: 790 KiB/s, 0 objects/s > > > > > > > > > What can I do to understand this slow recovery (is it the backfill > > action ?) > > > > > > Thanks you > > > > > > 'Jof > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Upgrade not doing anything...
Did any of your cluster get partial upgrade? What about ceph -W cephadm, does that return anything or just hang, also what about ceph health detail? You can always try ceph orch upgrade pause and then orch upgrade resume, might kick something loose, so to speak. On Tue, Feb 28, 2023, 10:39 Jeremy Hansen wrote: > { > "target_image": "quay.io/ceph/ceph:v16.2.11", > "in_progress": true, > "services_complete": [], > "progress": "", > "message": "" > } > > Hasn’t changed in the past two hours. > > -jeremy > > > > On Monday, Feb 27, 2023 at 10:22 PM, Curt wrote: > What does Ceph orch upgrade status return? > > On Tue, Feb 28, 2023, 10:16 Jeremy Hansen wrote: > >> I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the documentation, >> I cut and paste the orchestrator command to begin the upgrade, but I >> mistakenly pasted directly from the docs and it initiated an “upgrade” to >> 16.2.6. I stopped the upgrade per the docs and reissued the command >> specifying 16.2.11 but now I see no progress in ceph -s. Cluster is >> healthy but it feels like the upgrade process is just paused for some >> reason. >> >> Thanks! >> -jeremy >> >> >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
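For the archives, a sketch of the sequence that usually un-sticks this. The ceph orch commands are the ones already named in the thread and need a live cluster, so they are commented out; the version comparison at the end is the part that runs here, and it is the check that would have caught the original mistake of pasting 16.2.6 over a 16.2.7 cluster:

```shell
# Inspect and restart the upgrade (live-cluster commands, commented out):
# ceph orch upgrade status
# ceph health detail
# ceph -W cephadm
# ceph orch upgrade stop
# ceph orch upgrade start --ceph-version 16.2.11

# Cheap sanity check that the target really is newer than what is running:
current=16.2.7
target=16.2.11
highest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | tail -n 1)
[ "$highest" = "$target" ] && echo "ok: $target is newer than $current"
```

Note the version-aware `sort -V`: a plain string sort would call 16.2.7 "newer" than 16.2.11, which is exactly the trap the original cut-and-paste fell into.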
[ceph-users] Re: Upgrade not doing anything...
What does Ceph orch upgrade status return? On Tue, Feb 28, 2023, 10:16 Jeremy Hansen wrote: > I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the documentation, > I cut and paste the orchestrator command to begin the upgrade, but I > mistakenly pasted directly from the docs and it initiated an “upgrade” to > 16.2.6. I stopped the upgrade per the docs and reissued the command > specifying 16.2.11 but now I see no progress in ceph -s. Cluster is > healthy but it feels like the upgrade process is just paused for some > reason. > > Thanks! > -jeremy > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: rbd map error: couldn't connect to the cluster!
The "allow class rbd metadata_list" clause needs to be inside the double quotes with your other osd caps. On Mon, Feb 27, 2023, 16:55 Thomas Schneider <74cmo...@gmail.com> wrote: > Hi, > > I get an error running this ceph auth get-or-create syntax: > > # ceph auth get-or-create client.${rbdName} mon "allow r" osd "allow > rwx pool ${rbdPoolName} object_prefix rbd_data.${imageID}; allow rwx > pool ${rbdPoolName} object_prefix rbd_header.${imageID}; allow rx pool > ${rbdPoolName} object_prefix rbd_id.${rbdName}"; allow class rbd > metadata_list pool ${rbdPoolName} -o > /etc/ceph/ceph.client.${rbdName}.keyring; > [client.VCT] > key = AQDGp/xj5EKrFRAArU7SyOVF8NFUC4lRCWwmCQ== > -bash: allow: command not found. > > THX > > On 26.02.2023 at 12:46, Ilya Dryomov wrote: > > allow class rbd metadata_list pool hdb_backup > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
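Spelled out, since the error message is easy to misread: bash ended the osd capability string at the quote before `allow`, then tried to run `allow` as a command, hence "-bash: allow: command not found". A sketch of the corrected invocation, with stand-in values for the thread's variables:

```shell
# Stand-in values; the real ones come from the earlier messages.
rbdName=VCT
rbdPoolName=hdb_backup
imageID=b768d4baac048b

# The whole osd capability list must be ONE quoted argument, including the
# "allow class rbd metadata_list" clause that was left outside before.
osd_caps="allow rwx pool ${rbdPoolName} object_prefix rbd_data.${imageID}; \
allow rwx pool ${rbdPoolName} object_prefix rbd_header.${imageID}; \
allow rx pool ${rbdPoolName} object_prefix rbd_id.${rbdName}; \
allow class rbd metadata_list pool ${rbdPoolName}"

# Corrected command (commented out; needs a live cluster):
# ceph auth get-or-create client.${rbdName} mon "allow r" osd "$osd_caps" \
#     -o /etc/ceph/ceph.client.${rbdName}.keyring

# All four "allow" clauses now live in one shell word:
echo "$osd_caps" | grep -o 'allow' | wc -l
```

Building the caps string in a variable first, as above, makes the quoting mistake much harder to repeat.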
[ceph-users] Re: rbd map error: couldn't connect to the cluster!
What does 'rbd ls hdb_backup' return? Or is your pool VCT? If that's the case, they should be switched: 'rbd map VCT/hdb_backup --id VCT --keyring /etc/ceph/ceph.client.VCT.keyring' On Thu, Feb 23, 2023 at 6:54 PM Thomas Schneider <74cmo...@gmail.com> wrote: > Hm... I'm not sure about the correct rbd command syntax, but I thought > it's correct. > > Anyway, using a different ID fails, too: > # rbd map hdb_backup/VCT --id client.VCT --keyring > /etc/ceph/ceph.client.VCT.keyring > rbd: couldn't connect to the cluster! > > # rbd map hdb_backup/VCT --id VCT --keyring > /etc/ceph/ceph.client.VCT.keyring > 2023-02-23T15:46:16.848+0100 7f222d19d700 -1 > librbd::image::GetMetadataRequest: 0x7f220c001ef0 handle_metadata_list: > failed to retrieve image metadata: (1) Operation not permitted > 2023-02-23T15:46:16.848+0100 7f222d19d700 -1 > librbd::image::RefreshRequest: failed to retrieve pool metadata: (1) > Operation not permitted > 2023-02-23T15:46:16.848+0100 7f222d19d700 -1 librbd::image::OpenRequest: > failed to refresh image: (1) Operation not permitted > 2023-02-23T15:46:16.848+0100 7f222c99c700 -1 librbd::ImageState: > 0x5569d8a16ba0 failed to open image: (1) Operation not permitted > rbd: error opening image VCT: (1) Operation not permitted > > > On 23.02.2023 at 15:30, Eugen Block wrote: > > You don't specify which client in your rbd command: > > > >> rbd map hdb_backup/VCT --id client --keyring > >> /etc/ceph/ceph.client.VCT.keyring > > > > Have you tried this (not sure about upper-case client names, haven't > > tried that)? > > > > rbd map hdb_backup/VCT --id VCT --keyring > > /etc/ceph/ceph.client.VCT.keyring > > > > > > Quoting Thomas Schneider <74cmo...@gmail.com>:
> >> > >> Checking on Ceph server the required permission for relevant keyring > >> exists: > >> # ceph-authtool -l /etc/ceph/ceph.client.VCT.keyring > >> [client.VCT] > >> key = AQBj3LZjNGn/BhAAG8IqMyH0WLKi4kTlbjiW7g== > >> > >> # ceph auth get client.VCT > >> [client.VCT] > >> key = AQBj3LZjNGn/BhAAG8IqMyH0WLKi4kTlbjiW7g== > >> caps mon = "allow r" > >> caps osd = "allow rwx pool hdb_backup object_prefix > >> rbd_data.b768d4baac048b; allow rwx pool hdb_backup object_prefix > >> rbd_header.b768d4baac048b; allow rx pool hdb_backup object_prefix > >> rbd_id.VCT" > >> exported keyring for client.VCT > >> > >> > >> Can you please advise how to fix this error? > >> > >> > >> THX > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > > > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
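Reading the log lines above, the open fails exactly at handle_metadata_list, and the caps dumped for client.VCT contain no "allow class" clause, so the keyring plausibly just lacks that permission (my reading of the thread, not confirmed on this cluster). A hedged sketch of the cap change, reusing the pool/image IDs quoted above:

```shell
# Extend the existing caps with the class permission discussed later in
# this thread (commented out; needs a live cluster and admin rights):
# ceph auth caps client.VCT \
#     mon 'allow r' \
#     osd 'allow rwx pool hdb_backup object_prefix rbd_data.b768d4baac048b; allow rwx pool hdb_backup object_prefix rbd_header.b768d4baac048b; allow rx pool hdb_backup object_prefix rbd_id.VCT; allow class rbd metadata_list pool hdb_backup'
# rbd map hdb_backup/VCT --id VCT --keyring /etc/ceph/ceph.client.VCT.keyring

# Also note the --id form: rbd prepends "client." itself, so pass the
# bare name, not the full entity:
id=VCT
entity="client.${id}"
echo "map with: rbd map hdb_backup/VCT --id $id   (auth entity $entity)"
```

That also explains the earlier "couldn't connect to the cluster" attempt: `--id client.VCT` becomes entity client.client.VCT, which has no key at all.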
[ceph-users] Re: Telegraf plugin reset
Hello, If I had to guess: the ':' introduces a port number, like :443, so the module is expecting an int and you are passing the string 'https'. Try changing https to 443. On Thu, Sep 22, 2022 at 8:24 PM Nikhil Mitra (nikmitra) wrote: > Greetings, > > We are trying to use the telegraf module to send metrics to InfluxDB and > we keep facing the below error. Any help will be appreciated, thank you. > > # ceph telegraf config-show > Error EIO: Module 'telegraf' has experienced an error and cannot handle > commands: invalid literal for int() with base 10: 'https' > > # ceph config dump | grep -i telegraf > mgr advanced mgr/telegraf/address > tcp://test.xyz.com:https * > > ceph version 14.2.22-110.el7cp > > -- > Regards, > Nikhil Mitra > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
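The traceback says the module feeds everything after the last colon of the address to int(), which is why a service name like https breaks it. A small illustration of that parse; the commented-out config commands at the end are the likely fix, hedged since I have not run this module on 14.2:

```shell
# Extract the port the same way the module's int() sees it:
port_of() { echo "${1##*:}"; }   # everything after the last ':'

port_of 'tcp://test.xyz.com:https'   # -> https (int() chokes on this)
port_of 'tcp://test.xyz.com:443'     # -> 443   (parses fine)

# Likely fix on the cluster (commented out; verify the option name on 14.2):
# ceph config set mgr mgr/telegraf/address 'tcp://test.xyz.com:443'
# ceph mgr module disable telegraf && ceph mgr module enable telegraf
```

Disabling and re-enabling the module after the change clears the "has experienced an error" state so it starts handling commands again.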
[ceph-users] ceph orch device ls extents
Hello, I ran into an interesting error today and I'm not sure of the best way to fix it. When I run 'ceph orch device ls', I get the following error on every HD: "Insufficient space (<10 extents) on vgs, LVM detected, locked". Here's the output of ceph-volume lvm list, in case it helps: == osd.0 == [block] /dev/ceph-efb83a91-3c7b-4329-babc-017b0a00e95a/osd-block-b017780d-38f9-4da7-b9df-2da66e1aa0fd block device /dev/ceph-efb83a91-3c7b-4329-babc-017b0a00e95a/osd-block-b017780d-38f9-4da7-b9df-2da66e1aa0fd block uuid 8kIdfD-kQSh-Mhe4-zRIL-b1Pf-PTaC-CVosbE cephx lockbox secret cluster fsid 1684fe88-aae0-11ec-9593-df430e3982a0 cluster name ceph crush device class None encrypted 0 osd fsid b017780d-38f9-4da7-b9df-2da66e1aa0fd osd id 0 osdspec affinity dashboard-admin-1648152609405 type block vdo 0 devices /dev/sdb == osd.10 == [block] /dev/ceph-a0e85035-cfe2-4070-b58a-a88ec964794c/osd-block-3c353f8c-ab0f-4589-9e98-4f840e86341a block device /dev/ceph-a0e85035-cfe2-4070-b58a-a88ec964794c/osd-block-3c353f8c-ab0f-4589-9e98-4f840e86341a block uuid gvvrMV-O98L-P6Sl-dnJT-NVwM-P85e-Reqql4 cephx lockbox secret cluster fsid 1684fe88-aae0-11ec-9593-df430e3982a0 cluster name ceph crush device class None encrypted 0 osd fsid 3c353f8c-ab0f-4589-9e98-4f840e86341a osd id 10 osdspec affinity dashboard-admin-1648152609405 type block vdo 0 devices /dev/sdh == osd.12 == lvdisplay --- Logical volume --- LV Path /dev/ceph-a0e85035-cfe2-4070-b58a-a88ec964794c/osd-block-3c353f8c-ab0f-4589-9e98-4f840e86341a LV Name osd-block-3c353f8c-ab0f-4589-9e98-4f840e86341a VG Name ceph-a0e85035-cfe2-4070-b58a-a88ec964794c LV UUID gvvrMV-O98L-P6Sl-dnJT-NVwM-P85e-Reqql4 LV Write Access read/write LV Creation host, time hyperion02, 2022-03-24 20:12:17 + LV Status available # open 24 LV Size <1.82 TiB Current LE 476932 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:4 Let me know if you need any other information.
Thanks, Curt ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph recovery network speed
On Wed, Jun 29, 2022 at 11:22 PM Curt wrote: > > > On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman wrote: > >> On 6/29/22 19:34, Curt wrote: >> > Hi Stefan, >> > >> > Thank you, that definitely helped. I bumped it to 20% for now and >> that's >> > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s. I'll >> > see how that runs and then increase it a bit more if the cluster >> handles >> > it ok. >> > >> > Do you think it's worth enabling scrubbing while backfilling? >> >> If the cluster can cope with the extra load, sure. If it slows down the >> backfilling to levels that are too slow ... temporarily disable it. >> >> Since >> > this is going to take a while. I do have 1 inconsistent PG that has now >> > become 10 as it splits. >> >> Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd >> set nobackfill) and have it handle this ASAP: ceph pg repair $pg. >> Something is wrong, and you want to have this fixed sooner rather than >> later. >> > > When I try to run a repair nothing happens, if I try to list > inconsistent-obj I get No scrub information available for 12.12. If I tell > it to run a deep scrub, nothing. I'll set debug and see what I can find in > the logs. > Just to give a quick update. This one was my fault, I missed a flag. Once set correctly, scrubbed and repaired. It's now back to adding more PG's, which continue to get a bit faster as it expands. I'm now up to pg_num 1362 and pgp_num 1234, with backfills happening at 250-300 Mb/s 60-70 Objects/s. Thanks for all the help. > >> Not sure what hardware you have, but you might benefit from disabling >> write caches, see this link: >> >> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches >> >> Thanks, I'm disabling cache and I'll see if it helps at all. > Gr. Stefan >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph recovery network speed
On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman wrote: > On 6/29/22 19:34, Curt wrote: > > Hi Stefan, > > > > Thank you, that definitely helped. I bumped it to 20% for now and that's > > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s. I'll > > see how that runs and then increase it a bit more if the cluster handles > > it ok. > > > > Do you think it's worth enabling scrubbing while backfilling? > > If the cluster can cope with the extra load, sure. If it slows down the > backfilling to levels that are too slow ... temporarily disable it. > > Since > > this is going to take a while. I do have 1 inconsistent PG that has now > > become 10 as it splits. > > Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd > set nobackfill) and have it handle this ASAP: ceph pg repair $pg. > Something is wrong, and you want to have this fixed sooner rather than > later. > When I try to run a repair nothing happens, if I try to list inconsistent-obj I get No scrub information available for 12.12. If I tell it to run a deep scrub, nothing. I'll set debug and see what I can find in the logs. > > Not sure what hardware you have, but you might benefit from disabling > write caches, see this link: > > https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches > > Gr. Stefan > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph recovery network speed
Hi Stefan, Thank you, that definitely helped. I bumped it to 20% for now and that's giving me around 124 PGs backfilling at 187 MiB/s, 47 objects/s. I'll see how that runs and then increase it a bit more if the cluster handles it ok. Do you think it's worth enabling scrubbing while backfilling? Since this is going to take a while, I do have 1 inconsistent PG that has now become 10 as it splits. ceph health detail HEALTH_ERR 21 scrub errors; Possible data damage: 10 pgs inconsistent; 2 pgs not deep-scrubbed in time [ERR] OSD_SCRUB_ERRORS: 21 scrub errors [ERR] PG_DAMAGED: Possible data damage: 10 pgs inconsistent pg 12.12 is active+clean+inconsistent, acting [28,1,37,0] pg 12.32 is active+clean+inconsistent, acting [37,3,14,22] pg 12.52 is active+clean+inconsistent, acting [4,33,7,23] pg 12.72 is active+remapped+inconsistent+backfilling, acting [37,3,14,22] pg 12.92 is active+remapped+inconsistent+backfilling, acting [28,1,37,0] pg 12.b2 is active+remapped+inconsistent+backfilling, acting [37,3,14,22] pg 12.d2 is active+clean+inconsistent, acting [4,33,7,23] pg 12.f2 is active+remapped+inconsistent+backfilling, acting [37,3,14,22] pg 12.112 is active+clean+inconsistent, acting [28,1,37,0] pg 12.132 is active+clean+inconsistent, acting [37,3,14,22] [WRN] PG_NOT_DEEP_SCRUBBED: 2 pgs not deep-scrubbed in time pg 4.13 not deep-scrubbed since 2022-06-16T03:15:16.758943+ pg 7.1 not deep-scrubbed since 2022-06-16T20:51:12.211259+ Thanks, Curt On Wed, Jun 29, 2022 at 5:53 PM Stefan Kooman wrote: > On 6/29/22 15:14, Curt wrote: > > > > > > Hi Stefan, > > > > Good to know. I see the default of 0.05 for misplaced_ratio. What do > > you recommend would be a safe number to increase it to? > > It depends. It might be safe to put it to 1. But I would slowly increase > it, have the manager increase pgp_num and see how the cluster copes with > the increased load. If you have hardly any client workload you might > bump this ratio quite a bit.
At some point you would need to increase > osd max backfill to avoid having PGs waiting on backfill. > > Gr. Stefan > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
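To make the knob concrete: target_max_misplaced_ratio caps the fraction of objects the mgr will allow to be misplaced at once, so pgp_num splits are throttled to roughly that share of the pool at a time. The ratio is measured on objects, not PGs, so the PG-level arithmetic below for the 2048-PG target in this thread is an approximation for intuition only:

```shell
pg_target=2048
for ratio in 0.05 0.20 1.00; do
    awk -v p="$pg_target" -v r="$ratio" \
        'BEGIN { printf "ratio %.2f -> roughly %d PGs remapped at once\n", r, p * r }'
done

# Raising it (commented out; needs a live cluster):
# ceph config set mgr target_max_misplaced_ratio 0.20
```

At the 0.05 default only a few percent of the pool moves per step, which is why the split from pgp_num 98 toward 2048 looks so gradual.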
[ceph-users] Re: Ceph recovery network speed
On Wed, Jun 29, 2022 at 4:42 PM Stefan Kooman wrote: > On 6/29/22 11:21, Curt wrote: > > On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder wrote: > > > >> Hi, > >> > >> did you wait for PG creation and peering to finish after setting pg_num > >> and pgp_num? They should be right on the value you set and not lower. > >> > > Yes, the only thing going on was backfill. It's still just slowly expanding > pg > > and pgp nums. I even ran the set command again. Here's the current > info: > > ceph osd pool get EC-22-Pool all > > size: 4 > > min_size: 3 > > pg_num: 226 > > pgp_num: 98 > > This is coded in the mons and works like that from nautilus onwards: > > src/mon/OSDMonitor.cc > > ... > if (osdmap.require_osd_release < ceph_release_t::nautilus) { > // pre-nautilus osdmap format; increase pg_num directly > assert(n > (int)p.get_pg_num()); > // force pre-nautilus clients to resend their ops, since they > // don't understand pg_num_target changes form a new interval > p.last_force_op_resend_prenautilus = pending_inc.epoch; > // force pre-luminous clients to resend their ops, since they > // don't understand that split PGs now form a new interval. > p.last_force_op_resend_preluminous = pending_inc.epoch; > p.set_pg_num(n); > } else { > // set targets; mgr will adjust pg_num_actual and pgp_num later. > // make pgp_num track pg_num if it already matches. if it is set > // differently, leave it different and let the user control it > // manually. > if (p.get_pg_num_target() == p.get_pgp_num_target()) { > p.set_pgp_num_target(n); > } > p.set_pg_num_target(n); > } > ... > > So, when pg_num and pgp_num are the same at the moment pg_num is increased, it > will slowly change pgp_num. If pgp_num is different (smaller, as it > cannot be bigger than pg_num) it will not touch pgp_num. > > You might speed up this process by increasing "target_max_misplaced_ratio" > > Gr. Stefan > Hi Stefan, Good to know. I see the default of 0.05 for misplaced_ratio. What would you recommend as a safe number to increase it to?
Thanks, Curt ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
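The OSDMonitor branch quoted above boils down to a small rule, restated here as a toy sketch (my paraphrase of the quoted code, not the real implementation): setting pg_num moves the pgp_num target along with it only when the two targets currently match, and the mgr then steps the actual pg_num/pgp_num toward those targets gradually:

```shell
# Toy restatement of the quoted nautilus+ logic, using the pool's
# starting values from this thread.
pg_target=226
pgp_target=226

set_pg_num() {
    n=$1
    if [ "$pg_target" -eq "$pgp_target" ]; then
        pgp_target=$n        # pgp_num target tracks pg_num
    fi                       # (if they differ, pgp_num is left alone)
    pg_target=$n
}

set_pg_num 2048
echo "pg_num_target=$pg_target pgp_num_target=$pgp_target"
# Both targets jump to 2048; the observed pg_num 226 / pgp_num 98 are the
# mgr still walking the actual values toward those targets.
```

This is why re-running the set command changed nothing: the targets were already correct, and only the mgr's throttled stepping remained.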
[ceph-users] Re: Ceph recovery network speed
On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder wrote: > Hi, > > did you wait for PG creation and peering to finish after setting pg_num > and pgp_num? They should be right on the value you set and not lower. > Yes, only thing going on was backfill. It's still just slowly expanding pg and pgp nums. I even ran the set command again. Here's the current info ceph osd pool get EC-22-Pool all size: 4 min_size: 3 pg_num: 226 pgp_num: 98 crush_rule: EC-22-Pool hashpspool: true allow_ec_overwrites: true nodelete: false nopgchange: false nosizechange: false write_fadvise_dontneed: false noscrub: false nodeep-scrub: false use_gmt_hitset: 1 erasure_code_profile: EC-22-Pro fast_read: 0 pg_autoscale_mode: off eio: false bulk: false > > > How do you set the upmap balancer per pool? > > I'm afraid the answer is RTFM. I don't use it, but I believe to remember > one could configure it for equi-distribution of PGs for each pool. > > Ok, I'll dig around some more. I glanced at the balancer page and didn't see it. > Whenever you grow the cluster, you should make the same considerations > again and select numbers of PG per pool depending on number of objects, > capacity and performance. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Curt > Sent: 28 June 2022 16:33:24 > To: Frank Schilder > Cc: Robert Gallop; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: Ceph recovery network speed > > Hi Frank, > > Thank you for the thorough breakdown. I have increased the pg_num and > pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary > pool with the most data. It looks like ceph slowly scales the pg up even > with autoscaling off, since I see target_pg_num 2048, pg_num 199. 
> > root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048 > set pool 12 pg_num to 2048 > root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048 > set pool 12 pgp_num to 2048 > root@cephmgr:/# ceph osd pool get EC-22-Pool all > size: 4 > min_size: 3 > pg_num: 199 > pgp_num: 71 > crush_rule: EC-22-Pool > hashpspool: true > allow_ec_overwrites: true > nodelete: false > nopgchange: false > nosizechange: false > write_fadvise_dontneed: false > noscrub: false > nodeep-scrub: false > use_gmt_hitset: 1 > erasure_code_profile: EC-22-Pro > fast_read: 0 > pg_autoscale_mode: off > eio: false > bulk: false > > This cluster will be growing quit a bit over the next few months. I am > migrating data from their old Giant cluster to a new one, by the time I'm > done it should be 16 hosts with about 400TB of data. I'm guessing I'll have > to increase pg again later when I start adding more servers to the cluster. > > I will look into if SSD's are an option. How do you set the upmap > balancer per pool? Looking at ceph balancer status my mode is already > upmap. > > Thanks again, > Curt > > On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder fr...@dtu.dk>> wrote: > Hi Curt, > > looking at what you sent here, I believe you are the victim of "the law of > large numbers really only holds for large numbers". In other words, the > statistics of small samples is biting you. The PG numbers of your pools are > so low that they lead to a very large imbalance of data- and IO placement. > In other words, in your cluster a few OSDs receive the majority of IO > requests and bottleneck the entire cluster. > > If I see this correctly, the PG num per drive varies from 14 to 40. That's > an insane imbalance. Also, on your EC pool PG_num is 128 but PGP_num is > only 48. The autoscaler is screwing it up for you. It will slowly increase > the number of active PGs, causing continuous relocation of objects for a > very long time. 
> > I think the recovery speed you see for 8 objects per second is not too bad > considering that you have an HDD only cluster. The speed does not increase, > because it is a small number of PGs sending data - a subset of the 32 you > had before. In addition, due to the imbalance of PGs per OSD, only a small > number of PGs will be able to send data. You will need patience to get out > of this corner. > > The first thing I would do is look at which pools are important for your > workload in the long run. I see 2 pools having a significant number of > objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40 > times the number of objects and bytes as default.rgw.buckets.data. I would > scale both up in PG count with emphasis on EC-22-Pool. > > Your cluster can safely operate between 1100 and 2200 PGs with replication > <=4. If you don't plan to create more large pools, a good choice of > distributin
[ceph-users] Re: Ceph recovery network speed
Hi Frank, Thank you for the thorough breakdown. I have increased the pg_num and pgp_num to 1024 to start on the EC-22 pool. That is going to be my primary pool with the most data. It looks like Ceph slowly scales pg_num up even with autoscaling off, since I see target_pg_num 2048, pg_num 199. root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048 set pool 12 pg_num to 2048 root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048 set pool 12 pgp_num to 2048 root@cephmgr:/# ceph osd pool get EC-22-Pool all size: 4 min_size: 3 pg_num: 199 pgp_num: 71 crush_rule: EC-22-Pool hashpspool: true allow_ec_overwrites: true nodelete: false nopgchange: false nosizechange: false write_fadvise_dontneed: false noscrub: false nodeep-scrub: false use_gmt_hitset: 1 erasure_code_profile: EC-22-Pro fast_read: 0 pg_autoscale_mode: off eio: false bulk: false This cluster will be growing quite a bit over the next few months. I am migrating data from their old Giant cluster to a new one; by the time I'm done it should be 16 hosts with about 400 TB of data. I'm guessing I'll have to increase PGs again later when I start adding more servers to the cluster. I will look into whether SSDs are an option. How do you set the upmap balancer per pool? Looking at ceph balancer status, my mode is already upmap. Thanks again, Curt On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder wrote: > Hi Curt, > > looking at what you sent here, I believe you are the victim of "the law of > large numbers really only holds for large numbers". In other words, the > statistics of small samples is biting you. The PG numbers of your pools are > so low that they lead to a very large imbalance of data- and IO placement. > In other words, in your cluster a few OSDs receive the majority of IO > requests and bottleneck the entire cluster. > > If I see this correctly, the PG num per drive varies from 14 to 40. That's > an insane imbalance. Also, on your EC pool PG_num is 128 but PGP_num is > only 48.
The autoscaler is screwing it up for you. It will slowly increase > the number of active PGs, causing continuous relocation of objects for a > very long time. > > I think the recovery speed you see for 8 objects per second is not too bad > considering that you have an HDD only cluster. The speed does not increase, > because it is a small number of PGs sending data - a subset of the 32 you > had before. In addition, due to the imbalance of PGs per OSD, only a small > number of PGs will be able to send data. You will need patience to get out > of this corner. > > The first thing I would do is look at which pools are important for your > workload in the long run. I see 2 pools having a significant number of > objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40 > times the number of objects and bytes as default.rgw.buckets.data. I would > scale both up in PG count with emphasis on EC-22-Pool. > > Your cluster can safely operate between 1100 and 2200 PGs with replication > <=4. If you don't plan to create more large pools, a good choice of > distributing this capacity might be > > EC-22-Pool: 1024 PGs (could be pushed up to 2048) > default.rgw.buckets.data: 256 PGs > > That's towards the lower end of available PGs. Please make your own > calculation and judgement. > > If you have settled on target numbers, change the pool sizes in one go, > that is, set PG_num and PGP_num to the same value right away. You might > need to turn autoscaler off for these 2 pools. The rebalancing will take a > long time and also not speed up, because the few sending PGs are the > bottleneck, not the receiving ones. You will have to sit it out. > > The goal is that, in the future, recovery and re-balancing are improved. > In my experience, a reasonably high PG count will also reduce latency of > client IO. > > Next thing to look at is distribution of PGs per OSD. 
> This has an enormous performance impact, because a few too-busy OSDs can
> throttle an entire cluster (it's always the slowest disk that wins). I
> use the very simple reweight-by-utilization method, but my pools do not
> share OSDs as yours do. You might want to try the upmap balancer per
> pool to get PGs per pool evenly spread out over OSDs.
>
> Lastly, if you can afford it and your hosts have a slot left, consider
> buying one enterprise SSD per host for the meta-data pools to get this
> IO away from the HDDs. If you buy a bunch of 128G or 256G SATA SSDs, you
> can probably place everything except the EC-22-Pool on these drives,
> separating completely.
>
> Hope that helps, and maybe someone else has ideas as well?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Curt
> Sent: 27 Ju
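The arithmetic behind Frank's "1100 to 2200 PGs" estimate can be checked in a few lines. This is a sketch of the usual rule of thumb (100-200 PGs per OSD, divided by the pool's replication/EC size); the OSD count comes from the thread (4 hosts, 11 OSDs each), and the per-OSD target is a guideline, not a Ceph-enforced limit:

```python
# Back-of-envelope PG budget for this cluster, per the rule of thumb
# Frank cites: 100-200 PGs per OSD, divided by the effective size
# (EC 2+2 places 4 shards per PG, like replication size 4).
osds = 4 * 11                # 44 OSDs (4 hosts x 11 OSDs, from the thread)
size = 4                     # shards per PG for EC 2+2

low = osds * 100 // size     # lower end of the usable PG budget
high = osds * 200 // size    # upper end

print(low, high)             # 1100 2200, matching Frank's estimate
```

With that budget, giving 1024 PGs to the dominant EC-22-Pool and 256 to default.rgw.buckets.data, as suggested above, sits at the conservative end of the range.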
[ceph-users] Re: Ceph recovery network speed
         298 GiB    5 KiB  1.4 GiB  1.5 TiB  16.05  0.61  26  up  osd.23
24  hdd  1.81940  1.0  1.8 TiB  735 GiB  733 GiB    8 KiB  2.3 GiB  1.1 TiB  39.45  1.50  33  up  osd.24
25  hdd  1.81940  1.0  1.8 TiB  519 GiB  517 GiB    5 KiB  1.4 GiB  1.3 TiB  27.85  1.06  26  up  osd.25
26  hdd  1.81940  1.0  1.8 TiB  483 GiB  481 GiB  614 KiB  1.7 GiB  1.3 TiB  25.94  0.99  28  up  osd.26
27  hdd  1.81940  1.0  1.8 TiB  226 GiB  225 GiB  1.5 MiB  1.0 GiB  1.6 TiB  12.11  0.46  17  up  osd.27
28  hdd  1.81940  1.0  1.8 TiB  443 GiB  441 GiB   24 KiB  1.5 GiB  1.4 TiB  23.76  0.91  21  up  osd.28
29  hdd  1.81940  1.0  1.8 TiB  801 GiB  799 GiB    7 KiB  2.2 GiB  1.0 TiB  42.98  1.64  31  up  osd.29
30  hdd  1.81940  1.0  1.8 TiB  523 GiB  522 GiB  174 KiB  1.2 GiB  1.3 TiB  28.09  1.07  29  up  osd.30
31  hdd  1.81940  1.0  1.8 TiB  322 GiB  321 GiB    4 KiB  1.2 GiB  1.5 TiB  17.30  0.66  26  up  osd.31
44  hdd  1.81940  1.0  1.8 TiB  541 GiB  540 GiB  136 KiB  1.4 GiB  1.3 TiB  29.06  1.11  24  up  osd.44
-9       20.01337   -   20 TiB  5.3 TiB  5.2 TiB   25 MiB   16 GiB   15 TiB  26.25  1.00   -       host hyperion04
33  hdd  1.81940  1.0  1.8 TiB  466 GiB  465 GiB  469 KiB  1.4 GiB  1.4 TiB  25.02  0.95  28  up  osd.33
34  hdd  1.81940  1.0  1.8 TiB  508 GiB  506 GiB    2 KiB  1.8 GiB  1.3 TiB  27.28  1.04  30  up  osd.34
35  hdd  1.81940  1.0  1.8 TiB  521 GiB  520 GiB    2 KiB  1.4 GiB  1.3 TiB  27.98  1.07  32  up  osd.35
36  hdd  1.81940  1.0  1.8 TiB  872 GiB  870 GiB    3 KiB  2.3 GiB  991 GiB  46.81  1.78  40  up  osd.36
37  hdd  1.81940  1.0  1.8 TiB  443 GiB  441 GiB  136 KiB  1.2 GiB  1.4 TiB  23.75  0.91  25  up  osd.37
38  hdd  1.81940  1.0  1.8 TiB  138 GiB  137 GiB   24 MiB  647 MiB  1.7 TiB   7.40  0.28  27  up  osd.38
39  hdd  1.81940  1.0  1.8 TiB  638 GiB  637 GiB  622 KiB  1.7 GiB  1.2 TiB  34.26  1.31  33  up  osd.39
40  hdd  1.81940  1.0  1.8 TiB  444 GiB  443 GiB   14 KiB  1.4 GiB  1.4 TiB  23.85  0.91  25  up  osd.40
41  hdd  1.81940  1.0  1.8 TiB  477 GiB  476 GiB  264 KiB  1.3 GiB  1.4 TiB  25.60  0.98  31  up  osd.41
42  hdd  1.81940  1.0  1.8 TiB  514 GiB  513 GiB   35 KiB  1.2 GiB  1.3 TiB  27.61  1.05  29  up  osd.42
43  hdd  1.81940  1.0  1.8 TiB  358 GiB  356 GiB  111 KiB  1.2 GiB  1.5 TiB  19.19  0.73  24  up  osd.43
                  TOTAL  80 TiB   21 TiB   21 TiB   32 MiB   69 GiB   59 TiB  26.23
MIN/MAX VAR: 0.12/2.36  STDDEV: 12.47

> The number of objects in flight looks small. Your objects seem to have
> an average size of 4MB and should recover with full bandwidth. Check
> with top how much IO wait percentage you have on the OSD hosts.

iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell.

> The one thing that jumps to my eye though is, that you only have 22
> dirty PGs and they are all recovering/backfilling already. I wonder if
> you have a problem with your crush rules, they might not do what you
> think they do. You said you increased the PG count for EC-22-Pool to 128
> (from what?) but it doesn't really look like a suitable number of PGs
> has been marked for backfilling. Can you post the output of "ceph osd
> pool get EC-22-Pool all"?

From 32 to 128.

ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 128
pgp_num: 48
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false

> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Curt
> Sent: 27 June 2022 19:41:06
> To: Robert Gallop
> Cc: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> I would love to see those types of speeds. I tried setting it all the
> way to 0 and nothing; I did that before I sent the first email, maybe it
> was your old post I got it from.
>
> osd_recovery_sleep_hdd  0.00  override (mon[0.00])
>
> On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gal...@gmail.com>
> wrote:
> I saw a major boost after having the sleep_hdd set to 0. Only after that
> did I start staying at around 50
[ceph-users] Re: Ceph recovery network speed
I would love to see those types of speeds. I tried setting it all the way
to 0 and nothing; I did that before I sent the first email, maybe it was
your old post I got it from.

osd_recovery_sleep_hdd  0.00  override (mon[0.00])

On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop wrote:

> I saw a major boost after having the sleep_hdd set to 0. Only after that
> did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to
> 2.5k obj/sec.
>
> Eventually it tapered back down, but for me sleep was the key, and
> specifically in my case:
>
> osd_recovery_sleep_hdd
>
> On Mon, Jun 27, 2022 at 11:17 AM Curt wrote:
>
>> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder wrote:
>>
>> > I think this is just how ceph is. Maybe you should post the output of
>> > "ceph status", "ceph osd pool stats" and "ceph df" so that we can get
>> > an idea whether what you look at is expected or not. As I wrote
>> > before, object recovery is throttled and the recovery bandwidth
>> > depends heavily on object size. The interesting question is, how many
>> > objects per second are recovered/rebalanced.
>>
>> data:
>>   pools:   11 pools, 369 pgs
>>   objects: 2.45M objects, 9.2 TiB
>>   usage:   20 TiB used, 60 TiB / 80 TiB avail
>>   pgs:     512136/9729081 objects misplaced (5.264%)
>>            343 active+clean
>>            22  active+remapped+backfilling
>>
>> io:
>>   client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>>   recovery: 34 MiB/s, 8 objects/s
>>
>> Pool 12 is the only one with any stats.
>>
>> pool EC-22-Pool id 12
>>   510048/9545052 objects misplaced (5.344%)
>>   recovery io 36 MiB/s, 9 objects/s
>>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>>
>> --- RAW STORAGE ---
>> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
>> hdd    80 TiB  60 TiB  20 TiB    20 TiB      25.45
>> TOTAL  80 TiB  60 TiB  20 TiB    20 TiB      25.45
>>
>> --- POOLS ---
>> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
>> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
>> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
>> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
>> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
>> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
>> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
>> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
>> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
>> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
>> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
>>
>> > Maybe provide the output of the first two commands for
>> > osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait
>> > a bit after setting these and then collect the output). Include the
>> > applied values for osd_max_backfills* and osd_recovery_max_active*
>> > for one of the OSDs in the pool (ceph config show osd.ID | grep -e
>> > osd_max_backfills -e osd_recovery_max_active).
>>
>> I didn't notice any speed difference with sleep values changed, but
>> I'll grab the stats between changes when I have a chance.
>>
>> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
>> osd_max_backfills            1000  override  mon[5]
>> osd_recovery_max_active      1000  override
>> osd_recovery_max_active_hdd  1000  override  mon[5]
>> osd_recovery_max_active_ssd  1000  override
>>
>> > I don't really know if on such a small cluster one can expect more
>> > than what you see. It has nothing to do with network speed
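Frank's "average size of 4MB" remark and the likely duration of this backfill both fall out of the `ceph status` numbers quoted above. A quick check (all constants copied from the thread; the ETA assumes the observed rate stays flat, which it rarely does exactly):

```python
# Sanity-check the recovery figures from "ceph status" in this thread.
TiB, MiB = 1024**4, 1024**2

stored = 9.2 * TiB                    # "objects: 2.45M objects, 9.2 TiB"
objects = 2.45e6
avg_obj_mib = stored / objects / MiB  # average object size -> ~3.9 MiB,
                                      # i.e. Frank's "about 4MB"

misplaced = 512136                    # objects misplaced, per ceph status
rate = 8                              # objects/s from the recovery io line
eta_h = misplaced / rate / 3600       # hours to drain at the observed rate

print(round(avg_obj_mib, 1), round(eta_h, 1))  # 3.9 17.8
```

So at 8 objects/s the 5.3% misplaced would take roughly 18 hours, consistent with the "backfilling for the last several hours" reports later in the thread.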
[ceph-users] Re: Ceph recovery network speed
On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder wrote:

> I think this is just how ceph is. Maybe you should post the output of
> "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an
> idea whether what you look at is expected or not. As I wrote before,
> object recovery is throttled and the recovery bandwidth depends heavily
> on object size. The interesting question is, how many objects per second
> are recovered/rebalanced.

data:
  pools:   11 pools, 369 pgs
  objects: 2.45M objects, 9.2 TiB
  usage:   20 TiB used, 60 TiB / 80 TiB avail
  pgs:     512136/9729081 objects misplaced (5.264%)
           343 active+clean
           22  active+remapped+backfilling

io:
  client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
  recovery: 34 MiB/s, 8 objects/s

Pool 12 is the only one with any stats.

pool EC-22-Pool id 12
  510048/9545052 objects misplaced (5.344%)
  recovery io 36 MiB/s, 9 objects/s
  client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr

--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
hdd    80 TiB  60 TiB  20 TiB    20 TiB      25.45
TOTAL  80 TiB  60 TiB  20 TiB    20 TiB      25.45

--- POOLS ---
POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
.rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB

> Maybe provide the output of the first two commands for
> osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a
> bit after setting these and then collect the output). Include the
> applied values for osd_max_backfills* and osd_recovery_max_active* for
> one of the OSDs in the pool (ceph config show osd.ID | grep -e
> osd_max_backfills -e osd_recovery_max_active).

I didn't notice any speed difference with sleep values changed, but I'll
grab the stats between changes when I have a chance.

ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
osd_max_backfills            1000  override  mon[5]
osd_recovery_max_active      1000  override
osd_recovery_max_active_hdd  1000  override  mon[5]
osd_recovery_max_active_ssd  1000  override

> I don't really know if on such a small cluster one can expect more than
> what you see. It has nothing to do with network speed if you have a 10G
> line. However, recovery is something completely different from a full
> link-speed copy.
>
> I can tell you that boatloads of tiny objects are a huge pain for
> recovery, even on SSD. Ceph doesn't raid up sections of disks against
> each other, but object for object. This might be a feature request: that
> PG space allocation and recovery should follow the model of LVM extents
> (ideally match with LVM extents) to allow recovery/rebalancing larger
> chunks of storage in one go, containing parts of a large or many small
> objects.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Curt
> Sent: 27 June 2022 17:35:19
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> Hello,
>
> I had already increased/changed those variables previously. I increased
> the pg_num to 128, which increased the number of PGs backfilling, but
> speed is still only at 30 MiB/s avg and has been backfilling 23 PGs for
> the last several hours. Should I increase it higher than 128?
>
> I'm still trying to figure out if this is just how ceph is or if there
> is a bottleneck somewhere. Like if I sftp a 10G file between servers
> it's done in a couple minutes or less. Am I thinking of this wrong?
>
> Thanks,
> Curt
>
> On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder fr...@dtu.dk wrote:
> Hi Curt,
>
> as far as I understood, a 2+2
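The `ceph df` output above also shows why 128 PGs is still coarse for this pool. A quick calculation (figures copied from the EC-22-Pool row; "4 shards" follows from the 2+2 profile):

```python
# How much data each EC-22-Pool PG carries at pg_num=128, from the
# "ceph df" figures in this message. EC 2+2 writes 4 shards per PG,
# for 2x raw overhead.
TiB, GiB = 1024**4, 1024**3

stored = 9.8 * TiB                    # STORED for EC-22-Pool
raw = 20 * TiB                        # USED (raw) for EC-22-Pool
pgs = 128

per_pg_stored = stored / pgs / GiB    # logical data per PG
per_shard = raw / pgs / 4 / GiB       # raw bytes one OSD holds per PG

print(round(per_pg_stored), round(per_shard))  # 78 40
```

Roughly 78 GiB of logical data per PG, i.e. about 40 GiB written on each of the 4 participating OSDs; each PG moved during backfill is therefore a large unit of work, which is why a higher PG count recovers in finer, more parallelizable pieces.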
[ceph-users] Re: Ceph recovery network speed
Hello,

I had already increased/changed those variables previously. I increased
the pg_num to 128, which increased the number of PGs backfilling, but
speed is still only at 30 MiB/s avg and has been backfilling 23 PGs for
the last several hours. Should I increase it higher than 128?

I'm still trying to figure out if this is just how ceph is or if there is
a bottleneck somewhere. Like if I sftp a 10G file between servers it's
done in a couple minutes or less. Am I thinking of this wrong?

Thanks,
Curt

On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder wrote:

> Hi Curt,
>
> as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD
> per host busy. My experience is that the algorithm for selecting PGs to
> backfill/recover is not very smart. It could simply be that it doesn't
> find more PGs without violating some of these settings:
>
> osd_max_backfills
> osd_recovery_max_active
>
> I have never observed the second parameter to change anything (try
> anyway). However, the first one has a large impact. You could try
> increasing this slowly until recovery moves faster. Another parameter
> you might want to try is
>
> osd_recovery_sleep_[hdd|ssd]
>
> Be careful, as this will impact client IO. I could reduce the sleep for
> my HDDs to 0.05. With your workload pattern, this might be something you
> can tune as well.
>
> Having said that, I think you should increase your PG count on the EC
> pool as soon as the cluster is healthy. You have only about 20 PGs per
> OSD, and large PGs will take unnecessarily long to recover. A higher PG
> count will also make it easier for the scheduler to find PGs for
> recovery/backfill. Aim for a number between 100 and 200. Give the
> pool(s) with the most data (#objects) the most PGs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Curt
> Sent: 24 June 2022 19:04
> To: Anthony D'Atri; ceph-users@ceph.io
> Subject: [ceph-users] Re: Ceph recovery network speed
>
> 2 PGs shouldn't take hours to backfill in my opinion. Just 2TB
> enterprise HDs.
>
> Take this log entry below: 72 minutes and still backfilling undersized?
> Should it be that slow?
>
> pg 12.15 is stuck undersized for 72m, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [34,10,29,NONE]
>
> Thanks,
> Curt
>
> On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri wrote:
>
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 pg
> > > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network.
> > > I have tested the speed between machines with a few tools and all
> > > confirm 10G speed. I've tried changing various settings of priority
> > > and recovery sleep hdd, but still the same. Is this a configuration
> > > issue or something else?
> > >
> > > It's just a small cluster right now with 4 hosts, 11 OSDs per.
> > > Please let me know if you need more information.
> > >
> > > Thanks,
> > > Curt

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
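The effect of osd_recovery_sleep_hdd that Frank describes can be put into numbers. A sketch, assuming 0.1 s as the sleep value (a common HDD default; check your own cluster) and the ~4 MiB average object size seen elsewhere in this thread; it ignores actual disk service time, so real throughput lands below this ceiling:

```python
# Why a recovery sleep throttles backfill: each recovery op on an OSD is
# followed by the sleep, capping ops/s no matter how fast disk or network
# are. Values here are illustrative, taken from this thread.
sleep_s = 0.1                       # assumed osd_recovery_sleep_hdd
avg_obj_mib = 4.0                   # ~average object size in this cluster

ops_per_s = 1 / sleep_s             # at most ~10 recovery ops/s per OSD
ceiling_mib = ops_per_s * avg_obj_mib  # ~40 MiB/s ceiling per recovering OSD

print(ops_per_s, ceiling_mib)
```

With only a couple of PGs (hence a couple of OSDs) sending data, a ~40 MiB/s per-OSD ceiling lines up with the ~30 MiB/s aggregate Curt reports, which is why lowering the sleep and raising osd_max_backfills are the first knobs to try.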
[ceph-users] Re: Ceph recovery network speed
On Sat, Jun 25, 2022 at 3:27 AM Anthony D'Atri wrote:

> The pg_autoscaler aims IMHO way too low and I advise turning it off.
>
> > On Jun 24, 2022, at 11:11 AM, Curt wrote:
> >
> > >> You wrote 2TB before, are they 2TB or 18TB? Is that 273 PGs total
> > >> or per osd?
> >
> > Sorry, 18TB of data and 273 PGs total.
> >
> > >> `ceph osd df` will show you toward the right how many PGs are on
> > >> each OSD. If you have multiple pools, some PGs will have more data
> > >> than others. So take an average # of PGs per OSD and divide the
> > >> actual HDD capacity by that.
> >
> > 20 pg on avg / 2TB (technically 1.8 I guess) which would be 10.
>
> I'm confused. Is 20 what `ceph osd df` is reporting? Send me the output
> of

Yes, 20 would be the avg pg count.

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 1  hdd    1.81940  1.0   1.8 TiB  748 GiB   746 GiB  207 KiB  1.7 GiB  1.1 TiB  40.16  1.68   21  up
 3  hdd    1.81940  1.0   1.8 TiB  459 GiB   457 GiB    3 KiB  1.2 GiB  1.4 TiB  24.61  1.03   20  up
 5  hdd    1.81940  1.0   1.8 TiB  153 GiB   152 GiB   32 KiB  472 MiB  1.7 TiB   8.20  0.34   15  up
 7  hdd    1.81940  1.0   1.8 TiB  471 GiB   470 GiB   83 KiB  1.0 GiB  1.4 TiB  25.27  1.06   24  up
 9  hdd    1.81940  1.0   1.8 TiB  1.0 TiB  1022 GiB  136 KiB  2.4 GiB  838 GiB  54.99  2.30   19  up
11  hdd    1.81940  1.0   1.8 TiB  443 GiB   441 GiB    4 KiB  1.1 GiB  1.4 TiB  23.76  0.99   20  up
13  hdd    1.81940  1.0   1.8 TiB  438 GiB   437 GiB  310 KiB  1.0 GiB  1.4 TiB  23.50  0.98   18  up
15  hdd    1.81940  1.0   1.8 TiB  334 GiB   333 GiB  621 KiB  929 MiB  1.5 TiB  17.92  0.75   15  up
17  hdd    1.81940  1.0   1.8 TiB  310 GiB   309 GiB    2 KiB  807 MiB  1.5 TiB  16.64  0.70   20  up
19  hdd    1.81940  1.0   1.8 TiB  433 GiB   432 GiB    7 KiB  974 MiB  1.4 TiB  23.23  0.97   25  up
45  hdd    1.81940  1.0   1.8 TiB  169 GiB   169 GiB    2 KiB  615 MiB  1.7 TiB   9.09  0.38   18  up
 0  hdd    1.81940  1.0   1.8 TiB  582 GiB   580 GiB  295 KiB  1.7 GiB  1.3 TiB  31.24  1.31   21  up
 2  hdd    1.81940  1.0   1.8 TiB  870 MiB    21 MiB  112 KiB  849 MiB  1.8 TiB   0.05  0.00   14  up
 4  hdd    1.81940  1.0   1.8 TiB  326 GiB   325 GiB   14 KiB  947 MiB  1.5 TiB  17.48  0.73   24  up
 6  hdd    1.81940  1.0   1.8 TiB  450 GiB   448 GiB    1 KiB  1.4 GiB  1.4 TiB  24.13  1.01   17  up
 8  hdd    1.81940  1.0   1.8 TiB  152 GiB   152 GiB  618 KiB  900 MiB  1.7 TiB   8.18  0.34   20  up
10  hdd    1.81940  1.0   1.8 TiB  609 GiB   607 GiB    4 KiB  1.7 GiB  1.2 TiB  32.67  1.37   25  up
12  hdd    1.81940  1.0   1.8 TiB  333 GiB   332 GiB  175 KiB  1.5 GiB  1.5 TiB  17.89  0.75   24  up
14  hdd    1.81940  1.0   1.8 TiB  1.0 TiB   1.0 TiB    1 KiB  2.2 GiB  834 GiB  55.24  2.31   17  up
16  hdd    1.81940  1.0   1.8 TiB  168 GiB   167 GiB    4 KiB  1.2 GiB  1.7 TiB   9.03  0.38   15  up
18  hdd    1.81940  1.0   1.8 TiB  299 GiB   298 GiB  261 KiB  1.6 GiB  1.5 TiB  16.07  0.67   15  up
32  hdd    1.81940  1.0   1.8 TiB  873 GiB   871 GiB   45 KiB  2.3 GiB  990 GiB  46.88  1.96   18  up
22  hdd    1.81940  1.0   1.8 TiB  449 GiB   447 GiB  139 KiB  1.6 GiB  1.4 TiB  24.10  1.01   22  up
23  hdd    1.81940  1.0   1.8 TiB  299 GiB   298 GiB    5 KiB  1.6 GiB  1.5 TiB  16.06  0.67   20  up
24  hdd    1.81940  1.0   1.8 TiB  887 GiB   885 GiB    8 KiB  2.4 GiB  976 GiB  47.62  1.99   23  up
25  hdd    1.81940  1.0   1.8 TiB  451 GiB   449 GiB    4 KiB  1.6 GiB  1.4 TiB  24.20  1.01   17  up
26  hdd    1.81940  1.0   1.8 TiB  602 GiB   600 GiB  373 KiB  2.0 GiB  1.2 TiB  32.29  1.35   21  up
27  hdd    1.81940  1.0   1.8 TiB  152 GiB   151 GiB  1.5 MiB  564 MiB  1.7 TiB   8.14  0.34   14  up
28  hdd    1.81940  1.0   1.8 TiB  330 GiB   328 GiB    7 KiB  1.6 GiB  1.5 TiB  17.70  0.74   12  up
29  hdd    1.81940  1.0   1.8 TiB  726 GiB   723 GiB    7 KiB  2.1 GiB  1.1 TiB  38.94  1.63   16  up
30  hdd    1.81940  1.0   1.8 TiB  596 GiB   594 GiB  173 KiB  2.0 GiB  1.2 TiB  32.01  1.34   19  up
31  hdd    1.81940  1.0   1.8 TiB  304 GiB   303 GiB    4 KiB  1.6 GiB  1.5 TiB  16.34  0.68   20  up
44  hdd    1.81940  1.0   1.8 TiB  150 GiB   149 GiB      0 B  599 MiB  1.7 TiB   8.03  0.34   12  up
33  hdd    1.81940  1.0   1.8 TiB  451 GiB   449 GiB  462 KiB  1.8 GiB  1.4 TiB  24.22  1.01   19  up
34  hdd    1.81940  1.0   1.8 TiB  449 GiB   448 GiB    2 KiB  966 MiB  1.4 TiB  24.12  1.01   21  up
35  hdd    1.81940  1.0   1.8 TiB  458 GiB   457 GiB    2 KiB  1.5 GiB  1.4 TiB  24.60  1.03   23  up
36  hdd    1.81940  1.0   1.8 TiB  872 GiB   870 GiB    3 KiB  2.4 Gi
[ceph-users] Re: Ceph recovery network speed
Nope, majority of read/writes happen at night, so it's doing less than 1
MiB/s client io right now, sometimes 0.

On Fri, Jun 24, 2022, 22:23 Stefan Kooman wrote:

> On 6/24/22 20:09, Curt wrote:
> >
> > On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman <ste...@bit.nl> wrote:
> >
> >     On 6/24/22 19:49, Curt wrote:
> >     > Pool 12 is my erasure coding pool, 2+2. How can I tell if it's
> >     > objects or keys recovering?
> >
> >     ceph -s will tell you what type of recovery is going on.
> >
> >     Is it a cephfs metadata pool? Or a rgw index pool?
> >
> >     Gr. Stefan
> >
> > object recovery. I guess I'm used to it always showing object, so
> > didn't know it could be key.
> >
> > rbd pool.
>
> recovery has lower priority than client IO. Is the cluster busy?
>
> Gr. Stefan
[ceph-users] Re: Ceph recovery network speed
> You wrote 2TB before, are they 2TB or 18TB? Is that 273 PGs total or
> per osd?

Sorry, 18TB of data and 273 PGs total.

> `ceph osd df` will show you toward the right how many PGs are on each
> OSD. If you have multiple pools, some PGs will have more data than
> others. So take an average # of PGs per OSD and divide the actual HDD
> capacity by that.

20 pg on avg / 2TB (technically 1.8 I guess) which would be 10.

Shouldn't that be used space though, not capacity? My usage is only 23%
of capacity. I thought ceph autoscaling PGs changed the size dynamically
according to usage? I'm guessing I'm misunderstanding that part?

Thanks,
Curt

On Fri, Jun 24, 2022 at 9:48 PM Anthony D'Atri wrote:

> > Yes, SATA, I think my benchmark put it around 125, but that was a
> > year ago, so could be misremembering.
>
> A FIO benchmark, especially a sequential one on an empty drive, can
> mislead as to the real-world performance one sees on a fragmented drive.
>
> > 273 pg at 18TB so each PG would be 60G.
>
> You wrote 2TB before, are they 2TB or 18TB? Is that 273 PGs total or
> per osd?
>
> > Mainly used for RBD, using erasure coding. cephadm bootstrap with
> > docker images.
>
> Ack. Have to account for replication.
>
> `ceph osd df` will show you toward the right how many PGs are on each
> OSD. If you have multiple pools, some PGs will have more data than
> others. So take an average # of PGs per OSD and divide the actual HDD
> capacity by that.
>
> On Fri, Jun 24, 2022 at 9:21 PM Anthony D'Atri wrote:
>
> > > 2 PGs shouldn't take hours to backfill in my opinion. Just 2TB
> > > enterprise HDs.
> >
> > SATA? Figure they can write at 70 MB/s.
> >
> > How big are your PGs? What is your cluster used for? RBD? RGW? CephFS?
> >
> > > Take this log entry below: 72 minutes and still backfilling
> > > undersized? Should it be that slow?
> > >
> > > pg 12.15 is stuck undersized for 72m, current state
> > > active+undersized+degraded+remapped+backfilling, last acting
> > > [34,10,29,NONE]
> > >
> > > Thanks,
> > > Curt
> >
> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri wrote:
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 pg
> > > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network.
> > > I have tested the speed between machines with a few tools and all
> > > confirm 10G speed. I've tried changing various settings of priority
> > > and recovery sleep hdd, but still the same. Is this a configuration
> > > issue or something else?
> > >
> > > It's just a small cluster right now with 4 hosts, 11 OSDs per.
> > > Please let me know if you need more information.
> > >
> > > Thanks,
> > > Curt
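Anthony's figures can be carried one step further to estimate how long a single PG backfill should take at best. A sketch using the thread's numbers (18 TB across 273 PGs, ~70 MB/s sequential HDD write; decimal units since the thread wrote "18TB", and real backfills are slower due to seeks and throttles):

```python
# Worked version of the "273 pg at 18TB so each PG would be 60G" estimate,
# plus the best-case time to backfill one PG onto a 70 MB/s SATA HDD.
data_b = 18e12                  # total data, bytes (18 TB, decimal)
pgs = 273                       # total PGs in the cluster
hdd_wr = 70e6                   # bytes/s, Anthony's SATA write figure

pg_bytes = data_b / pgs         # ~66 GB per PG
minutes = pg_bytes / hdd_wr / 60  # ~16 min to write one PG, best case

print(round(pg_bytes / 1e9), round(minutes))  # 66 16
```

So even under ideal conditions one PG is a quarter-hour of pure writing; the 72-minute "stuck undersized" entry quoted above is slow but not absurdly far off once recovery sleep and competing PGs on the same disks are factored in.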
[ceph-users] Re: Ceph recovery network speed
On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman wrote:

> On 6/24/22 19:49, Curt wrote:
> > Pool 12 is my erasure coding pool, 2+2. How can I tell if it's
> > objects or keys recovering?
>
> ceph -s will tell you what type of recovery is going on.
>
> Is it a cephfs metadata pool? Or a rgw index pool?
>
> Gr. Stefan

object recovery. I guess I'm used to it always showing object, so didn't
know it could be key.

rbd pool.
[ceph-users] Re: Ceph recovery network speed
Pool 12 is my erasure coding pool, 2+2. How can I tell if it's objects or
keys recovering?

Thanks,
Curt

On Fri, Jun 24, 2022 at 9:39 PM Stefan Kooman wrote:

> On 6/24/22 19:04, Curt wrote:
> > 2 PGs shouldn't take hours to backfill in my opinion. Just 2TB
> > enterprise HDs.
> >
> > Take this log entry below: 72 minutes and still backfilling
> > undersized? Should it be that slow?
> >
> > pg 12.15 is stuck undersized for 72m, current state
> > active+undersized+degraded+remapped+backfilling, last acting
> > [34,10,29,NONE]
>
> What is in that pool 12? Is it objects that are recovering, or keys?
> OMAP data (keys) is slow.
>
> Gr. Stefan
[ceph-users] Re: Ceph recovery network speed
2 PGs shouldn't take hours to backfill in my opinion. Just 2TB enterprise
HDs.

Take this log entry below: 72 minutes and still backfilling undersized?
Should it be that slow?

pg 12.15 is stuck undersized for 72m, current state
active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]

Thanks,
Curt

On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri wrote:

> Your recovery is slow *because* there are only 2 PGs backfilling.
>
> What kind of OSD media are you using?
>
> > On Jun 24, 2022, at 09:46, Curt wrote:
> >
> > Hello,
> >
> > I'm trying to understand why my recovery is so slow with only 2 pg
> > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
> > have tested the speed between machines with a few tools and all
> > confirm 10G speed. I've tried changing various settings of priority
> > and recovery sleep hdd, but still the same. Is this a configuration
> > issue or something else?
> >
> > It's just a small cluster right now with 4 hosts, 11 OSDs per. Please
> > let me know if you need more information.
> >
> > Thanks,
> > Curt
[ceph-users] Ceph recovery network speed
Hello,

I'm trying to understand why my recovery is so slow with only 2 PGs
backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
have tested the speed between machines with a few tools and all confirm
10G speed. I've tried changing various settings of priority and recovery
sleep hdd, but still the same. Is this a configuration issue or something
else?

It's just a small cluster right now with 4 hosts, 11 OSDs per. Please let
me know if you need more information.

Thanks,
Curt
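One quick way to see that the network is not the limiting factor here, as the rest of the thread goes on to conclude: compare the observed recovery rate against the link capacity. A sketch with the numbers from this message:

```python
# 3-4 MiB/s of recovery on a 10 Gbit/s link: compute the link utilization
# to show the bottleneck must be elsewhere (recovery throttles, HDD seeks).
link_bps = 10e9 / 8            # 10 Gbit/s = 1.25 GB/s
observed = 4 * 1024**2         # 4 MiB/s, the upper end reported

utilization_pct = observed / link_bps * 100   # fraction of link in use

print(round(utilization_pct, 2))  # 0.34
```

At roughly a third of one percent of the link, even an order-of-magnitude recovery speedup would leave the network nearly idle, which is why the later replies focus on PG counts and the recovery sleep/backfill throttles instead.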