[ceph-users] Too few PGs per OSD (autoscaler)
Hello, Ceph users,

TL;DR: the PG autoscaler should not cause the "too few PGs per OSD" warning.

Detailed: Some time ago, I upgraded the HW in my virtualization+Ceph cluster, replacing 30+ old servers with <10 modern servers. I immediately got the "too many PGs per OSD" warning, so I had to add more OSDs, even though I did not need the space at that time. So I eagerly waited for the PG autoscaling feature in Nautilus.

Yesterday I upgraded to Nautilus and enabled the autoscaler on my RBD pool. First I got the "objects per pg (XX) is more than XX times cluster average" warning for several hours, which was later replaced with "too few PGs per OSD". I will have to set the minimum number of PGs per pool, but anyway, I think the autoscaler should not be this aggressive, and should not reduce the number of PGs below the PGs-per-OSD limit.

(That said, the ability to reduce the number of PGs in a pool in Nautilus works well for me, thanks for it!)

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
laryross> As far as stealing... we call it sharing here. --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
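For intuition on where both warnings come from, the thresholds are simple ratios of PG replicas to OSDs. A minimal sketch, assuming the usual default warning floor of mon_pg_warn_min_per_osd = 30 (verify the value on your cluster); the pool and OSD counts below are illustration values, not taken from this cluster:

```python
# Sketch: the PG-replicas-per-OSD ratio that the "too few PGs per OSD"
# warning checks. The floor of 30 is the usual mon_pg_warn_min_per_osd
# default; pool sizes here are made-up illustration values.

def pgs_per_osd(pools, num_osds):
    """pools: iterable of (pg_num, replica_count) pairs."""
    return sum(pg_num * size for pg_num, size in pools) / num_osds

# e.g. a single RBD pool autoscaled down to 256 PGs, 3x replication, 44 OSDs:
ratio = pgs_per_osd([(256, 3)], 44)
print(round(ratio, 1))   # 17.5 -- below the default floor of 30
print(ratio < 30)        # True -> HEALTH_WARN "too few PGs per OSD"
```

In Nautilus the per-pool floor can be pinned with "ceph osd pool set <pool> pg_num_min <n>", so the autoscaler will not shrink the pool below it.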
Re: [ceph-users] How do you deal with "clock skew detected"?
Konstantin Shalygin wrote:
: > how do you deal with the "clock skew detected" HEALTH_WARN message?
: >
: > I think the internal RTC in most x86 servers has only 1-second resolution,
: > but the Ceph skew limit is much smaller than that. So every time I reboot
: > one of my mons (for a kernel upgrade or something), I have to wait several
: > minutes for the system clock to synchronize over NTP, even though ntpd
: > had been running before the reboot and was started again during boot.
:
: Definitely you should use chrony with iburst.

OK, many responses (thanks for them!) suggest chrony, so I tried it: with all three mons running chrony and in sync with my NTP server with offsets under 0.0001 second, I rebooted one of the mons. There still was the HEALTH_WARN clock_skew message as soon as the rebooted mon started responding to ping. The cluster returned to HEALTH_OK about 95 seconds later.

According to "ntpdate -q my.ntp.server", the initial offset after reboot is about 0.6 s (which is the reason for the HEALTH_WARN, I think), but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds of HEALTH_WARN are inside Ceph, with the mons already synchronized.

So the result is that chrony indeed synchronizes faster, but I nevertheless still have about 95 seconds of HEALTH_WARN "clock skew detected". I guess the workaround for now is to ignore the warning and wait for two minutes before rebooting another mon.

-Yenya
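For reference, a minimal /etc/chrony.conf sketch for this situation; "my.ntp.server" is the placeholder name from the message above. The makestep directive tells chronyd to step the clock (rather than slowly slew it) when the offset is large, which should shorten that ~25-second convergence window after reboot:

```conf
# Poll the local NTP server, with a burst of packets at startup (iburst)
server my.ntp.server iburst

# Step the clock immediately if the offset exceeds 0.5 s, but only
# during the first 3 updates after chronyd starts (covers the ~0.6 s
# post-reboot offset described above)
makestep 0.5 3
```

The remaining window is Ceph-internal: the mons only re-run their time checks periodically, so the warning can outlive the actual skew by a minute or so.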
[ceph-users] Huge rebalance after rebooting OSD host (Mimic)
Hello, Ceph users,

I wanted to install the recent kernel update on my OSD hosts running CentOS 7, Ceph 13.2.5 Mimic. So I set the noout flag and ran "yum -y update" on the first OSD host. This host has 8 bluestore OSDs with data on HDDs and databases on LVs of two SSDs (each SSD has 4 LVs for OSD metadata). Everything went OK, so I rebooted this host.

After the OSD host went back online, the cluster went from HEALTH_WARN (noout flag set) to HEALTH_ERR, and started to rebalance itself, with reportedly almost 60 % of objects misplaced, and some of them degraded. And, of course, backfill_toofull:

  cluster:
    health: HEALTH_ERR
            2300616/3975384 objects misplaced (57.872%)
            Degraded data redundancy: 74263/3975384 objects degraded (1.868%), 146 pgs degraded, 122 pgs undersized
            Degraded data redundancy (low space): 44 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum stratus1,stratus2,stratus3
    mgr: stratus3(active), standbys: stratus1, stratus2
    osd: 44 osds: 44 up, 44 in; 2022 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   9 pools, 3360 pgs
    objects: 1.33 M objects, 5.0 TiB
    usage:   25 TiB used, 465 TiB / 490 TiB avail
    pgs:     74263/3975384 objects degraded (1.868%)
             2300616/3975384 objects misplaced (57.872%)
             1739 active+remapped+backfill_wait
             1329 active+clean
             102  active+recovery_wait+remapped
             76   active+undersized+degraded+remapped+backfill_wait
             31   active+remapped+backfill_wait+backfill_toofull
             30   active+recovery_wait+undersized+degraded+remapped
             21   active+recovery_wait+degraded+remapped
             8    active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             6    active+recovery_wait+degraded
             4    active+remapped+backfill_toofull
             3    active+recovery_wait+undersized+degraded
             3    active+remapped+backfilling
             2    active+recovery_wait
             2    active+recovering+undersized
             1    active+clean+remapped
             1    active+undersized+degraded+remapped+backfill_toofull
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+remapped

  io:
    client:   681 B/s rd, 1013 KiB/s wr, 0 op/s rd, 32 op/s wr
    recovery: 142 MiB/s, 93 objects/s

(Note that I cleared the noout flag afterwards.)

What is wrong with it? Why did the cluster decide to rebalance itself? I am keeping the rest of the OSD hosts unrebooted for now.

Thanks,

-Yenya
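As a sanity check, the percentages in the status output above are plain ratios of object copies; recomputing them from the counts in the health message:

```python
# Recompute the percentages from the "ceph -s" output above: they are
# ratios of (mis)placed object copies to total object copies.
total = 3975384
misplaced = 2300616 / total
degraded = 74263 / total
print(f"{misplaced:.3%}")  # 57.872% -- matches the health message
print(f"{degraded:.3%}")   # 1.868%
```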
[ceph-users] How do you deal with "clock skew detected"?
Hello, Ceph users,

how do you deal with the "clock skew detected" HEALTH_WARN message?

I think the internal RTC in most x86 servers has only 1-second resolution, but the Ceph skew limit is much smaller than that. So every time I reboot one of my mons (for a kernel upgrade or something), I have to wait several minutes for the system clock to synchronize over NTP, even though ntpd had been running before the reboot and was started again during boot.

Thanks,

-Yenya
Re: [ceph-users] Radosgw object size limit?
Hello, thanks for your help.

Casey Bodley wrote:
: It looks like the default.rgw.buckets.non-ec pool is missing, which
: is where we track in-progress multipart uploads. So I'm guessing
: that your perl client is not doing a multipart upload, where s3cmd
: does by default.
:
: I'd recommend debugging this by trying to create the pool manually -
: the only requirement for this pool is that it not be erasure coded.
: See the docs for your ceph release for more information:
:
: http://docs.ceph.com/docs/luminous/rados/operations/pools/#create-a-pool
: http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/

I use Mimic, FWIW. I created the pool in question manually:

# ceph osd pool create default.rgw.buckets.non-ec 32
pool 'default.rgw.buckets.non-ec' created
#

and it finished without any error. Now I can do multipart uploads using s3cmd. What could the problem have been? Maybe the radosgw cephx user does not have sufficient rights to create a pool? "ceph auth ls" shows the following keys:

client.bootstrap-rgw
        key: ...
        caps: [mgr] allow r
        caps: [mon] allow profile bootstrap-rgw
client.rgw.myrgwhost
        key: ...
        caps: [mon] allow rw
        caps: [osd] allow rwx

Is this correct? Thank you very much!

-Yenya
Re: [ceph-users] Radosgw object size limit?
Hello Casey (and the ceph-users list),

I am returning to my older problem to which you replied:

Casey Bodley wrote:
: There is a rgw_max_put_size which defaults to 5G, which limits the
: size of a single PUT request. But in that case, the http response
: would be 400 EntityTooLarge. For multipart uploads, there's also a
: rgw_multipart_part_upload_limit that defaults to 10000 parts, which
: would cause a 416 InvalidRange error. By default though, s3cmd does
: multipart uploads with 15MB parts, so your 11G object should only
: have ~750 parts.
:
: Are you able to upload smaller objects successfully? These
: InvalidRange errors can also result from failures to create any
: rados pools that didn't exist already. If that's what you're
: hitting, you'd get the same InvalidRange errors for smaller object
: uploads, and you'd also see messages like this in your radosgw log:
:
: > rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34)
: > Numerical result out of range (this can be due to a pool or
: > placement group misconfiguration, e.g. pg_num < pgp_num or
: > mon_max_pg_per_osd exceeded)

You are right. Now how do I find out which pool it is, and what the reason is?

Anyway, if I try to upload a CentOS 7 ISO image using the Perl module Net::Amazon::S3, it works. I do something like this there:

my $bucket = $s3->add_bucket({
    bucket    => 'testbucket',
    acl_short => 'private',
});
$bucket->add_key_filename("testdir/$dst", $file,
    { content_type => 'application/octet-stream' })
    or die $s3->err . ': ' . $s3->errstr;

and I see the following in /var/log/ceph/ceph-client.rgwlog:

2019-05-10 15:55:28.394 7f4b859b8700  1 civetweb: 0x558108506000: 127.0.0.1 - - [10/May/2019:15:53:50 +0200] "PUT /testbucket/testdir/CentOS-7-x86_64-Everything-1810.iso HTTP/1.1" 200 234 - libwww-perl/6.38

I can see the uploaded object using "s3cmd ls", and I can download it back using "s3cmd get", with matching sha1sum.
When I do the same using "s3cmd put" instead of the Perl module, I indeed get the pool create failure:

2019-05-10 15:53:14.914 7f4b859b8700  1 ====== starting new request req=0x7f4b859af850 ======
2019-05-10 15:53:15.492 7f4b859b8700  0 rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34) Numerical result out of range (this can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num or mon_max_pg_per_osd exceeded)
2019-05-10 15:53:15.492 7f4b859b8700  1 ====== req done req=0x7f4b859af850 op status=-34 http_status=416 ======
2019-05-10 15:53:15.492 7f4b859b8700  1 civetweb: 0x558108506000: 127.0.0.1 - - [10/May/2019:15:53:14 +0200] "POST /testbucket/testdir/c7.iso?uploads HTTP/1.0" 416 469 - -

So maybe the Perl module is configured differently? But which pool or other parameter is the problem? I have the following pools:

# ceph osd pool ls
one
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

(the "one" pool is unrelated to RadosGW; it contains OpenNebula RBD images).

Thanks,

-Yenya

: On 3/7/19 12:21 PM, Jan Kasprzak wrote:
: > Hello, Ceph users,
: >
: > does radosgw have an upper limit on object size? I tried to upload
: > an 11GB file using s3cmd, but it failed with an InvalidRange error:
: >
: > $ s3cmd put --verbose centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso s3://mybucket/
: > INFO: No cache file found, creating it.
: > INFO: Compiling list of local files...
: > INFO: Running stat() and reading/calculating MD5 values on 1 files, this may take some time...
: > INFO: Summary: 1 local files to upload
: > WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner username not known. Storing UID=108 instead.
: > WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner groupname not known. Storing GID=108 instead.
: > ERROR: S3 error: 416 (InvalidRange)
: >
: > $ ls -lh centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
: > -rw-r--r--. 1 108 108 11G Nov 26 15:28 centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
: >
: > Thanks for any hint how to increase the limit.
: >
: > -Yenya
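A quick check of the arithmetic in Casey's reply -- the part count for the ISO at s3cmd's default 15 MB multipart chunk size, against the usual 10000-part default of rgw_multipart_part_upload_limit (verify the limit on your cluster):

```python
# Why the 11 GiB upload is nowhere near the multipart part limit:
# s3cmd splits uploads into 15 MB parts by default.
object_size = 11 * 1024**3            # ~11 GiB ISO image
part_size = 15 * 1024**2              # s3cmd default --multipart-chunk-size-mb=15
parts = -(-object_size // part_size)  # ceiling division
print(parts)  # 751 -- the "~750 parts", far below the 10000-part limit
```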
[ceph-users] Radosgw object size limit?
Hello, Ceph users,

does radosgw have an upper limit on object size? I tried to upload an 11GB file using s3cmd, but it failed with an InvalidRange error:

$ s3cmd put --verbose centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso s3://mybucket/
INFO: No cache file found, creating it.
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 1 files, this may take some time...
INFO: Summary: 1 local files to upload
WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner username not known. Storing UID=108 instead.
WARNING: CentOS-7-x86_64-Everything-1810.iso: Owner groupname not known. Storing GID=108 instead.
ERROR: S3 error: 416 (InvalidRange)

$ ls -lh centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso
-rw-r--r--. 1 108 108 11G Nov 26 15:28 centos/7/isos/x86_64/CentOS-7-x86_64-Everything-1810.iso

Thanks for any hint how to increase the limit.

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
This is the world we live in: the way to deal with computers is to google
the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
Re: [ceph-users] RBD image format v1 EOL ...
Hello,

Jason Dillaman wrote:
: For the future Ceph Octopus release, I would like to remove all
: remaining support for RBD image format v1 images barring any
: substantial pushback.
:
: The image format for new images has been defaulted to the v2 image
: format since Infernalis, the v1 format was officially deprecated in
: Jewel, and creation of new v1 images was prohibited starting with
: Mimic.
:
: The forthcoming Nautilus release will add a new image migration
: feature to help provide a low-impact conversion path forward for any
: legacy images in a cluster. The ability to migrate existing images off
: the v1 image format was the last known pain point that was highlighted
: the previous time I suggested removing support.
:
: Please let me know if anyone has any major objections or concerns.

If I read the parallel thread about pool migration in ceph-users@ correctly, migrating to v2 would still require stopping the client before "rbd migration prepare" can be executed. On my OpenNebula/Ceph cluster, I still have several dozen images in v1 format, so it would be moderately painful to figure out which VMs are using them, how availability-critical they are, and finally to migrate the images. But whatever, I guess I can cope with it :-)

-Yenya
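For anyone in the same situation, a sketch of finding the remaining v1 images and using the Nautilus migration feature. Pool and image names here are placeholders, and "rbd migration prepare" does require the source image to be closed by all clients first (hence the VM-shutdown pain mentioned above):

```shell
# List images still in format 1 in a pool ("rbd info" reports "format: 1"):
for img in $(rbd ls mypool); do
    rbd info "mypool/$img" | grep -q 'format: 1' && echo "$img"
done

# Nautilus live migration of one image (source must not be in use):
rbd migration prepare mypool/oldimage mypool/newimage
rbd migration execute mypool/newimage   # copy the data in the background
rbd migration commit  mypool/newimage   # finalize and remove the source
```

Between prepare and commit, clients can already open the new image while the data is copied in the background.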
Re: [ceph-users] Bluestore increased disk usage
Jakub Jaszewski wrote:
: Hi Yenya,
:
: I guess Ceph adds the size of all your data.db devices to the cluster
: total used space.

Jakub, thanks for the hint. The disk usage increase almost corresponds to that - I have added about 7.5 TB of data.db devices with the last batch of OSDs.

Sincerely,

-Yenya

: On Fri, 8 Feb 2019 at 10:11, Jan Kasprzak wrote:
:
: > Hello, ceph users,
: >
: > I moved my cluster to bluestore (Ceph Mimic), and now I see increased
: > disk usage. From ceph -s:
: >
: >   pools:   8 pools, 3328 pgs
: >   objects: 1.23 M objects, 4.6 TiB
: >   usage:   23 TiB used, 444 TiB / 467 TiB avail
: >
: > I use 3-way replication of my data, so I would expect the disk usage
: > to be around 14 TiB, which was true when I used filestore-based Luminous
: > OSDs before. Why is the disk usage now 23 TiB?
: >
: > If I remember correctly (a big if!), the disk usage was about the same
: > when I originally moved the data to empty bluestore OSDs by changing the
: > crush rule, but went up after I added more bluestore OSDs and the cluster
: > rebalanced itself.
: >
: > Could it be some miscalculation of free space in bluestore? Also, could
: > it be related to the HEALTH_ERR backfill_toofull problem discussed here
: > in the other thread?
: >
: > Thanks,
: >
: > -Yenya
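The arithmetic roughly checks out, assuming (per Jakub's hint) that the full size of the data.db devices is counted as used space; all figures below are from the thread:

```python
# Rough accounting for the reported 23 TiB "used", assuming BlueStore
# counts the whole data.db devices as used space.
data = 4.6           # TiB of logical data (from "ceph -s")
replication = 3      # 3-way replicated pools
db_devices = 7.5     # TiB of data.db LVs added with the last OSD batch
expected = data * replication + db_devices
print(round(expected, 1))  # 21.3 -- in the ballpark of the reported 23 TiB
```

The remaining ~1.7 TiB gap would be per-OSD overhead and allocation granularity, not accounted for in this sketch.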
[ceph-users] Bluestore increased disk usage
Hello, ceph users,

I moved my cluster to bluestore (Ceph Mimic), and now I see increased disk usage. From ceph -s:

  pools:   8 pools, 3328 pgs
  objects: 1.23 M objects, 4.6 TiB
  usage:   23 TiB used, 444 TiB / 467 TiB avail

I use 3-way replication of my data, so I would expect the disk usage to be around 14 TiB, which was true when I used filestore-based Luminous OSDs before. Why is the disk usage now 23 TiB?

If I remember correctly (a big if!), the disk usage was about the same when I originally moved the data to empty bluestore OSDs by changing the crush rule, but went up after I added more bluestore OSDs and the cluster rebalanced itself.

Could it be some miscalculation of free space in bluestore? Also, could it be related to the HEALTH_ERR backfill_toofull problem discussed here in the other thread?

Thanks,

-Yenya
Re: [ceph-users] Downsizing a cephfs pool
Hello,

Brian Topping wrote:
: Hi all, I created a problem when moving data to Ceph and I would be grateful for some guidance before I do something dumb.
[...]
: Do I need to create new pools and copy again using cpio? Is there a better way?

I think I will be facing the same problem soon (moving my cluster from ~64 1-2TB OSDs to about 16 12TB OSDs). Maybe this is the way to go:

https://ceph.com/geen-categorie/ceph-pool-migration/

(I have not tested that, though.)

-Yenya
[ceph-users] pgs inactive after setting a new crush rule (Re: backfill_toofull after adding new OSDs)
Jan Kasprzak wrote:
: OKay, now I changed the crush rule also on a pool with
: the real data, and it seems all the client I/O on that pool has stopped.
: The recovery continues, but things like qemu I/O, "rbd ls", and so on
: are just stuck doing nothing.
:
: Can I unstuck it somehow (faster than waiting for all the recovery
: to finish)? Thanks.

I was able to briefly reduce the "1721 pgs inactive" number by restarting some of the original filestore OSDs, but after some time the number increased back to 1721. Then the data recovery finished, and 1721 PGs remained inactive (and, of course, this pool's I/O was stuck, both qemu and "rbd ls"). So I have returned to the original crush rule, the data started to migrate back to the original OSDs, and the client I/O got unstuck (even though the data relocation is still in progress).

Where could the problem be? Might I be hitting the limit on the number of PGs per OSD or something? I had 60 OSDs before, and want to move it all to 20 new OSDs instead. The pool in question has 2048 PGs.

Thanks,

-Yenya

: # ceph -s
: cluster:
:   id: ... my-uuid ...
:   health: HEALTH_ERR
:           3308311/3803892 objects misplaced (86.972%)
:           Reduced data availability: 1721 pgs inactive
:           Degraded data redundancy: 85361/3803892 objects degraded (2.244%), 139 pgs degraded, 139 pgs undersized
:           Degraded data redundancy (low space): 25 pgs backfill_toofull
:
: services:
:   mon: 3 daemons, quorum mon1,mon2,mon3
:   mgr: mon2(active), standbys: mon1, mon3
:   osd: 80 osds: 80 up, 80 in; 1868 remapped pgs
:   rgw: 1 daemon active
:
: data:
:   pools:   13 pools, 5056 pgs
:   objects: 1.27 M objects, 4.8 TiB
:   usage:   15 TiB used, 208 TiB / 224 TiB avail
:   pgs:     34.039% pgs not active
:            85361/3803892 objects degraded (2.244%)
:            3308311/3803892 objects misplaced (86.972%)
:            3188 active+clean
:            1582 activating+remapped
:            139  activating+undersized+degraded+remapped
:            93   active+remapped+backfill_wait
:            29   active+remapped+backfilling
:            25   active+remapped+backfill_wait+backfill_toofull
:
: io:
:   recovery: 174 MiB/s, 43 objects/s
:
: -Yenya
:
: Jan Kasprzak wrote:
: : : - Original Message -
: : : From: "Caspar Smit"
: : : To: "Jan Kasprzak"
: : : Cc: "ceph-users"
: : : Sent: Thursday, 31 January, 2019 15:43:07
: : : Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
: : :
: : : Hi Jan,
: : :
: : : You might be hitting the same issue as Wido here:
: : : https://www.spinics.net/lists/ceph-users/msg50603.html
: : :
: : : Kind regards,
: : : Caspar
: : :
: : : On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <k...@fi.muni.cz> wrote:
: : :
: : : Hello, ceph users,
: : :
: : : I see the following HEALTH_ERR during cluster rebalance:
: : :
: : : Degraded data redundancy (low space): 8 pgs backfill_toofull
: : :
: : : Detailed description:
: : : I have upgraded my cluster to mimic and added 16 new bluestore OSDs
: : : on 4 hosts. The hosts are in a separate region in my crush map, and crush
: : : rules prevented data from being moved to the new OSDs. Now I want to move
: : : all data to the new OSDs (and possibly decommission the old filestore OSDs).
: : : I have created the following rule:
: : :
: : : # ceph osd crush rule create-replicated on-newhosts newhostsroot host
: : :
: : : after this, I am slowly moving the pools one-by-one to this new rule:
: : :
: : : # ceph osd pool set test-hdd-pool crush_rule on-newhosts
: : :
: : : When I do this, I get the above error. This is misleading, because
: : : ceph osd df does not suggest the OSDs are getting full (the most full
: : : OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
: : : disappears. Why am I getting this error?
: : :
: : : # ceph -s
: : : cluster:
: : :   id: ...my UUID...
: : :   health: HEALTH_ERR
: : :           1271/3803223 objects misplaced (0.033%)
: : :           Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
: : :           Degraded data redundancy (low space): 8 pgs backfill_toofull
: : :
: : : services:
: : :   mon: 3 daemons, quorum mon1,mon2,mon3
: : :   mgr: mon2(active), standbys: mon1, mon3
: : :   osd: 80 osds: 80 up, 80 in; 90 remapped pgs
: : :   rgw: 1 daemon active
: : :
: : : data:
: : :   pools:   13 pools, 5056 pgs
: : :   objects: 1.27 M objects, 4.8 TiB
: : :   usage:   15 TiB used, 208 TiB / 224 TiB avail
: : :   pgs:     40124/3803223 objects degraded (1.055%)
: : :            1271/3803223 objects misplaced (0.033%)
: : :            4963 active+clean
: : :            41 active+recovery_wait+undersized+degraded+remapped
: : :            21
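The "limit of PGs per OSD" hypothesis can be checked with simple arithmetic. A sketch assuming the default mon_max_pg_per_osd of 250 (it was 200 in Luminous; check the value on your cluster, e.g. with "ceph config get mon mon_max_pg_per_osd") -- PGs that would push an OSD over this limit can get stuck "activating", which matches the status output above:

```python
# Sketch: would moving a 2048-PG, 3x-replicated pool onto 20 OSDs
# exceed the assumed mon_max_pg_per_osd limit of 250?
pool_pgs = 2048
replicas = 3
new_osds = 20
per_osd = pool_pgs * replicas / new_osds
print(per_osd)        # 307.2
print(per_osd > 250)  # True -- PGs over the limit stay "activating"
```

And that is just the one pool; the cluster has 5056 PGs in total, so the target OSDs would be even further over the limit.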
Re: [ceph-users] backfill_toofull after adding new OSDs
OKay, now I changed the crush rule also on a pool with the real data, and it seems all the client I/O on that pool has stopped. The recovery continues, but things like qemu I/O, "rbd ls", and so on are just stuck doing nothing.

Can I unstuck it somehow (faster than waiting for all the recovery to finish)? Thanks.

# ceph -s
cluster:
  id: ... my-uuid ...
  health: HEALTH_ERR
          3308311/3803892 objects misplaced (86.972%)
          Reduced data availability: 1721 pgs inactive
          Degraded data redundancy: 85361/3803892 objects degraded (2.244%), 139 pgs degraded, 139 pgs undersized
          Degraded data redundancy (low space): 25 pgs backfill_toofull

services:
  mon: 3 daemons, quorum mon1,mon2,mon3
  mgr: mon2(active), standbys: mon1, mon3
  osd: 80 osds: 80 up, 80 in; 1868 remapped pgs
  rgw: 1 daemon active

data:
  pools:   13 pools, 5056 pgs
  objects: 1.27 M objects, 4.8 TiB
  usage:   15 TiB used, 208 TiB / 224 TiB avail
  pgs:     34.039% pgs not active
           85361/3803892 objects degraded (2.244%)
           3308311/3803892 objects misplaced (86.972%)
           3188 active+clean
           1582 activating+remapped
           139  activating+undersized+degraded+remapped
           93   active+remapped+backfill_wait
           29   active+remapped+backfilling
           25   active+remapped+backfill_wait+backfill_toofull

io:
  recovery: 174 MiB/s, 43 objects/s

-Yenya

Jan Kasprzak wrote:
: : - Original Message -
: : From: "Caspar Smit"
: : To: "Jan Kasprzak"
: : Cc: "ceph-users"
: : Sent: Thursday, 31 January, 2019 15:43:07
: : Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
: :
: : Hi Jan,
: :
: : You might be hitting the same issue as Wido here:
: : https://www.spinics.net/lists/ceph-users/msg50603.html
: :
: : Kind regards,
: : Caspar
: :
: : On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <k...@fi.muni.cz> wrote:
: :
: : Hello, ceph users,
: :
: : I see the following HEALTH_ERR during cluster rebalance:
: :
: : Degraded data redundancy (low space): 8 pgs backfill_toofull
: :
: : Detailed description:
: : I have upgraded my cluster to mimic and added 16 new bluestore OSDs
: : on 4 hosts. The hosts are in a separate region in my crush map, and crush
: : rules prevented data from being moved to the new OSDs. Now I want to move
: : all data to the new OSDs (and possibly decommission the old filestore OSDs).
: : I have created the following rule:
: :
: : # ceph osd crush rule create-replicated on-newhosts newhostsroot host
: :
: : after this, I am slowly moving the pools one-by-one to this new rule:
: :
: : # ceph osd pool set test-hdd-pool crush_rule on-newhosts
: :
: : When I do this, I get the above error. This is misleading, because
: : ceph osd df does not suggest the OSDs are getting full (the most full
: : OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
: : disappears. Why am I getting this error?
: :
: : # ceph -s
: : cluster:
: :   id: ...my UUID...
: :   health: HEALTH_ERR
: :           1271/3803223 objects misplaced (0.033%)
: :           Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
: :           Degraded data redundancy (low space): 8 pgs backfill_toofull
: :
: : services:
: :   mon: 3 daemons, quorum mon1,mon2,mon3
: :   mgr: mon2(active), standbys: mon1, mon3
: :   osd: 80 osds: 80 up, 80 in; 90 remapped pgs
: :   rgw: 1 daemon active
: :
: : data:
: :   pools:   13 pools, 5056 pgs
: :   objects: 1.27 M objects, 4.8 TiB
: :   usage:   15 TiB used, 208 TiB / 224 TiB avail
: :   pgs:     40124/3803223 objects degraded (1.055%)
: :            1271/3803223 objects misplaced (0.033%)
: :            4963 active+clean
: :            41 active+recovery_wait+undersized+degraded+remapped
: :            21 active+recovery_wait+undersized+degraded
: :            17 active+remapped+backfill_wait
: :            5  active+remapped+backfill_wait+backfill_toofull
: :            3  active+remapped+backfill_toofull
: :            2  active+recovering+undersized+remapped
: :            2  active+recovering+undersized+degraded+remapped
: :            1  active+clean+remapped
: :            1  active+recovering+undersized+degraded
: :
: : io:
: :   client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
: :   recovery: 2.0 MiB/s, 92 objects/s
: :
: : Thanks for any hint,
: :
: : -Yenya
Re: [ceph-users] backfill_toofull after adding new OSDs
Fyodor Ustinov wrote:
: Hi!
:
: I saw the same several times when I added a new osd to the cluster. One
: or two PGs in "backfill_toofull" state.
:
: In all versions of mimic.

Yep. In my case it is not (only) after adding the new OSDs. An hour or so ago my cluster reached the HEALTH_OK state, so I moved another pool to the new hosts with "crush_rule on-newhosts". The result was an immediate backfill_toofull on two PGs for about five minutes, and then it reached HEALTH_OK again. So the PGs are not stuck in that state forever; they are there only during the data reshuffle. This is 13.2.4 on CentOS 7.

-Yenya

: - Original Message -
: From: "Caspar Smit"
: To: "Jan Kasprzak"
: Cc: "ceph-users"
: Sent: Thursday, 31 January, 2019 15:43:07
: Subject: Re: [ceph-users] backfill_toofull after adding new OSDs
:
: Hi Jan,
:
: You might be hitting the same issue as Wido here:
: https://www.spinics.net/lists/ceph-users/msg50603.html
:
: Kind regards,
: Caspar
:
: On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <k...@fi.muni.cz> wrote:
:
: Hello, ceph users,
:
: I see the following HEALTH_ERR during cluster rebalance:
:
: Degraded data redundancy (low space): 8 pgs backfill_toofull
:
: Detailed description:
: I have upgraded my cluster to mimic and added 16 new bluestore OSDs
: on 4 hosts. The hosts are in a separate region in my crush map, and crush
: rules prevented data from being moved to the new OSDs. Now I want to move
: all data to the new OSDs (and possibly decommission the old filestore OSDs).
: I have created the following rule:
:
: # ceph osd crush rule create-replicated on-newhosts newhostsroot host
:
: after this, I am slowly moving the pools one-by-one to this new rule:
:
: # ceph osd pool set test-hdd-pool crush_rule on-newhosts
:
: When I do this, I get the above error. This is misleading, because
: ceph osd df does not suggest the OSDs are getting full (the most full
: OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR
: disappears. Why am I getting this error?
:
: # ceph -s
: cluster:
:   id: ...my UUID...
:   health: HEALTH_ERR
:           1271/3803223 objects misplaced (0.033%)
:           Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
:           Degraded data redundancy (low space): 8 pgs backfill_toofull
:
: services:
:   mon: 3 daemons, quorum mon1,mon2,mon3
:   mgr: mon2(active), standbys: mon1, mon3
:   osd: 80 osds: 80 up, 80 in; 90 remapped pgs
:   rgw: 1 daemon active
:
: data:
:   pools:   13 pools, 5056 pgs
:   objects: 1.27 M objects, 4.8 TiB
:   usage:   15 TiB used, 208 TiB / 224 TiB avail
:   pgs:     40124/3803223 objects degraded (1.055%)
:            1271/3803223 objects misplaced (0.033%)
:            4963 active+clean
:            41 active+recovery_wait+undersized+degraded+remapped
:            21 active+recovery_wait+undersized+degraded
:            17 active+remapped+backfill_wait
:            5  active+remapped+backfill_wait+backfill_toofull
:            3  active+remapped+backfill_toofull
:            2  active+recovering+undersized+remapped
:            2  active+recovering+undersized+degraded+remapped
:            1  active+clean+remapped
:            1  active+recovering+undersized+degraded
:
: io:
:   client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
:   recovery: 2.0 MiB/s, 92 objects/s
:
: Thanks for any hint,
:
: -Yenya
[ceph-users] backfill_toofull after adding new OSDs
Hello, ceph users, I see the following HEALTH_ERR during cluster rebalance: Degraded data redundancy (low space): 8 pgs backfill_toofull Detailed description: I have upgraded my cluster to mimic and added 16 new bluestore OSDs on 4 hosts. The hosts are in a separate region in my crush map, and crush rules prevented data from being moved onto the new OSDs. Now I want to move all data to the new OSDs (and possibly decommission the old filestore OSDs). I have created the following rule: # ceph osd crush rule create-replicated on-newhosts newhostsroot host after this, I am slowly moving the pools one-by-one to this new rule: # ceph osd pool set test-hdd-pool crush_rule on-newhosts When I do this, I get the above error. This is misleading, because ceph osd df does not suggest the OSDs are getting full (the most full OSD is about 41 % full). After rebalancing is done, the HEALTH_ERR disappears. Why am I getting this error? # ceph -s cluster: id: ...my UUID... health: HEALTH_ERR 1271/3803223 objects misplaced (0.033%) Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized Degraded data redundancy (low space): 8 pgs backfill_toofull services: mon: 3 daemons, quorum mon1,mon2,mon3 mgr: mon2(active), standbys: mon1, mon3 osd: 80 osds: 80 up, 80 in; 90 remapped pgs rgw: 1 daemon active data: pools: 13 pools, 5056 pgs objects: 1.27 M objects, 4.8 TiB usage: 15 TiB used, 208 TiB / 224 TiB avail pgs: 40124/3803223 objects degraded (1.055%) 1271/3803223 objects misplaced (0.033%) 4963 active+clean 41 active+recovery_wait+undersized+degraded+remapped 21 active+recovery_wait+undersized+degraded 17 active+remapped+backfill_wait 5 active+remapped+backfill_wait+backfill_toofull 3 active+remapped+backfill_toofull 2 active+recovering+undersized+remapped 2 active+recovering+undersized+degraded+remapped 1 active+clean+remapped 1 active+recovering+undersized+degraded io: client: 6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr recovery: 2.0 MiB/s, 92 objects/s
Thanks for any hint, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
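For anyone hitting the same transient state, the PGs involved can be listed while the rebalance runs. A minimal sketch, assuming the Luminous/Mimic "ceph pg dump pgs_brief" output format (PG id in column 1, state in column 2); the sample lines and PG ids below are made up:

```shell
# Filter PG ids whose state contains backfill_toofull.
# On a live cluster: ceph pg dump pgs_brief 2>/dev/null | list_toofull
list_toofull() {
    awk '$2 ~ /backfill_toofull/ {print $1}'
}

# Demo on made-up sample lines resembling pgs_brief output:
list_toofull <<'EOF'
13.7f active+remapped+backfill_wait+backfill_toofull [4,11,20] 4 [4,11,20] 4
13.3a active+remapped+backfill_toofull [7,15,22] 7 [7,15,22] 7
13.10 active+clean [1,2,3] 1 [1,2,3] 1
EOF
```

which prints the two toofull PG ids (13.7f and 13.3a) from the sample and drops the clean one.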
Re: [ceph-users] Spec for Ceph Mon+Mgr?
jes...@krogh.cc wrote: : Hi. : : We're currently co-locating our mons with the head node of our Hadoop : installation. That may be giving us some problems, we don't know yet, but : thus I'm speculating about moving them to dedicated hardware. : : It is hard to get specifications "small" enough .. the specs for the : mon are what we would usually virtualize our way out of .. which seems very : wrong here. : : Are other people just co-locating it with something random or what are : others typically using in a small ceph cluster (< 100 OSDs .. 7 OSD hosts) Jesper, FWIW, we colocate our mons/mgrs with the OpenNebula master node and minor OpenNebula host nodes. As an example, one of them is an AMD Opteron 6134 (8 cores, 2.3 GHz), 16 GB RAM, 1 Gbit ethernet. We have three mons with a similar configuration. I want to keep this setup also in the future, but I may move the OpenNebula virtualization off the mon hosts - not because the hosts are overloaded, but because they are getting too old/slow/small for the VMs themselves :-). -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating to a dedicated cluster network
Jakub Jaszewski wrote: : Hi Yenya, : : Can I ask how your cluster looks and why you want to do the network : splitting? Jakub, we have deployed the Ceph cluster originally as a proof of concept for a private cloud. We run OpenNebula and Ceph on about 30 old servers with old HDDs (2 OSDs per host), all connected via 1 Gbit ethernet with 10Gbit backbone. Since then our private cloud got pretty popular among our users, so we are planning to upgrade it to a smaller number of modern servers. The new servers have two 10GbE interfaces, so the primary reasoning behind it is "why not use them both when we already have them". Of course, interface teaming/bonding is another option. Currently I see the network being saturated only when doing a live migration of a VM between the physical hosts, and then during a Ceph cluster rebalance. So, I don't think moving to a dedicated cluster network is a necessity for us. Anyway, does anybody use the cluster network with larger MTU (jumbo frames)? : We used to set up 9-12 OSD nodes (12-16 HDDs each) clusters using 2x10Gb : for access and 2x10Gb for cluster network, however, I don't see the reasons : to not use just one network for next cluster setup. -Yenya : Wed, 23 Jan 2019, 10:40 Jan Kasprzak wrote: : : > Hello, Ceph users, : > : > is it possible to migrate already deployed Ceph cluster, which uses : > public network only, to a split public/dedicated networks? If so, : > can this be done without service disruption? I have now got a new : > hardware which makes this possible, but I am not sure how to do it. : > : > Another question is whether the cluster network can be done : > solely on top of IPv6 link-local addresses without any public address : > prefix. : > : > When deploying this cluster (Ceph Firefly, IIRC), I had problems : > with mixed IPv4/IPv6 addressing, and ended up with ms_bind_ipv6 = false : > in my Ceph conf.
: > : > Thanks, : > : > -Yenya : > : > -- : > | Jan "Yenya" Kasprzak : > | : > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 : > | : > This is the world we live in: the way to deal with computers is to google : > the symptoms, and hope that you don't have to watch a video. --P. Zaitcev : > ___ : > ceph-users mailing list : > ceph-users@lists.ceph.com : > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com : > -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
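On the jumbo-frame question: no authoritative answer here, but a hedged sketch of how one would enable and, more importantly, verify a larger MTU end to end (the interface name and MTU value are assumptions; every switch port on the path must match, otherwise OSD traffic can fail in confusing ways):

```shell
CLUSTER_IF=ens2f1   # assumed cluster-network interface name
MTU=9000

# ip link set dev "$CLUSTER_IF" mtu "$MTU"   # needs root; shown, not run here

# Verify end-to-end with a non-fragmentable ICMP payload:
# IP header (20) + ICMP header (8) must fit inside the MTU.
PAYLOAD=$((MTU - 28))
echo "ping -M do -s $PAYLOAD <peer-cluster-ip>"
```

The -M do flag forbids fragmentation, so the ping only succeeds if the whole path really carries 9000-byte frames.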
[ceph-users] Migrating to a dedicated cluster network
Hello, Ceph users, is it possible to migrate an already deployed Ceph cluster, which uses the public network only, to split public/cluster networks? If so, can this be done without service disruption? I have now got new hardware which makes this possible, but I am not sure how to do it. Another question is whether the cluster network can be done solely on top of IPv6 link-local addresses without any public address prefix. When deploying this cluster (Ceph Firefly, IIRC), I had problems with mixed IPv4/IPv6 addressing, and ended up with ms_bind_ipv6 = false in my Ceph conf. Thanks, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
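For reference, the split itself is expressed with two options in ceph.conf; a minimal sketch with made-up subnets (whether daemons can be switched over one by one without disruption is exactly the open question above, so treat this as the target state, not a migration recipe):

```ini
[global]
; clients and mons talk on the public network
public network  = 192.0.2.0/24
; OSD replication and backfill traffic moves to the cluster network
cluster network = 198.51.100.0/24
```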
Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)
Alfredo, Alfredo Deza wrote: : On Fri, Jan 18, 2019 at 7:21 AM Jan Kasprzak wrote: : > Eugen Block wrote: : > : : > : I think you're running into an issue reported a couple of times. : > : For the use of LVM you have to specify the name of the Volume Group : > : and the respective Logical Volume instead of the path, e.g. : > : : > : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda : > thanks, I will try it. In the meantime, I have discovered another way : > how to get around it: convert my SSDs from MBR to GPT partition table, : > and then create 15 additional GPT partitions for the respective block.dbs : > instead of 2x15 LVs. : : This is because ceph-volume can accept both LVs or GPT partitions for block.db : : Another way around this, that doesn't require you to create the LVs is : to use the `batch` sub-command, that will automatically : detect your HDD and put data on it, and detect the SSD and create the : block.db LVs. The command could look something like: : : : ceph-volume lvm batch --bluestore /dev/sda /dev/sdb /dev/sdc /dev/sdd : /dev/nvme0n1 : : Would create 4 OSDs, place data on: sda, sdb, sdc, and sdd. And create : 4 block.db LVs on nvme0n1 Interesting. Thanks! Can the batch command accept also partitions instead of a whole device for block.db? I already have two partitions on my SSDs for root and swap. -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)
Eugen Block wrote: : Hi Jan, : : I think you're running into an issue reported a couple of times. : For the use of LVM you have to specify the name of the Volume Group : and the respective Logical Volume instead of the path, e.g. : : ceph-volume lvm prepare --bluestore --block.db ssd_vg/ssd00 --data /dev/sda Eugen, thanks, I will try it. In the meantime, I have discovered another way how to get around it: convert my SSDs from MBR to GPT partition table, and then create 15 additional GPT partitions for the respective block.dbs instead of 2x15 LVs. -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
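Putting Eugen's hint together into one sequence: a sketch (the device, VG and LV names are all made up, and the commands are only printed for review here, since they need a real host to run) of preparing one OSD with data on a bare HDD and block.db on a pre-created LV, using the vg/lv notation rather than a /dev path:

```shell
SSD_VG=ssd_vg      # assumed VG spanning the two SSDs
DB_LV=ssd00        # assumed LV for this OSD's block.db
DATA_DEV=/dev/sda  # assumed bare HDD

cat <<EOF
vgcreate $SSD_VG /dev/sdae /dev/sdaf
lvcreate -L 60G -n $DB_LV $SSD_VG
ceph-volume lvm prepare --bluestore --block.db $SSD_VG/$DB_LV --data $DATA_DEV
EOF
```

Note the --block.db argument: ssd_vg/ssd00, not /dev/ssd_vg/ssd00 - the /dev path form is what triggers the PARTUUID error.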
[ceph-users] block.db on a LV? (Re: Mixed SSD+HDD OSD setup recommendation)
Hello, Ceph users, replying to my own post from several weeks ago: Jan Kasprzak wrote: : [...] I plan to add new OSD hosts, : and I am looking for setup recommendations. : : Intended usage: : : - small-ish pool (tens of TB) for RBD volumes used by QEMU : - large pool for object-based cold (or not-so-hot :-) data, : write-once read-many access pattern, average object size : 10s or 100s of MBs, probably custom programmed on top of : libradosstriper. : : Hardware: : : The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs. : There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs, : leaving about 900 GB free on each SSD. : The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM. : : My questions: [...] : - block.db on SSDs? The docs recommend about 4 % of the data size : for block.db, but my SSDs are only 0.6 % of total storage size. : : - or would it be better to leave SSD caching on the OS and use LVMcache : or something? : : - LVM or simple volumes? I have a problem setting this up with ceph-volume: I want to have an OSD on each HDD, with block.db on the SSD. In order to set this up, I have created a VG on the two SSDs, created 30 LVs on top of it for block.db, and wanted to create an OSD using the following: # ceph-volume lvm prepare --bluestore \ --block.db /dev/ssd_vg/ssd00 \ --data /dev/sda [...] --> blkid could not detect a PARTUUID for device: /dev/cbia_ssd_vg/ssd00 --> Was unable to complete a new OSD, will rollback changes [...] Then it failed, because deploying a volume used the client.bootstrap-osd user, but trying to roll the changes back required the client.admin user, which does not have a keyring on the OSD host. Never mind. The problem is with determining the PARTUUID of the SSD LV for block.db. How can I deploy an OSD which is on top of a bare HDD, but which also has a block.db on an existing LV?
Thanks, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Radosgw cannot create pool
Hello, Ceph users, TL;DR: radosgw fails on me with the following message: 2019-01-17 09:34:45.247721 7f52722b3dc0 0 rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34) Numerical result out of range (this can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num or mon_max_pg_per_osd exceeded) Detailed description: I have a Ceph cluster installed a long time ago as firefly on CentOS 7, and now running luminous. So far I have used it for RBD pools, but now I want to try using radosgw as well. I tried to deploy radosgw using # ceph-deploy rgw create myhost which went well until it tried to start it up: [myhost][INFO ] Running command: service ceph-radosgw start [myhost][WARNIN] Redirecting to /bin/systemctl start ceph-radosgw.service [myhost][WARNIN] Failed to start ceph-radosgw.service: Unit not found. [myhost][ERROR ] RuntimeError: command returned non-zero exit status: 5 [ceph_deploy.rgw][ERROR ] Failed to execute command: service ceph-radosgw start [ceph_deploy][ERROR ] GenericError: Failed to create 1 RGWs Comparing it to my testing deployment of mimic, where radosgw works, the problem was with the unit name; the correct way to start it up apparently was # systemctl start ceph-radosgw@rgw.myhost.service Now it is apparently running: /usr/bin/radosgw -f --cluster ceph --name client.rgw.myhost --setuser ceph --setgroup ceph However, when I want to add the first user, radosgw-admin fails and radosgw itself exits with a similar message: # radosgw-admin user create --uid=kas --display-name="Jan Kasprzak" 2019-01-17 09:52:29.805828 7fea6cfd2dc0 0 rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34) Numerical result out of range (this can be due to a pool or placement group misconfiguration, e.g.
pg_num < pgp_num or mon_max_pg_per_osd exceeded) 2019-01-17 09:52:29.805957 7fea6cfd2dc0 -1 ERROR: failed to initialize watch: (34) Numerical result out of range couldn't init storage provider So I guess it is trying to create a pool for data, but it fails somehow. Can I determine which pool it is and what parameters it tries to use? I have looked at my testing mimic cluster, and radosgw there created the following pools: .rgw.root default.rgw.control default.rgw.meta default.rgw.log default.rgw.buckets.index default.rgw.buckets.data So I created these pools manually on my luminous cluster as well: # ceph osd pool create .rgw.root 128 (repeat for all the above pool names) Which helped, and I am able to create the user with radosgw-admin. Now where should I look for the exact parameters radosgw is trying to use when creating its pools? Thanks, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
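The manual pool creation above can be scripted; a sketch that only prints the commands for review (the pool list matches the one observed on the mimic cluster; pg_num 128 mirrors what was used above, though smaller values may be wiser for the mostly-empty metadata pools, given the PG-per-OSD limits discussed elsewhere on this list):

```shell
PG_NUM=128   # as used above; consider reducing for metadata pools

gen_pool_cmds() {
    for pool in .rgw.root default.rgw.control default.rgw.meta \
                default.rgw.log default.rgw.buckets.index \
                default.rgw.buckets.data; do
        printf 'ceph osd pool create %s %s\n' "$pool" "$PG_NUM"
    done
}

gen_pool_cmds        # review the commands first...
# gen_pool_cmds | sh # ...then execute (needs an admin keyring)
```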
[ceph-users] Get packages - incorrect link
Hello, Ceph users, I am not sure where to report the issue with the ceph.com website, so I am posting to this list: The https://ceph.com/use/ page has an incorrect link for getting the packages: "For packages, see http://ceph.com/docs/master/install/get-packages; - the URL should be http://docs.ceph.com/docs/master/install/get-packages/ instead (docs.ceph.com instead of ceph.com). Thanks in advance for fixing this. -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph blog RSS/Atom URL?
Gregory Farnum wrote: : It looks like ceph.com/feed is the RSS url? Close enough, thanks. Comparing the above with the blog itself, there are some posts in (apparently) Chinese in /feed, which are not present in /community/blog. The first one being https://ceph.com/planet/vdbench%e6%b5%8b%e8%af%95%e5%ae%9e%e6%97%b6%e5%8f%af%e8%a7%86%e5%8c%96%e6%98%be%e7%a4%ba/ -Yenya : On Fri, Jan 4, 2019 at 5:52 AM Jan Kasprzak wrote: : > is there any RSS or Atom source for Ceph blog? I have looked inside : > the https://ceph.com/community/blog/ HTML source, but there is no : > or anything mentioning RSS or Atom. -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph health JSON format has changed
Gregory Farnum wrote: : On Wed, Jan 2, 2019 at 5:12 AM Jan Kasprzak wrote: : : > Thomas Byrne - UKRI STFC wrote: : > : I recently spent some time looking at this, I believe the 'summary' and : > : 'overall_status' sections are now deprecated. The 'status' and 'checks' : > : fields are the ones to use now. : > : > OK, thanks. : > : > : The 'status' field gives you the OK/WARN/ERR, but returning the most : > : severe error condition from the 'checks' section is less trivial. AFAIK : > : all health_warn states are treated as equally severe, and same for : > : health_err. We ended up formatting our single line human readable output : > : as something like: : > : : > : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: : > 20 large omap objects" : > : > Speaking of scrub errors: : > : > In previous versions of Ceph, I was able to determine which PGs had : > scrub errors, and then a cron.hourly script ran "ceph pg repair" for them, : > provided that they were not already being scrubbed. In Luminous, the bad PG : > is not visible in "ceph --status" anywhere. Should I use something like : > "ceph health detail -f json-pretty" instead? : > : > Also, is it possible to configure Ceph to attempt repairing : > the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on : > top : > of a bunch of old spinning disks, and a scrub error almost always means : > that there is a bad sector somewhere, which can easily be fixed by : > rewriting the lost data using "ceph pg repair". : > : : It is possible. It's a lot safer than it used to be, but is still NOT : RECOMMENDED for replicated pools. : : But if you are very sure, you can use the options osd_scrub_auto_repair : (default: false) and osd_scrub_auto_repair_num_errors (default:5, which : will not auto-repair if scrub detects more errors than that value) to : configure it. OK, thanks. 
I just want to say that I am NOT very sure, but this is about the only way I am aware of to handle a scrub error. I have mail notification set up in smartd.conf, and so far the scrub errors seem to correlate with new reallocated or pending sectors. What are the drawbacks of running "ceph pg repair" as soon as the cluster enters the HEALTH_ERR state with a scrub error? Thanks for the explanation, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
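For the record, a sketch of the two knobs Greg mentions, again printed rather than executed (the "ceph config set" interface assumes a Mimic-era mon; on Luminous these would go into ceph.conf or be injected with injectargs instead):

```shell
AUTO_REPAIR=true
MAX_ERRORS=5   # do not auto-repair PGs with more scrub errors than this

printf 'ceph config set osd osd_scrub_auto_repair %s\n' "$AUTO_REPAIR"
printf 'ceph config set osd osd_scrub_auto_repair_num_errors %s\n' "$MAX_ERRORS"
# pipe the output to sh once reviewed
```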
Re: [ceph-users] ceph health JSON format has changed
Thomas Byrne - UKRI STFC wrote: : I recently spent some time looking at this, I believe the 'summary' and : 'overall_status' sections are now deprecated. The 'status' and 'checks' : fields are the ones to use now. OK, thanks. : The 'status' field gives you the OK/WARN/ERR, but returning the most : severe error condition from the 'checks' section is less trivial. AFAIK : all health_warn states are treated as equally severe, and same for : health_err. We ended up formatting our single line human readable output : as something like: : : "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 large omap objects" Speaking of scrub errors: In previous versions of Ceph, I was able to determine which PGs had scrub errors, and then a cron.hourly script ran "ceph pg repair" for them, provided that they were not already being scrubbed. In Luminous, the bad PG is not visible in "ceph --status" anywhere. Should I use something like "ceph health detail -f json-pretty" instead? Also, is it possible to configure Ceph to attempt repairing the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on top of a bunch of old spinning disks, and a scrub error almost always means that there is a bad sector somewhere, which can easily be fixed by rewriting the lost data using "ceph pg repair". Thanks, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
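The cron.hourly approach described above can be rebuilt on the Luminous output; a sketch assuming "ceph health detail" prints lines of the form "pg 13.6 is active+clean+inconsistent, acting [...]" (the sample input below is made up, and the repair commands are echoed rather than run):

```shell
# Extract PG ids of inconsistent PGs from "ceph health detail" output.
repair_candidates() {
    awk '/^ *pg [0-9]+\.[0-9a-f]+ is .*inconsistent/ {print $2}'
}

sample='    pg 13.6 is active+clean+inconsistent, acting [1,7,3]
    pg 2.1a is active+clean, acting [4,5,6]'

# On a live cluster: ceph health detail | repair_candidates | while ...
printf '%s\n' "$sample" | repair_candidates | while read -r pg; do
    echo "ceph pg repair $pg"    # echo first; drop the echo to act
done
```

Dropping the echo makes it actually issue the repairs; checking that the PG is not currently being scrubbed, as the original script did, is left out of this sketch.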
[ceph-users] ceph health JSON format has changed
Hello, Ceph users, I am afraid the following question is a FAQ, but I still was not able to find the answer: I use ceph --status --format=json-pretty as a source of CEPH status for my Nagios monitoring. After upgrading to Luminous, I see the following in the JSON output when the cluster is not healthy: "summary": [ { "severity": "HEALTH_WARN", "summary": "'ceph health' JSON format has changed in luminous. If you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health preluminous compat warning = false'" } ], Apart from that, the JSON data seems reasonable. My question is which part of JSON structure are the "wrong fields" I have to avoid. Is it just the "summary" section, or some other parts as well? Or should I avoid the whole ceph --status and use something different instead? What I want is a single machine-readable value with OK/WARNING/ERROR meaning, and a single human-readable text line, describing the most severe error condition which is currently present. What is the preferred way to get this data in Luminous? Thanks, -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
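A sketch of the extraction this suggests: read only the "status" field from the JSON, leaving the deprecated "summary"/"overall_status" sections alone (the sample document below is trimmed and made up; on a live cluster, feed the filter "ceph status --format=json" instead):

```shell
# Print the overall health status from a "ceph status" JSON document.
status_of() {
    python3 -c 'import json,sys; print(json.load(sys.stdin)["health"]["status"])'
}

echo '{"health": {"status": "HEALTH_WARN", "checks": {}}}' | status_of
# prints: HEALTH_WARN
```

The human-readable per-condition messages live under health.checks in the same document, keyed by check name (e.g. PG_DEGRADED), each with a severity and a summary message.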
[ceph-users] Mixed SSD+HDD OSD setup recommendation
Hello, CEPH users, having upgraded my CEPH cluster to Luminous, I plan to add new OSD hosts, and I am looking for setup recommendations. Intended usage: - small-ish pool (tens of TB) for RBD volumes used by QEMU - large pool for object-based cold (or not-so-hot :-) data, write-once read-many access pattern, average object size 10s or 100s of MBs, probably custom programmed on top of libradosstriper. Hardware: The new OSD hosts have ~30 HDDs 12 TB each, and two 960 GB SSDs. There is a small RAID-1 root and RAID-1 swap volume spanning both SSDs, leaving about 900 GB free on each SSD. The OSD hosts have two CPU sockets (32 cores including SMT), 128 GB RAM. My questions: - Filestore or Bluestore? -> probably the latter, but I am also considering using the OSD hosts for QEMU-based VMs which are not performance critical, and then having the kernel balance the memory usage between ceph-osd and qemu processes (using Filestore) would probably be better? Am I right? - block.db on SSDs? The docs recommend about 4 % of the data size for block.db, but my SSDs are only 0.6 % of total storage size. - or would it be better to leave SSD caching on the OS and use LVMcache or something? - LVM or simple volumes? I find it a bit strange and bloated to create 32 VGs, each VG for a single HDD or SSD, and have 30 VGs with only one LV. Could I use /dev/disk/by-id/wwn-0x5000 symlinks to have stable device names instead, and have only two VGs for two SSDs? Thanks for any recommendations. -Yenya -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
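On the block.db sizing question, the arithmetic with the numbers from this post (which shows why the 4 % guideline is out of reach here):

```shell
SSD_FREE_GB=$((2 * 900))   # two SSDs, ~900 GB free each
OSDS=30                    # one OSD per HDD
PER_OSD_DB_GB=$((SSD_FREE_GB / OSDS))

HDD_GB=12000               # 12 TB HDD, in decimal GB
# fraction of one HDD, in tenths of a percent (integer arithmetic)
PCT_TENTHS=$((PER_OSD_DB_GB * 1000 / HDD_GB))

echo "$PER_OSD_DB_GB GB block.db per OSD (~0.$PCT_TENTHS % of each HDD)"
```

This prints "60 GB block.db per OSD (~0.5 % of each HDD)" - an order of magnitude below the 4 % recommendation, which is the trade-off the question is really about.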
Re: [ceph-users] Upgrade to Luminous (mon+osd)
Dan van der Ster wrote: : It's not that simple see http://tracker.ceph.com/issues/21672 : : For the 12.2.8 to 12.2.10 upgrade it seems the selinux module was : updated -- so the rpms restart the ceph.target. : What's worse is that this seems to happen before all the new updated : files are in place. : : Our 12.2.8 to 12.2.10 upgrade procedure is: : : systemctl stop ceph.target : yum update : systemctl start ceph.target Yes, this looks reasonable. Except that when upgrading from Jewel, even after the restart the OSDs do not work until _all_ mons are upgraded. So effectively if a PG happens to be placed on the mon hosts only, there will be service outage during upgrade from Jewel. So I guess the upgrade procedure described here: http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken is misleading - the mons and osds get restarted anyway by the package upgrade itself. The user should be warned that for this reason the package upgrades should be run sequentially. And that the upgrade is not possible without service outage, when there are OSDs on the mon hosts and when the cluster is running under SELinux. Also, there is another important thing omitted by the above upgrade procedure: After "ceph osd require-osd-release luminous" I have got HEALTH_WARN saying "application not enabled on X pool(s)". I have fixed this by running the following scriptlet: ceph osd pool ls | while read pool; do ceph osd pool application enable $pool rbd; done (yes, all of my pools are used for rbd for now). Maybe this should be fixed in the release notes as well. Thanks, -Yenya : On Mon, Dec 3, 2018 at 12:42 PM Paul Emmerich wrote: : > : > Upgrading Ceph packages does not restart the services -- exactly for : > this reason. : > : > This means there's something broken with your yum setup if the : > services are restarted when only installing the new version. : > : > : > Paul : > : > -- : > Paul Emmerich : > : > Looking for help with your Ceph cluster? 
Contact us at https://croit.io : > : > croit GmbH : > Freseniusstr. 31h : > 81247 München : > www.croit.io : > Tel: +49 89 1896585 90 : > : > Am Mo., 3. Dez. 2018 um 11:56 Uhr schrieb Jan Kasprzak : : > > : > > Hello, ceph users, : > > : > > I have a small(-ish) Ceph cluster, where there are osds on each host, : > > and in addition to that, there are mons on the first three hosts. : > > Is it possible to upgrade the cluster to Luminous without service : > > interruption? : > > : > > I have tested that when I run "yum --enablerepo Ceph update" on a : > > mon host, the osds on that host remain down until all three mons : > > are upgraded to Luminous. Is it possible to upgrade ceph-mon only, : > > and keep ceph-osd running the old version (Jewel in my case) as long : > > as possible? It seems RPM dependencies forbid this, but with --nodeps : > > it could be done. : > > : > > Is there a supported way how to upgrade host running both mon and osd : > > to Luminous? : > > : > > Thanks, : > > : > > -Yenya : > > : > > -- : > > | Jan "Yenya" Kasprzak | : > > | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | : > > This is the world we live in: the way to deal with computers is to google : > > the symptoms, and hope that you don't have to watch a video. --P. Zaitcev : > > ___ : > > ceph-users mailing list : > > ceph-users@lists.ceph.com : > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com : > ___ : > ceph-users mailing list : > ceph-users@lists.ceph.com : > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Upgrade to Luminous (mon+osd)
Paul Emmerich wrote: : Upgrading Ceph packages does not restart the services -- exactly for : this reason. : : This means there's something broken with your yum setup if the : services are restarted when only installing the new version. Interesting. I have verified that I have CEPH_AUTO_RESTART_ON_UPGRADE=no in my /etc/sysconfig/ceph, yet my ceph-osd daemons get restarted on upgrade. I have watched "ps ax|grep ceph-osd" output during "yum --enablerepo Ceph update", and it seems the OSDs got restarted near the time ceph-selinux got upgraded: Updating : 2:ceph-base-12.2.10-0.el7.x86_64 74/248 Updating : 2:ceph-selinux-12.2.10-0.el7.x86_64 75/248 Updating : 2:ceph-mon-12.2.10-0.el7.x86_64 76/248 And indeed, rpm -q --scripts ceph-selinux shows that this package restarts the whole ceph.target when the labels got changed: [...] # Check whether the daemons are running /usr/bin/systemctl status ceph.target > /dev/null 2>&1 STATUS=$? # Stop the daemons if they were running if test $STATUS -eq 0; then /usr/bin/systemctl stop ceph.target > /dev/null 2>&1 fi [...] So maybe ceph-selinux should also honor CEPH_AUTO_RESTART_ON_UPGRADE=no in /etc/sysconfig/ceph ? But I am not sure whether it is possible at all, when the labels got changed. -Yenya : Am Mo., 3. Dez. 2018 um 11:56 Uhr schrieb Jan Kasprzak : : > : > I have a small(-ish) Ceph cluster, where there are osds on each host, : > and in addition to that, there are mons on the first three hosts. : > Is it possible to upgrade the cluster to Luminous without service : > interruption? : > : > I have tested that when I run "yum --enablerepo Ceph update" on a : > mon host, the osds on that host remain down until all three mons : > are upgraded to Luminous. Is it possible to upgrade ceph-mon only, : > and keep ceph-osd running the old version (Jewel in my case) as long : > as possible? It seems RPM dependencies forbid this, but with --nodeps : > it could be done. 
: > : > Is there a supported way how to upgrade host running both mon and osd : > to Luminous? -- | Jan "Yenya" Kasprzak | | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Upgrade to Luminous (mon+osd)
Hello, ceph users,

I have a small(-ish) Ceph cluster, where there are osds on each host, and in addition to that, there are mons on the first three hosts. Is it possible to upgrade the cluster to Luminous without service interruption?

I have tested that when I run "yum --enablerepo Ceph update" on a mon host, the osds on that host remain down until all three mons are upgraded to Luminous. Is it possible to upgrade ceph-mon only, and keep ceph-osd running the old version (Jewel in my case) as long as possible? It seems RPM dependencies forbid this, but with --nodeps it could be done.

Is there a supported way to upgrade a host running both mon and osd to Luminous?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
This is the world we live in: the way to deal with computers is to google the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Atomic object replacement with libradosstriper
Hello, Ceph users,

I would like to use RADOS as an object store (I have written about it to this list a while ago), and I would like to use libradosstriper with C, as has been suggested to me here.

My question is: when writing an object, is it possible to do it so that either the old version as a whole or the new version as a whole is visible to readers at all times? Also, when creating a new object, only the fully written new object should be visible. Is it possible to do this with libradosstriper?

With a POSIX filesystem, one would do write(tmpfile)+fsync()+rename() to achieve similar results.

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
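For reference, here is the POSIX pattern the post compares against, sketched on a local filesystem (file names are made up for the demo). This is not a RADOS feature; whether libradosstriper can offer the same guarantee is exactly the question being asked.

```shell
# write+fsync+rename on a local filesystem: rename(2) is atomic, so a
# reader of "obj" sees either the old content or the new content, never
# a partially written mix.
dir=$(mktemp -d)
printf 'old contents\n' > "$dir/obj"       # the version readers currently see

printf 'new contents\n' > "$dir/obj.tmp"   # write the new version under a temp name
sync                                       # stand-in for fsync() on the temp file
mv -f "$dir/obj.tmp" "$dir/obj"            # atomic replacement via rename(2)

cat "$dir/obj"                             # prints: new contents
```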
Re: [ceph-users] pgs stuck unclean after removing OSDs
David Turner wrote:
: A couple things. You didn't `ceph osd crush remove osd.21` after doing the
: other bits. Also you will want to remove the bucket (re: host) from the
: crush map, as it will now be empty. Right now you have a host in the crush
: map with a weight, but no osds to put that data on. It has a weight
: because of the 2 OSDs that are still in it that were removed from the
: cluster but not from the crush map. It's confusing to your cluster.

OK, this helped. I have removed osd.20 and osd.21 from the crush map, as well as the bucket for the faulty host. PGs got unstuck, and after some time, my system now reports HEALTH_OK.

Thanks for the hint!

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
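Putting David's advice together with the commands from the original post, the complete removal sequence can be sketched as below. The helper only *prints* the standard ceph CLI commands so it can be reviewed before being run on a real cluster; "faultyhost" is a placeholder for the actual host bucket name.

```shell
# Print the full removal sequence for one OSD, including the crush step
# that was missing in the original attempt.
purge_osd_cmds() {
    id="$1"
    echo "ceph osd crush remove osd.$id"   # drop the OSD from the CRUSH map
    echo "ceph auth del osd.$id"           # remove its cephx key
    echo "ceph osd rm osd.$id"             # remove it from the OSD map
}

purge_osd_cmds 20
purge_osd_cmds 21
# Finally, drop the now-empty host bucket from the CRUSH map:
echo "ceph osd crush remove faultyhost"    # placeholder host name
```

On a live cluster one could pipe the output to sh after checking it.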
[ceph-users] pgs stuck unclean after removing OSDs
Hello,

TL;DR: what to do when my cluster reports stuck unclean pgs?

Detailed description:

One of the nodes in my cluster died. CEPH correctly rebalanced itself, and reached the HEALTH_OK state. I have looked at the failed server, and decided to take it out of the cluster permanently, because the hardware is indeed faulty. It used to host two OSDs, which were marked down and out in "ceph osd dump". So from the HEALTH_OK state I ran the following commands:

# ceph auth del osd.20
# ceph auth del osd.21
# ceph osd rm osd.20
# ceph osd rm osd.21

After that, CEPH started to rebalance itself, but now it reports some PGs as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":

# ceph -s
    cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
     health HEALTH_WARN
            350 pgs stuck unclean
            recovery 26/1596390 objects degraded (0.002%)
            recovery 58772/1596390 objects misplaced (3.682%)
     monmap e16: 3 mons at {...}
            election epoch 584, quorum 0,1,2 ...
     osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
            flags require_jewel_osds
      pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
            6244 GB used, 40569 GB / 46814 GB avail
            26/1596390 objects degraded (0.002%)
            58772/1596390 objects misplaced (3.682%)
                3426 active+clean
                 349 active+remapped
                   1 active
  client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr

# ceph health detail
HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
pg 28.fa is stuck unclean for 14408925.966824, current state active+remapped, last acting [38,52,4]
pg 28.e7 is stuck unclean for 14408925.966886, current state active+remapped, last acting [29,42,22]
pg 23.dc is stuck unclean for 61698.641750, current state active+remapped, last acting [50,33,23]
pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped, last acting [54,31,23]
pg 28.df is stuck unclean for 14408925.967120, current state active+remapped, last acting [33,7,15]
pg 34.38 is stuck unclean for 60904.322881, current state active+remapped, last acting [18,41,9]
pg 34.fe is stuck unclean for 60904.241762, current state active+remapped, last acting [58,1,44]
[...]
pg 28.8f is stuck unclean for 66102.059671, current state active, last acting [8,40,5]
[...]
recovery 26/1596390 objects degraded (0.002%)
recovery 58772/1596390 objects misplaced (3.682%)

Apart from that, the data stored in CEPH pools seems to be reachable and usable as before. The nodes run CentOS 7 and ceph 10.2.5 (RPMs downloaded from the CEPH repository).

What other debugging info should I provide, or what should I do in order to unstick the stuck pgs?

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados rm: device or resource busy
Hello,

Brad Hubbard wrote:
: I can reproduce this.
[...]
: That's here where you will notice it is returning EBUSY which is error
: code 16, "Device or resource busy".
:
: https://github.com/badone/ceph/blob/wip-ceph_test_admin_socket_output/src/cls/lock/cls_lock.cc#L189
:
: In order to remove the existing parts of the file you should be able
: to just run "rados --pool testpool ls" and remove the listed objects
: belonging to "testfile".
:
: Example:
: rados --pool testpool ls
: testfile.0004
: testfile.0001
: testfile.
: testfile.0003
: testfile.0005
: testfile.0002
:
: rados --pool testpool rm testfile.
: rados --pool testpool rm testfile.0001
: ...

This works for me, thanks!

: Please open a tracker for this so it can be investigated further.

Done: http://tracker.ceph.com/issues/20233

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
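Brad's per-object rm commands can be done in one pass. A sketch, driven here by his example listing (the `list_pool` here-doc stands in for `rados --pool testpool ls` on a real cluster; "otherobject" is a made-up unrelated object that must be left alone):

```shell
# Stand-in for `rados --pool testpool ls`:
list_pool() {
    cat <<'EOF'
testfile.0004
testfile.0001
testfile.
testfile.0003
testfile.0005
testfile.0002
otherobject
EOF
}

# Select only the pieces belonging to "testfile" and print the rm commands
# (drop the echo to actually remove them on a real cluster):
list_pool | grep '^testfile\.' | while read -r obj; do
    echo "rados --pool testpool rm $obj"
done
```

Note the pattern `^testfile\.` deliberately matches the bare "testfile." piece as well as the numbered ones, since Brad removes that piece too.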
Re: [ceph-users] rados rm: device or resource busy
Hello,

David Turner wrote:
: How long have you waited?

About a day.

: I don't do much with rados objects directly. I usually use RBDs and
: cephfs. If you just need to clean things up, you can delete the pool and
: recreate it, since it looks like it's testing. However, this is probably a
: prime time to figure out how to get past this in case it happens in the
: future in production.

Yes. This is why I am asking now.

-Yenya

: On Thu, Jun 8, 2017 at 11:04 AM Jan Kasprzak <k...@fi.muni.cz> wrote:
: > I have created a RADOS striped object using
: >
: > $ dd someargs | rados --pool testpool --striper put testfile -
: >
: > and interrupted it in the middle of writing. Now I cannot remove this
: > object:
: >
: > $ rados --pool testpool --striper rm testfile
: > error removing testpool>testfile: (16) Device or resource busy
: >
: > How can I tell CEPH that the writer is no longer around and does not
: > come back, so that I can remove the object "testfile"?

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rados rm: device or resource busy
Hello,

I have created a RADOS striped object using

$ dd someargs | rados --pool testpool --striper put testfile -

and interrupted it in the middle of writing. Now I cannot remove this object:

$ rados --pool testpool --striper rm testfile
error removing testpool>testfile: (16) Device or resource busy

How can I tell CEPH that the writer is no longer around and does not come back, so that I can remove the object "testfile"?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS as a simple object storage
Wido den Hollander wrote:
:
: > On 27 February 2017 at 15:59, Jan Kasprzak <k...@fi.muni.cz> wrote:
: >
: > : > : > Here are some statistics from our biggest instance of the object storage:
: > : > : >
: > : > : > objects stored: 100_000_000
: > : > : >  < 1024 bytes:   10_000_000
: > : > : >  1k-64k bytes:   80_000_000
: > : > : >  64k-4M bytes:   10_000_000
: > : > : >  4M-256M bytes:   1_000_000
: > : > : >  > 256M bytes:       10_000
: > : > : > biggest object: 15 GBytes
: > : > : >
: > : > : > Would it be feasible to put 100M to 1G objects as native RADOS objects
: > : > : > into a single pool?
[...]
: > https://github.com/ceph/ceph/blob/master/src/libradosstriper/RadosStriperImpl.cc#L33
: >
: > If I understand it correctly, it looks like libradosstriper only splits
: > large stored objects into smaller pieces (RADOS objects), but does not
: > consolidate more small stored objects into larger RADOS objects.
:
: Why would you want to do that? Yes, very small objects can be a problem
: if you have millions of them, since it takes a bit more to replicate
: them and recover them.

Yes, this is what I was afraid of. The immutability of my objects would allow consolidating smaller objects into larger bundles, but if you say it is not necessary at my scale, I'll store them as individual RADOS objects.

:
: But overall I wouldn't bother about it too much.

OK, thanks!

: > So do you think I am ok with >10M tiny objects (smaller than 1KB)
: > and ~100,000,000 to 1,000,000,000 total objects, provided that I split
: > huge objects using libradosstriper?

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
"Assuming that OpenSSL is written as carefully as Wietse's own code, every 1000 lines introduce one additional bug into Postfix." --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS as a simple object storage
Hello,

Gregory Farnum wrote:
: On Mon, Feb 20, 2017 at 11:57 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > Gregory Farnum wrote:
: > : On Mon, Feb 20, 2017 at 6:46 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > : >
: > : > I have been using CEPH RBD for a year or so as a virtual machine storage
: > : > backend, and I am thinking about moving another of our subsystems to CEPH:
[...]
: > : > Here are some statistics from our biggest instance of the object storage:
: > : >
: > : > objects stored: 100_000_000
: > : >  < 1024 bytes:   10_000_000
: > : >  1k-64k bytes:   80_000_000
: > : >  64k-4M bytes:   10_000_000
: > : >  4M-256M bytes:   1_000_000
: > : >  > 256M bytes:       10_000
: > : > biggest object: 15 GBytes
: > : >
: > : > Would it be feasible to put 100M to 1G objects as native RADOS objects
: > : > into a single pool?
: > :
: > : This is well outside the object size RADOS is targeted or tested with;
: > : I'd expect issues. You might want to look at libradosstriper from the
: > : requirements you've mentioned.
: >
: > OK, thanks! Is there any documentation for libradosstriper?
: > I am looking for something similar to the librados documentation:
: > http://docs.ceph.com/docs/master/rados/api/librados/
:
: Not that I see, and I haven't used it myself, but the header file (see
: ceph/src/libradosstriper) seems to have reasonable function docs. It's
: a fairly thin wrapper around librados AFAIK.

OK, I have read the docs in the header file and the comment near the top of RadosStriperImpl.cc:

https://github.com/ceph/ceph/blob/master/src/libradosstriper/RadosStriperImpl.cc#L33

If I understand it correctly, it looks like libradosstriper only splits large stored objects into smaller pieces (RADOS objects), but does not consolidate more small stored objects into larger RADOS objects.

So do you think I am ok with >10M tiny objects (smaller than 1KB) and ~100,000,000 to 1,000,000,000 total objects, provided that I split huge objects using libradosstriper?

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
"Assuming that OpenSSL is written as carefully as Wietse's own code, every 1000 lines introduce one additional bug into Postfix." --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RADOS as a simple object storage
Gregory Farnum wrote:
: On Mon, Feb 20, 2017 at 6:46 AM, Jan Kasprzak <k...@fi.muni.cz> wrote:
: > Hello, world!\n
: >
: > I have been using CEPH RBD for a year or so as a virtual machine storage
: > backend, and I am thinking about moving another of our subsystems to CEPH:
: >
: > The subsystem in question is a simple replicated object storage,
: > currently implemented in custom C code by yours truly. My question
: > is whether implementing such a thing on top of a CEPH RADOS pool and
: > librados is feasible, and what layout and optimizations you would suggest.
: >
: > Our object storage indexes objects by a numeric ID. The access methods
: > involve creating, reading and deleting objects. Objects are never modified
: > in place; they are instead deleted, and an object with a new ID is created.
: > We also keep a hash of an object's contents and use it to prevent bit rot:
: > the objects are scrubbed periodically, and if a checksum mismatch is
: > discovered, the object is restored from another replica.
: >
: > Here are some statistics from our biggest instance of the object storage:
: >
: > objects stored: 100_000_000
: >  < 1024 bytes:   10_000_000
: >  1k-64k bytes:   80_000_000
: >  64k-4M bytes:   10_000_000
: >  4M-256M bytes:   1_000_000
: >  > 256M bytes:       10_000
: > biggest object: 15 GBytes
: >
: > Would it be feasible to put 100M to 1G objects as native RADOS objects
: > into a single pool?
:
: This is well outside the object size RADOS is targeted or tested with;
: I'd expect issues. You might want to look at libradosstriper from the
: requirements you've mentioned.

OK, thanks! Is there any documentation for libradosstriper? I am looking for something similar to the librados documentation:

http://docs.ceph.com/docs/master/rados/api/librados/

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
"Assuming that OpenSSL is written as carefully as Wietse's own code, every 1000 lines introduce one additional bug into Postfix." --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RADOS as a simple object storage
Hello, world!\n

I have been using CEPH RBD for a year or so as a virtual machine storage backend, and I am thinking about moving another of our subsystems to CEPH:

The subsystem in question is a simple replicated object storage, currently implemented in custom C code by yours truly. My question is whether implementing such a thing on top of a CEPH RADOS pool and librados is feasible, and what layout and optimizations you would suggest.

Our object storage indexes objects by a numeric ID. The access methods involve creating, reading and deleting objects. Objects are never modified in place; they are instead deleted, and an object with a new ID is created. We also keep a hash of an object's contents and use it to prevent bit rot: the objects are scrubbed periodically, and if a checksum mismatch is discovered, the object is restored from another replica.

Here are some statistics from our biggest instance of the object storage:

objects stored: 100_000_000
 < 1024 bytes:   10_000_000
 1k-64k bytes:   80_000_000
 64k-4M bytes:   10_000_000
 4M-256M bytes:   1_000_000
 > 256M bytes:       10_000
biggest object: 15 GBytes

Would it be feasible to put 100M to 1G objects as native RADOS objects into a single pool? Or should I take their read-only nature into account and pack them into bigger objects with metadata stored in a tmap object, repacking those packed objects periodically as older objects get deleted?

I have also considered rados-gw, but it looks like too big a hammer for my nail :-)

Thanks for your suggestions,

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
"Assuming that OpenSSL is written as carefully as Wietse's own code, every 1000 lines introduce one additional bug into Postfix." --TLS_README
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
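The checksum-based scrub described in the post can be sketched as below, run over three local files standing in for replicas (file names and payloads are made up for the demo; a real scrubber would fetch the copies from different storage nodes and pick a reference hash by majority, not by trusting one replica).

```shell
# Detect bit rot by comparing each replica's hash against a reference.
dir=$(mktemp -d)
printf 'object payload\n'  > "$dir/replica1"
printf 'object payload\n'  > "$dir/replica2"
printf 'bit-rotted junk\n' > "$dir/replica3"   # simulated corruption

# Reference hash; the demo simply trusts replica1.
good=$(sha256sum "$dir/replica1" | cut -d' ' -f1)

for r in "$dir"/replica*; do
    sum=$(sha256sum "$r" | cut -d' ' -f1)
    if [ "$sum" = "$good" ]; then
        echo "$(basename "$r"): OK"
    else
        echo "$(basename "$r"): MISMATCH, restore from a healthy replica"
    fi
done
```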