[ceph-users] ceph-volume failed after replacing disk
Hi all,

We replaced a faulty disk out of N OSDs and tried to follow the steps in "Replacing an OSD" at http://docs.ceph.com/docs/nautilus/rados/operations/add-or-rm-osds/, but got an error:

# ceph osd destroy 71 --yes-i-really-mean-it
# ceph-volume lvm create --bluestore --data /dev/data/lv01 --osd-id 71 --block.db /dev/db/lv01
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
--> RuntimeError: The osd ID 71 is already in use or does not exist.

"ceph -s" still showed N OSDs, so I then removed the OSD with "ceph osd rm 71". Now "ceph -s" shows N-1 OSDs and id 71 no longer appears in "ceph osd ls". However, repeating the ceph-volume command still gives the same error. We're running Ceph 14.2.1. I must have missed some steps. Would anyone please help?

Thanks a lot.
Rgds,
/stwong

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
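For reference, a minimal sketch of the documented replacement sequence, using the LV names from the post (the zap steps are only needed if the LVs held a previous OSD; this is not verified against 14.2.1). Note that once "ceph osd rm 71" has been run the ID no longer exists in the map, so one option is to drop --osd-id and let ceph-volume allocate a fresh ID:

# ceph osd destroy 71 --yes-i-really-mean-it          # keeps the ID reserved, unlike "ceph osd rm"
# ceph-volume lvm zap /dev/data/lv01                  # wipe any previous OSD signature on the replacement LVs
# ceph-volume lvm zap /dev/db/lv01
# ceph-volume lvm create --bluestore --data /dev/data/lv01 --block.db /dev/db/lv01 --osd-id 71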
Re: [ceph-users] Faux-Jewel Client Features
Hi all,

Starting to make preparations for Nautilus upgrades from Mimic, and I'm looking over my client/session features and trying to fully grasp the situation.

> $ ceph versions
> {
>   "mon": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 3 },
>   "mgr": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 3 },
>   "osd": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 204 },
>   "mds": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 2 },
>   "overall": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 212 }
> }

> $ ceph features
> {
>   "mon": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3 } ],
>   "mds": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 2 } ],
>   "osd": [ { "features": "0x3ffddff8ffacfffb", "num": 204 } ],
>   "client": [
>     { "features": "0x7010fb86aa42ada", "release": "jewel", "num": 4 },
>     { "features": "0x7018fb86aa42ada", "release": "jewel", "num": 1 },
>     { "features": "0x3ffddff8eea4fffb", "release": "luminous", "num": 344 },
>     { "features": "0x3ffddff8eeacfffb", "release": "luminous", "num": 200 },
>     { "features": "0x3ffddff8ffa4fffb", "release": "luminous", "num": 49 },
>     { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 213 } ],
>   "mgr": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3 } ]
> }

> $ ceph osd dump | grep compat
> require_min_compat_client luminous
> min_compat_client luminous

I flattened the output to make it a bit more vertical-scrolling friendly. Diving into the actual clients with those features:

> # ceph daemon mon.mon1 sessions | grep jewel
> "MonSession(client.1649789192 ip.2:0/3697083337 is open allow *, features 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1656508179 ip.202:0/2664244117 is open allow *, features 0x7018fb86aa42ada (jewel))",
> "MonSession(client.1637479106 ip.250:0/1882319989 is open allow *, features 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1662023903 ip.249:0/3198281565 is open allow *, features 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1658312940 ip.251:0/3538168209 is open allow *, features 0x7010fb86aa42ada (jewel))",

ip.2 is a cephfs kernel client with kernel 4.15.0-51-generic
ip.202 is a krbd client with kernel 4.18.0-22-generic
ip.250 is a krbd client with kernel 4.15.0-43-generic
ip.249 is a krbd client with kernel 4.15.0-45-generic
ip.251 is a krbd client with kernel 4.15.0-45-generic

For the krbd clients, the features are "features: layering, exclusive-lock". My min_compat and require_min_compat clients are already set to luminous; however, I would love some reassurance that I'm not going to run into issues with the krbd/kcephfs clients when trying to make use of new features like the PG autoscaler, for instance. I should have full upmap compatibility, since the balancer in upmap mode has been functioning and these are relatively recent kernels. Just looking for some sanity checks to make sure I don't have any surprises for these 'jewel' clients come a Nautilus rollout.

Your krbd (0x7010fb86aa42ada) is enough for upmap.

k

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
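One hedged way to double-check that the jewel-reporting kernels really accept upmap before the Nautilus rollout is to look for upmap entries the balancer has already written and confirm those clients stay healthy (this assumes the balancer mgr module is enabled, as described above):

$ ceph balancer status
$ ceph osd dump | grep pg_upmap_items | head     # upmap exceptions already applied to the map

Kernel clients from roughly 4.13 onwards understand pg-upmap even though "ceph features" still reports them as "jewel" (the release label is derived from the full feature bitmask, not from upmap support specifically); existing pg_upmap_items entries working against these clients is the practical confirmation.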
Re: [ceph-users] Ceph pool EC with overwrite enabled
try: rbd create backup2/teste --size 5T --data-pool ec_pool

On Fri, Jul 5, 2019 at 1:49 AM, Fabio Abreu wrote:
>
> Hi everybody,
>
> I have a question about using RBD with an EC pool. I tried to use this in my
> CentOS lab, but I just receive errors when I try to create an RBD image
> inside this pool.
>
> Is this feature supported in a Luminous environment?
>
> http://docs.ceph.com/docs/mimic/rados/operations/erasure-code/#erasure-coding-with-overwrites
>
> ceph osd pool set ec_pool allow_ec_overwrites true
>
> This is the error that happens when I try to create the RBD image:
>
> [root@mon1 ceph-key]# rbd create backup2/teste --size 5T --data-pool backup2
> ...
> warning: line 9: 'osd_pool_default_crush_rule' in section 'global' redefined
> 2019-07-03 17:27:33.721593 7f12c3fff700 -1 librbd::image::CreateRequest: 0x560f2f0db0a0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
>
> Regards,
> Fabio Abreu Reis
> http://fajlinux.com.br
> Tel: +55 21 98244-0161
> Skype: fabioabreureis
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
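The underlying constraint: EC pools do not support the omap operations RBD needs for its image metadata (image directory, header, etc.), so creating an image directly in a pool that cannot take omap writes fails with (95) Operation not supported at the handle_add_image_to_directory step. The image must live in a replicated pool, with only its data objects directed to the EC pool via --data-pool. A minimal sketch, assuming "backup2" is a replicated pool and "ec_pool" the erasure-coded one (pool names and PG counts are only examples):

# ceph osd pool create ec_pool 64 64 erasure
# ceph osd pool set ec_pool allow_ec_overwrites true
# ceph osd pool create backup2 64 64 replicated
# ceph osd pool application enable backup2 rbd
# rbd create backup2/teste --size 5T --data-pool ec_pool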
[ceph-users] Understanding incomplete PGs
Hello, I'm working with a small ceph cluster (about 10TB, 7-9 OSDs, all Bluestore on lvm) and recently ran into a problem with 17 pgs marked as incomplete after adding/removing OSDs. Here's the sequence of events: 1. 7 osds in the cluster, health is OK, all pgs are active+clean 2. 3 new osds on a new host are added, lots of backfilling in progress 3. osd 6 needs to be removed, so we do "ceph osd crush reweight osd.6 0" 4. after a few hours we see "min osd.6 with 0 pgs" from "ceph osd utilization" 5. ceph osd out 6 6. systemctl stop ceph-osd@6 7. the drive backing osd 6 is pulled and wiped 8. backfilling has now finished all pgs are active+clean except for 17 incomplete pgs >From reading the docs, it sounds like there has been unrecoverable data loss in those 17 pgs. That raises some questions for me: Was "ceph osd utilization" only showing a goal of 0 pgs allocated instead of the current actual allocation? Why is there data loss from a single osd being removed? Shouldn't that be recoverable? All pools in the cluster are either replicated 3 or erasure-coded k=2,m=1 with default "host" failure domain. They shouldn't suffer data loss with a single osd being removed even if there were no reweighting beforehand. Does the backfilling temporarily reduce data durability in some way? Is there a way to see which pgs actually have data on a given osd? I attached an example of one of the incomplete pgs. Thanks for any help, Kyle{ "state": "incomplete", "snap_trimq": "[]", "snap_trimq_len": 0, "epoch": 2087, "up": [ 4, 3, 8 ], "acting": [ 4, 3, 8 ], "info": { "pgid": "15.59s0", "last_update": "753'7465", "last_complete": "753'7465", "log_tail": "663'4401", "last_user_version": 6947, "last_backfill": "MAX", "last_backfill_bitwise": 0, "purged_snaps": [], "history": { "epoch_created": 603, "epoch_pool_created": 603, "last_epoch_started": 1581, "last_interval_started": 1580, "last_epoch_clean": 945, "last_interval_clean": 944, "last_epoch_split": 0, "last_epoch_marked_full": 0, "same_up_since": 2082, "same_interval_since": 2082, "same_primary_since": 2076, "last_scrub": "753'7465", "last_scrub_stamp": "2019-07-02 13:40:58.935208", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790", "last_clean_scrub_stamp": "2019-07-02 13:40:58.935208" }, "stats": { "version": "753'7465", "reported_seq": "12691", "reported_epoch": "2087", "state": "incomplete", "last_fresh": "2019-07-04 14:30:47.930190", "last_change": "2019-07-04 14:30:47.930190", "last_active": "2019-07-03 13:04:00.967354", "last_peered": "2019-07-03 13:02:40.242867", "last_clean": "2019-07-02 23:04:26.601070", "last_became_active": "2019-07-03 08:35:12.459857", "last_became_peered": "2019-07-03 08:35:12.459857", "last_unstale": "2019-07-04 14:30:47.930190", "last_undegraded": "2019-07-04 14:30:47.930190", "last_fullsized": "2019-07-04 14:30:47.930190", "mapping_epoch": 2082, "log_start": "663'4401", "ondisk_log_start": "663'4401", "created": 603, "last_epoch_clean": 945, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "753'7465", "last_scrub_stamp": "2019-07-02 13:40:58.935208", "last_deep_scrub": "0'0", "last_deep_scrub_stamp": "2019-06-27 17:42:04.685790", "last_clean_scrub_stamp": "2019-07-02 13:40:58.935208", "log_size": 3064, "ondisk_log_size": 3064, "stats_invalid": false, "dirty_stats_invalid": false, "omap_stats_invalid": false, "hitset_stats_invalid": false, "hitset_bytes_stats_invalid": false, "pin_stats_invalid": false, "manifest_stats_invalid": false, "snaptrimq_len": 0, "stat_sum": { "num_bytes": 
12872933376, "num_objects": 3094, "num_object_clones": 0, "num_object_copies": 9282, "num_objects_missing_on_primary": 0, "num_objects_missing": 0, "num_objects_degraded": 0, "num_objects_misplaced": 0, "num_objects_unfound": 0, "num_objects_dirty": 3094, "num_whiteouts": 0, "num_read": 896, "num_read_kb": 3708, "num_write": 5870, "num_write_kb": 12567180,
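To answer the "which PGs actually have data on a given OSD" question, a couple of standard tools, sketched here with osd.6 / pg 15.59s0 as examples from the post (ceph-objectstore-tool needs the OSD stopped and its data still present, so it only helps before the drive is pulled and wiped):

# ceph pg ls-by-osd 6                                                         # PGs whose up/acting set currently includes osd.6
# ceph pg 15.59s0 query                                                       # peering detail, incl. peer_info / down_osds_we_would_probe
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 --op list-pgs    # PGs physically stored on the OSD (offline)

If "ceph osd utilization" counts PGs from the target mapping rather than from acting sets that are still being backfilled, that would explain osd.6 appearing empty while data was still moving; "ceph pg ls-by-osd" and the offline listing above are the more direct checks.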
[ceph-users] Ceph pool EC with overwrite enabled
Hi everybody,

I have a question about using RBD with an EC pool. I tried to use this in my CentOS lab, but I just receive errors when I try to create an RBD image inside this pool.

Is this feature supported in a Luminous environment?

http://docs.ceph.com/docs/mimic/rados/operations/erasure-code/#erasure-coding-with-overwrites

ceph osd pool set ec_pool allow_ec_overwrites true

This is the error that happens when I try to create the RBD image:

[root@mon1 ceph-key]# rbd create backup2/teste --size 5T --data-pool backup2
...
warning: line 9: 'osd_pool_default_crush_rule' in section 'global' redefined
2019-07-03 17:27:33.721593 7f12c3fff700 -1 librbd::image::CreateRequest: 0x560f2f0db0a0 handle_add_image_to_directory: error adding image to directory: (95) Operation not supported
rbd: create error: (95) Operation not supported

Regards,
Fabio Abreu Reis
http://fajlinux.com.br
Tel: +55 21 98244-0161
Skype: fabioabreureis

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Random slow requests without any load
Hello,

I have a very strange situation: in an almost-new Ceph cluster, I get random blocked requests, leading to timeouts. Example error from the logs:

> [WRN] Health check failed: 8 slow requests are blocked > 32 sec. Implicated osds 12 (REQUEST_SLOW)
> 7fd8bb0bd700 0 log_channel(cluster) log [WRN] : slow request 30.796124 seconds old, received at 2019-07-04 16:18:54.530388: osd_op(client.2829606.0:103 3.135 3:ac9abb76:::rbd_data.2b00ee6b8b4567.:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4096] snapc 0=[] ondisk+write+known_if_redirected e294) currently op_applied

This happens totally randomly and I'm not able to reproduce it: I never had the issue with benchmarks, but I do have it occasionally when I start or stop a VM (it's a Proxmox deployment and ceph/rbd is used as storage for the VMs) or when I use a VM. This is an example of a stuck request (from dump_ops_in_flight):

> {
>   "description": "osd_op(client.2829606.0:103 3.135 3:ac9abb76:::rbd_data.2b00ee6b8b4567.:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4096] snapc 0=[] ondisk+write+known_if_redirected e294)",
>   "initiated_at": "2019-07-04 16:18:54.530388",
>   "age": 196.315782,
>   "duration": 196.315797,
>   "type_data": {
>     "flag_point": "waiting for sub ops",
>     "client_info": { "client": "client.2829606", "client_addr": "10.3.5.40:0/444048627", "tid": 103 },
>     "events": [
>       { "time": "2019-07-04 16:18:54.530388", "event": "initiated" },
>       { "time": "2019-07-04 16:18:54.530429", "event": "queued_for_pg" },
>       { "time": "2019-07-04 16:18:54.530437", "event": "reached_pg" },
>       { "time": "2019-07-04 16:18:54.530455", "event": "started" },
>       { "time": "2019-07-04 16:18:54.530507", "event": "waiting for subops from 8" },
>       { "time": "2019-07-04 16:18:54.531020", "event": "op_commit" },
>       { "time": "2019-07-04 16:18:54.531024", "event": "op_applied" }
>     ]
>   }
> }

Since it seems to be waiting on osd 8, I tried dump_ops_in_flight and dump_historic_ops there, but there was nothing (which is quite strange, no?). The cluster has no load in general: there are no I/O errors, no requests on the disks (iostat is at 99.+% idle), no CPU usage, no ethernet usage. The OSD, and the OSD waited on in subops, are random. If I restart the target OSD, the request is unstuck. There is nothing else in the logs / dmesg except this:

> 7fd8bf8f1700 0 -- 10.3.5.41:6809/16241 >> 10.3.5.42:6813/1015314 conn(0x555ddb9db800 :6809 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 39 vs existing csq=39 existing_state=STATE_CONNECTING

But that message is not correlated with the errors, so I'm not sure it isn't just debugging output. On the network side, I had jumbo frames, but disabling them changed nothing. Just in case, I do have an LACP bond to two switches (mlag/vtl), but I don't see any network issues (heavy pings are totally fine, even over a long time).

I suspect the TCP connection of the OSD that is stuck is in a bad state for some reason, but I'm not sure what to check or how to debug this.

My ceph version is: 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

Do you have any idea/pointer/help on what the issue is / what I can try to debug / check?

Thanks a lot and have a nice day!

--
Maximilien Cuony

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
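When an op is stuck at "waiting for subops from 8", the replica OSD's admin socket and the TCP session between the two OSD hosts are the usual places to look. A rough sketch (osd.8 and the 10.3.5.42 address are taken from the output above; adjust the ss filter to your actual OSD ports):

# ceph daemon osd.8 dump_ops_in_flight
# ceph daemon osd.8 dump_blocked_ops
# ceph osd find 8                          # confirm host and addresses of the replica
# ss -tnoi dst 10.3.5.42                   # look for retransmits / stalled send-q on the inter-OSD sessions

If the replica shows nothing in flight while the primary keeps waiting, that points at the messenger/TCP session rather than the disk, which would match the observation that restarting the target OSD unsticks the request.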
Re: [ceph-users] To backport or not to backport
Hi,

On 7/4/19 3:00 PM, Stefan Kooman wrote:
> - Only backport fixes that do not introduce new functionality, but address
> (impaired) functionality already present in the release.

Ack, and also my full agreement/support for everything else you wrote, thanks. Reading in the changelogs about backported features (in particular the release that BlueStore was backported to) left me quite scared of upgrading our cluster.

Regards,
Daniel

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] To backport or not to backport
Hi,

Now that the release cadence has been set, it's time for another discussion :-).

During Ceph Day NL we had a panel q/a [1]. One of the things that was discussed was backports. Occasionally users will ask for backports of functionality in newer releases to older releases (that are still in support). Ceph is quite a unique project in the sense that new functionality gets backported to older releases. Sometimes functionality even gets changed during the lifetime of a release; I can recall the "ceph-volume" change to LVM at the beginning of the Luminous release.

While backports can enrich the user experience of a Ceph operator, they are not without risk. There have been several issues with "incomplete" backports and/or unforeseen circumstances that had the reverse effect: downtime of (part of) Ceph services. The ones that come to my mind are:

- MDS (cephfs damaged) mimic backport (13.2.2)
- RADOS (pg log hard limit) luminous / mimic backport (12.2.8 / 13.2.2)

I would like to define a simple rule for when to backport:

- Only backport fixes that do not introduce new functionality, but address (impaired) functionality already present in the release.

An example of a backport that, IMHO, matches this criterion is the "bitmap_allocator" fix. It fixed a real problem, not some corner case. Don't get me wrong here, it is important to catch corner cases, but it should not put the majority of clusters at risk.

The time and effort that might be saved with this approach can then be spent on one of the new focus areas Sage mentioned during his keynote talk at Cephalocon Barcelona: quality. Quality of the backports that are needed, and improved testing, especially for upgrades to newer releases. If upgrades are seamless, people are more willing to upgrade, because hey, it just works(tm). Upgrades should be boring.

How many clusters (not nautilus ;-)) are running with "bitmap_allocator" or with the pglog_hardlimit enabled? If a new feature is not enabled by default and it's unclear how "stable" it is to use, operators tend not to enable it, defeating the purpose of the backport.

Backporting fixes to older releases can be considered a "business opportunity" for the likes of Red Hat, SUSE, Fujitsu, etc., especially for users that want a system that "keeps on running forever" and never needs "dangerous" updates.

This is my view on the matter; please let me know what you think.

Gr. Stefan

P.s. Just to make things clear: this thread is in _no way_ intended to pick on anybody.

[1]: https://pad.ceph.com/p/ceph-day-nl-2019-panel

--
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cannot add fuse options to ceph-fuse command
Hi,

I am trying to add some FUSE options when mounting CephFS with the ceph-fuse tool, but it fails:

ceph-fuse -m 10.128.5.1,10.128.5.2,10.128.5.3 -r /test1 /cephfs/test1 -o entry_timeout=5

ceph-fuse[3857515]: starting ceph client
2019-07-04 21:55:37.767 7fc1d9cbdbc0 -1 init, newargv = 0x555d6f847490 newargc=9
fuse: unknown option `entry_timeout=5'
ceph-fuse[3857515]: fuse failed to start
2019-07-04 21:55:37.796 7fc1d9cbdbc0 -1 fuse_lowlevel_new failed

How can I pass options to FUSE?

Thank you for your precious help!

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cinder pool inaccessible after Nautilus upgrade
It appears that if the client or the OpenStack Cinder service is in the same network as Ceph, it works. In the OpenStack network it fails, but only on this particular pool! It was working well before the upgrade and no changes have been made on the network side. Very strange issue.

I checked the Ceph release notes looking for network changes but found nothing relevant. Only the biggest pool is affected: same pool config, same hosts, ACLs all open, no iptables, ... Anything else to check? We are thinking about adding a VNIC to all Ceph and OpenStack hosts in order to be in the same subnet.

Adrien

On 03/07/2019 at 13:46, Adrien Georget wrote:

Hi,

With --debug-objecter=20, I found that the rados ls command hangs, looping on laggy messages:

2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit op 0x7efc3800dc10
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target epoch 13146 base @3 precalc_pgid 1 pgid 3.100 is_read
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _calc_target target @3 -> pgid 3.100
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _calc_target raw pgid 3.100 -> actual 3.100 acting [29,12,55] primary 29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter _get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 10 client.21363886.objecter _op_submit oid '@3' '@3' [pgnls start_epoch 13146] tid 11 osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter get_session s=0x7efc380024c0 osd=29 3
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _session_op_assign 29 11
2019-07-03 13:33:24.913 7efc402f5700 15 client.21363886.objecter _send_op 11 to 3.100 on osd.29
2019-07-03 13:33:24.913 7efc402f5700 20 client.21363886.objecter put_session s=0x7efc380024c0 osd=29 4
2019-07-03 13:33:24.913 7efc402f5700 5 client.21363886.objecter 1 in flight
2019-07-03 13:33:29.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:34.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:39.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:39.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter tick
2019-07-03 13:33:44.678 7efc3e2f1700 2 client.21363886.objecter tid 11 on osd.29 is laggy
2019-07-03 13:33:44.678 7efc3e2f1700 10 client.21363886.objecter _maybe_request_map subscribing (onetime) to next osd map
2019-07-03 13:33:49.679 7efc3e2f1700 10 client.21363886.objecter tick
...

I tried to disable this OSD, but the problem just moves on to another OSD, and so on. The ceph client packages are up to date; all rbd commands still work from a monitor but not from the OpenStack controllers. And the other Ceph pool, on the same OSD hosts but on different disks, works perfectly with OpenStack...

The issue looks like these old ones, but they seem to have been fixed years ago: https://tracker.ceph.com/issues/2454 and https://tracker.ceph.com/issues/8515

Is there anything more I can check?

Adrien

On 02/07/2019 at 14:10, Adrien Georget wrote:

Hi Eugen,

The cinder keyring used by the 2 pools is the same; the rbd command works using this keyring and the ceph.conf used by OpenStack, while the rados ls command stays stuck.
I tried with the previously used ceph-common version, 10.2.5, and the latest ceph version, 14.2.1. With the Nautilus ceph-common version, the 2 cinder-volume services crashed...

Adrien

On 02/07/2019 at 13:50, Eugen Block wrote:

Hi,

did you try to use rbd and rados commands with the cinder keyring, not the admin keyring? Did you check if the caps for that client are still valid (do the caps differ between the two cinder pools)? Are the ceph versions on your hypervisors also nautilus?

Regards,
Eugen

Zitat von Adrien Georget:

Hi all,

I'm facing a very strange issue after migrating my Luminous cluster to Nautilus. I have 2 pools configured for OpenStack Cinder volumes with a multiple-backend setup: one "service" Ceph pool with cache tiering and one "R" Ceph pool. After the upgrade, the R pool became inaccessible for Cinder and the cinder-volume service using this pool can't start anymore. What is strange is that OpenStack and Ceph report no error, the Ceph cluster is healthy, all OSDs are up & running, and the "service" pool is still working well with the other cinder service on the same OpenStack host.

I followed the upgrade procedure exactly (https://ceph.com/releases/v14-2-0-nautilus-released/#upgrading-from-mimic-or-luminous); there was no problem during the upgrade, but I can't understand why Cinder still fails with this pool. I can access, list, create volumes on this pool with rbd or rados commands from the
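Given that the client-side objecter shows tid 11 stuck as "laggy" on osd.29, one basic thing worth ruling out from the OpenStack controller is plain reachability of that OSD's public address and ports; a hedged sketch (the IP is a placeholder to be filled in from "ceph osd find", and nc syntax varies by distro):

$ ceph osd find 29                                 # public address of the laggy OSD
$ ping -M do -s 8972 <osd.29 public IP>            # path-MTU check, relevant if jumbo frames are used anywhere on the path
$ nc -zv <osd.29 public IP> 6800-7300              # OSD TCP port range reachable from the controller network?

If small pings pass but large un-fragmentable pings or the TCP connections do not, that would explain why only some pools/OSDs hang from one network while the cluster itself stays healthy.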
Re: [ceph-users] slow requests due to scrubbing of very small pg
Hi Lukasz, I've seen something like that - slow requests and relevant OSD reboots on suicide timeout at least twice with two different clusters. The root cause was slow omap listing for some objects which had started to happen after massive removals from RocksDB. To verify if this is the case you can create a script that uses ceph-objectstore-tool to list objects for the specific pg and then list-omap for every returned object. If omap listing for some object(s) takes too long (minutes in my case) - you're facing the same issue. PR that implements automatic lookup for such "slow" objects in ceph-objectstore-tool is under review: https://github.com/ceph/ceph/pull/27985 The only known workaround for existing OSDs so far is manual DB compaction. And https://github.com/ceph/ceph/pull/27627 hopefully fixes the issue for newly deployed OSDs. Relevant upstream tickets are: http://tracker.ceph.com/issues/36482 http://tracker.ceph.com/issues/40557 Hope this helps, Igor On 7/3/2019 9:54 AM, Luk wrote: Hello, I have strange problem with scrubbing. When scrubbing starts on PG which belong to default.rgw.buckets.index pool, I can see that this OSD is very busy (see attachment), and starts showing many slow request, after the scrubbing of this PG stops, slow requests stops immediately. [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# zgrep scrub /var/log/ceph/ceph-osd.118.log.1.gz | grep -w 20.2 2019-07-03 00:14:57.496308 7fd4c7a09700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub starts 2019-07-03 05:36:13.274637 7fd4ca20e700 0 log_channel(cluster) log [DBG] : 20.2 deep-scrub ok [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# du -sh 20.2_* 636K20.2_head 0 20.2_TEMP [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# ls -1 -R 20.2_head | wc -l 4125 [root@stor-b02 /var/lib/ceph/osd/ceph-118/current]# and on mon: 2019-07-03 00:48:44.793893 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231090 : cluster [WRN] Health check failed: 105 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW) 2019-07-03 00:48:54.086446 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231097 : cluster [WRN] Health check update: 102 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW) 2019-07-03 00:48:59.088240 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6231099 : cluster [WRN] Health check update: 91 slow requests are blocked > 32 sec. Implicated osds 118 (REQUEST_SLOW) [...] 2019-07-03 05:36:19.695586 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243211 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 23 slow requests are blocked > 32 sec. Implicated osds 118) 2019-07-03 05:36:19.695700 mon.ceph-mon-01 mon.0 10.10.8.221:6789/0 6243212 : cluster [INF] Cluster is now healthy ceph version 12.2.9 it might be related to this (taken from: https://ceph.com/releases/v12-2-11-luminous-released/) ? : " There have been fixes to RGW dynamic and manual resharding, which no longer leaves behind stale bucket instances to be removed manually. For finding and cleaning up older instances from a reshard a radosgw-admin command reshard stale-instances list and reshard stale-instances rm should do the necessary cleanup. " ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
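For reference, a rough sketch of the per-object omap check Igor describes (the OSD must be stopped while ceph-objectstore-tool runs; the data path and pgid are taken from Luk's output, add --journal-path for a FileStore OSD with a separate journal, and the loop is only an illustration):

# systemctl stop ceph-osd@118
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-118 --pgid 20.2 --op list > /tmp/pg20.2-objects
# while read -r obj; do
#   echo "== $obj"
#   time ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-118 "$obj" list-omap > /dev/null
# done < /tmp/pg20.2-objects
# systemctl start ceph-osd@118

If one of the bucket index objects takes minutes to list, that is the "slow omap" pattern Igor describes, and manual DB/omap compaction on that OSD (the exact command depends on the release and objectstore backend) is the workaround he mentions.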
Re: [ceph-users] troubleshooting space usage
Thanks for trying to help, Igor. > From: "Igor Fedotov" > To: "Andrei Mikhailovsky" > Cc: "ceph-users" > Sent: Thursday, 4 July, 2019 12:52:16 > Subject: Re: [ceph-users] troubleshooting space usage > Yep, this looks fine.. > hmm... sorry, but I'm out of ideas what's happening.. > Anyway I think ceph reports are more trustworthy than rgw ones. Looks like > some > issue with rgw reporting or may be some object leakage. > Regards, > Igor > On 7/3/2019 6:34 PM, Andrei Mikhailovsky wrote: >> Hi Igor. >> The numbers are identical it seems: >> .rgw.buckets 19 15 TiB 78.22 4.3 TiB 8786934 >> # cat /root/ceph-rgw.buckets-rados-ls-all |wc -l >> 8786934 >> Cheers >>> From: "Igor Fedotov" [ mailto:ifedo...@suse.de | ] >>> To: "andrei" [ mailto:and...@arhont.com | ] >>> Cc: "ceph-users" [ mailto:ceph-users@lists.ceph.com | >>> ] >>> Sent: Wednesday, 3 July, 2019 13:49:02 >>> Subject: Re: [ceph-users] troubleshooting space usage >>> Looks fine - comparing bluestore_allocated vs. bluestore_stored shows a >>> little >>> difference. So that's not the allocation overhead. >>> What's about comparing object counts reported by ceph and radosgw tools? >>> Igor. >>> On 7/3/2019 3:25 PM, Andrei Mikhailovsky wrote: Thanks Igor, Here is a link to the ceph perf data on several osds. [ https://paste.ee/p/IzDMy | https://paste.ee/p/IzDMy ] In terms of the object sizes. We use rgw to backup the data from various workstations and servers. So, the sizes would be from a few kb to a few gig per individual file. Cheers > From: "Igor Fedotov" [ mailto:ifedo...@suse.de | ] > To: "andrei" [ mailto:and...@arhont.com | ] > Cc: "ceph-users" [ mailto:ceph-users@lists.ceph.com | > ] > Sent: Wednesday, 3 July, 2019 12:29:33 > Subject: Re: [ceph-users] troubleshooting space usage > Hi Andrei, > Additionally I'd like to see performance counters dump for a couple of > HDD OSDs > (obtained through 'ceph daemon osd.N perf dump' command). > W.r.t average object size - I was thinking that you might know what > objects had > been uploaded... If not then you might want to estimate it by using > "rados get" > command on the pool: retrieve some random object set and check their > sizes. But > let's check performance counters first - most probably they will show > loses > caused by allocation. > Also I've just found similar issue (still unresolved) in our internal > tracker - > but its root cause is definitely different from allocation overhead. > Looks like > some orphaned objects in the pool. Could you please compare and share the > amounts of objects in the pool reported by "ceph (or rados) df detail" and > radosgw tools? > Thanks, > Igor > On 7/3/2019 12:56 PM, Andrei Mikhailovsky wrote: >> Hi Igor, >> Many thanks for your reply. Here are the details about the cluster: >> 1. Ceph version - 13.2.5-1xenial (installed from Ceph repository for >> ubuntu >> 16.04) >> 2. main devices for radosgw pool - hdd. we do use a few ssds for the >> other pool, >> but it is not used by radosgw >> 3. we use BlueStore >> 4. Average rgw object size - I have no idea how to check that. Couldn't >> find a >> simple answer from google either. Could you please let me know how to >> check >> that? >> 5. Ceph osd df tree: >> 6. 
Other useful info on the cluster: >> # ceph osd df tree >> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME >> -1 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - root uk >> -5 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - datacenter ldex >> -11 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - room ldex-dc3 >> -13 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - row row-a >> -4 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - rack ldex-rack-a5 >> -2 28.04495 - 28 TiB 22 TiB 6.2 TiB 77.96 0.98 - host arh-ibstorage1-ib >> 0 hdd 2.73000 0.7 2.8 TiB 2.3 TiB 519 GiB 81.61 1.03 145 osd.0 >> 1 hdd 2.73000 1.0 2.8 TiB 1.9 TiB 847 GiB 70.00 0.88 130 osd.1 >> 2 hdd 2.73000 1.0 2.8 TiB 2.2 TiB 561 GiB 80.12 1.01 152 osd.2 >> 3 hdd 2.73000 1.0 2.8 TiB 2.3 TiB 469 GiB 83.41 1.05 160 osd.3 >> 4 hdd 2.73000 1.0 2.8 TiB 1.8 TiB 983 GiB 65.18 0.82 141 osd.4 >> 32 hdd 5.45999 1.0 5.5 TiB 4.4 TiB 1.1 TiB 80.68 1.02 306 osd.32 >> 35 hdd 2.73000 1.0 2.8 TiB 1.7 TiB 1.0 TiB 62.89 0.79 126 osd.35 >> 36 hdd 2.73000 1.0 2.8 TiB 2.3 TiB 464 GiB 83.58 1.05 175 osd.36 >> 37 hdd 2.73000 0.8 2.8 TiB 2.5 TiB 301 GiB 89.34 1.13 160 osd.37 >> 5 ssd 0.74500 1.0 745 GiB 642 GiB 103 GiB 86.15 1.09 65 osd.5 >> -3 28.04495 - 28 TiB 24 TiB 4.5 TiB 84.03 1.06 - host arh-ibstorage2-ib >> 9 hdd 2.73000 0.95000 2.8 TiB 2.4 TiB 405 GiB 85.65 1.08 158 osd.9 >> 10 hdd 2.73000 0.8 2.8 TiB 2.4 TiB
Re: [ceph-users] troubleshooting space usage
Yep, this looks fine.. hmm... sorry, but I'm out of ideas what's happening.. Anyway I think ceph reports are more trustworthy than rgw ones. Looks like some issue with rgw reporting or may be some object leakage. Regards, Igor On 7/3/2019 6:34 PM, Andrei Mikhailovsky wrote: Hi Igor. The numbers are identical it seems: .rgw.buckets 19 15 TiB 78.22 4.3 TiB *8786934* # cat /root/ceph-rgw.buckets-rados-ls-all |wc -l *8786934* Cheers *From: *"Igor Fedotov" *To: *"andrei" *Cc: *"ceph-users" *Sent: *Wednesday, 3 July, 2019 13:49:02 *Subject: *Re: [ceph-users] troubleshooting space usage Looks fine - comparing bluestore_allocated vs. bluestore_stored shows a little difference. So that's not the allocation overhead. What's about comparing object counts reported by ceph and radosgw tools? Igor. On 7/3/2019 3:25 PM, Andrei Mikhailovsky wrote: Thanks Igor, Here is a link to the ceph perf data on several osds. https://paste.ee/p/IzDMy In terms of the object sizes. We use rgw to backup the data from various workstations and servers. So, the sizes would be from a few kb to a few gig per individual file. Cheers *From: *"Igor Fedotov" *To: *"andrei" *Cc: *"ceph-users" *Sent: *Wednesday, 3 July, 2019 12:29:33 *Subject: *Re: [ceph-users] troubleshooting space usage Hi Andrei, Additionally I'd like to see performance counters dump for a couple of HDD OSDs (obtained through 'ceph daemon osd.N perf dump' command). W.r.t average object size - I was thinking that you might know what objects had been uploaded... If not then you might want to estimate it by using "rados get" command on the pool: retrieve some random object set and check their sizes. But let's check performance counters first - most probably they will show loses caused by allocation. Also I've just found similar issue (still unresolved) in our internal tracker - but its root cause is definitely different from allocation overhead. Looks like some orphaned objects in the pool. Could you please compare and share the amounts of objects in the pool reported by "ceph (or rados) df detail" and radosgw tools? Thanks, Igor On 7/3/2019 12:56 PM, Andrei Mikhailovsky wrote: Hi Igor, Many thanks for your reply. Here are the details about the cluster: 1. Ceph version - 13.2.5-1xenial (installed from Ceph repository for ubuntu 16.04) 2. main devices for radosgw pool - hdd. we do use a few ssds for the other pool, but it is not used by radosgw 3. we use BlueStore 4. Average rgw object size - I have no idea how to check that. Couldn't find a simple answer from google either. Could you please let me know how to check that? 5. Ceph osd df tree: 6. Other useful info on the cluster: # ceph osd df tree ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME -1 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - root uk -5 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - datacenter ldex -11 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - room ldex-dc3 -13 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - row row-a -4 112.17979 - 113 TiB 90 TiB 23 TiB 79.25 1.00 - rack ldex-rack-a5 -2 28.04495 - 28 TiB 22 TiB 6.2 TiB 77.96 0.98 - host arh-ibstorage1-ib 0 hdd 2.73000 0.7 2.8 TiB 2.3 TiB 519 GiB 81.61 1.03 145 osd.0 1 hdd 2.73000 1.0 2.8 TiB 1.9 TiB 847 GiB 70.00 0.88 130 osd.1 2 hdd 2.73000 1.0 2.8 TiB 2.2 TiB 561 GiB 80.12 1.01 152 osd.2 3 hdd 2.73000 1.0 2.8 TiB 2.3 TiB 469 GiB 83.41 1.05 160 osd.3 4 hdd 2.73000 1.0 2.8 TiB 1.8 TiB 983 GiB 65.18 0.82 141 osd.4 32 hdd 5.45999 1.0 5.5 TiB 4.4 TiB 1.1 TiB 80.68
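Two further rgw-side checks that often explain a mismatch between what ceph df reports for the bucket pool and what radosgw accounting shows — a hedged sketch (pool and job names are only examples, and an orphans search can take a long time on a 15 TiB pool):

# radosgw-admin gc list --include-all | wc -l                          # objects still waiting for garbage collection
# radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphans-check
# radosgw-admin orphans finish --job-id=orphans-check                   # clean up the search metadata afterwards

A large gc backlog or a long orphan list would fit Igor's suspicion of leaked objects rather than an accounting bug in rgw.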
Re: [ceph-users] Two clusters in one network
There's nothing special about layer 2 networks in Ceph, so yeah that's as valid as any other network setup. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Thu, Jul 4, 2019 at 10:14 AM Jarek wrote: > > Are two clusters in one layer2 network safe in production use? > The goal is a rbd-mirror between them. > > -- > Pozdrawiam > Jarosław Mociak - Nettelekom GK Sp. z o.o. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
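For what it's worth, the separation between two clusters on the same segment comes from their fsids, monitor addresses and keyrings rather than from the network itself. A minimal sketch of addressing the second cluster by name for rbd-mirror purposes (the cluster name "backup" and the pool name "rbd" are assumptions, and mirroring has to be enabled on both sides):

# ls /etc/ceph/
ceph.conf  ceph.client.admin.keyring  backup.conf  backup.client.admin.keyring
# ceph --cluster backup -s                                   # talk to the second cluster by its conf/name
# rbd mirror pool enable rbd pool
# rbd --cluster backup mirror pool peer add rbd client.admin@ceph

This is the same cluster-naming scheme the rbd-mirror daemon uses to reach both the local and the remote cluster.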
[ceph-users] Two clusters in one network
Are two clusters in one layer-2 network safe for production use? The goal is rbd-mirror between them.

--
Regards
Jarosław Mociak - Nettelekom GK Sp. z o.o.

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] pgs not deep-scrubbed in time
Hi,

thanks for your quick answer. This option is set to false:

root@heku1 ~# ceph daemon osd.1 config get osd_scrub_auto_repair
{
    "osd_scrub_auto_repair": "false"
}

Best regards
Alex

On 03.07.2019 at 15:42, Paul Emmerich wrote:
> auto repair enabled

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
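For anyone chasing the same "pgs not deep-scrubbed in time" warning, a small sketch of the usual next steps (the pgid is a placeholder, and "ceph config set" assumes Mimic/Nautilus-style centralized config; older releases would use injectargs or ceph.conf):

# ceph health detail | grep 'not deep-scrubbed'          # list the affected PGs
# ceph pg deep-scrub <pgid>                              # kick an overdue PG manually
# ceph config set osd osd_deep_scrub_interval 1209600    # e.g. widen the interval to 14 days if the hardware can't keep up

Whether to widen the interval or to speed scrubbing up (osd_max_scrubs, scrub sleep settings) depends on why the cluster is falling behind in the first place.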