[ceph-users] assertion error trying to start mds server
I've been in the process of updating my Gentoo-based cluster, both with new hardware and a somewhat postponed software update. This includes some major changes, including the switch from gcc 4.x to 5.4.0 on existing hardware and using gcc 6.4.0 to make better use of AMD Ryzen on the new hardware. The existing cluster was on 10.2.2, but I was going to 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin transitioning the OSDs to BlueStore. The Ryzen units are slated to be BlueStore-based OSD servers if and when I get to that point; up until the mds failure, they were simply CephFS clients.

I had three OSD servers updated to 10.2.7-r1 (one is also a MON) and two servers left to update. Both of those are also MONs and were acting as a pair of dual active MDS servers running 10.2.2. Monday morning I found out the hard way that the UPS one of them was on has a dead battery. After I fsck'd and came back up, I saw the following assertion error when it was trying to start its mds.B server:

mdsbeacon(64162/B up:replay seq 3 v4699) v7 126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
5: (()+0x74a4) [0x7f6fd009b4a4]
6: (clone()+0x6d) [0x7f6fce5a598d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 newstore 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 kinetic 1/ 5 fuse -2/-2 (syslog threshold) -1/-1 (stderr threshold) max_recent 1 max_new 1000 log_file /var/log/ceph/ceph-mds.B.log

When I was googling around, I ran into this CERN presentation and first tried out the offline backward scrubbing commands on slide 25:

https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf

Both ran without any messages, so I'm assuming I have sane contents in the cephfs_data and cephfs_metadata pools. Still no luck getting things restarted, so I tried the cephfs-journal-tool journal reset on slide 23. That didn't work either.

Just for giggles, I tried setting up the two Ryzen boxes as new mds.C and mds.D servers, which would run 10.2.7-r1, instead of using mds.A and mds.B (10.2.2).
The D server fails with the same assert, as follows:

=== 132+0+1979520 (4198351460 0 1611007530) 0x7fffc4000a70 con 0x7fffe0013310
0> 2017-10-09 13:01:31.571195 7fffd99f5700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fffd99f5700 time 2017-10-09 13:01:31.570608
mds/journal.cc: 2949: FAILED assert(mds->sessionmap.get_version() == cmapv)
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55b7ebc8]
2: (EImportStart::replay(MDSRank*)+0x9ea) [0x55a5674a]
3: (MDLog::_replay_thread()+0xe51) [0x559cef21]
4: (MDLog::ReplayThread::entry()+0xd) [0x557778cd]
5: (()+0x7364) [0x77bc5364]
6: (clone()+0x6d) [0x76051ccd]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
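For reference, the tools on those CERN slides come from the upstream CephFS disaster-recovery documentation, and the documented sequence pairs the journal reset with a session-table reset, which lines up with the sessionmap-version assert above. A hedged sketch of that sequence (not a tested procedure for this cluster; it assumes a single-rank filesystem and stopped MDS daemons, and the backup filename is arbitrary):

```shell
# Hedged sketch of the documented CephFS disaster-recovery sequence
# (Jewel/Luminous era); take a backup before modifying anything.
cephfs-journal-tool journal export backup.bin       # save the journal first
cephfs-journal-tool event recover_dentries summary  # salvage journal events into the metadata pool
cephfs-journal-tool journal reset                   # then discard the unreplayable journal
cephfs-table-tool all reset session                 # reset the session table; a bare journal
                                                    # reset can leave replay tripping over
                                                    # sessionmap-version mismatches
```

Slide 23's `journal reset` alone skips that last step, which may be why the assert persisted here.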
Re: [ceph-users] min_size & hybrid OSD latency
Hello,

On Wed, 11 Oct 2017 00:05:26 +0200 Jack wrote:
> Hi,
>
> I would like some information about the following.
>
> Let's say I have a running cluster with 4 OSDs: 2 SSDs and 2 HDDs.
> My single pool has size=3, min_size=2.
>
> For a write-only pattern, I thought I would get SSD-level performance,
> because the write would be acked as soon as min_size OSDs acked.
>
> Am I right?

You're the 2nd person in very recent times to come up with that wrong conclusion about min_size. All writes have to be ACKed; the only time hybrid setups help is in accelerating reads. Which is something that people like me have very little interest in, as it's the writes that need to be fast.

Christian

> (the same setup could involve some high-latency OSDs, in the case of a
> country-level cluster)

--
Christian Balzer    Network/Systems Engineer
ch...@gol.com       Rakuten Communications
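Christian's point can be shown with a toy model (latency numbers below are invented, not measured): since every replica in the acting set must ack before the client's write completes, the commit latency is the maximum over all `size` replicas, not the `min_size`-th fastest.

```shell
# Toy model with invented latencies: 2 ms per SSD replica, 12 ms per HDD replica.
# For a size=3 write landing on [ssd, ssd, hdd], the ack waits for the slowest.
ssd_ms=2
hdd_ms=12
write_ack_ms=$(printf '%s\n' "$ssd_ms" "$ssd_ms" "$hdd_ms" | sort -n | tail -n 1)
echo "write acked after ${write_ack_ms} ms"   # the HDD dominates, min_size=2 notwithstanding
```

So with even one HDD in the acting set, a hybrid pool writes at HDD speed.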
[ceph-users] RGW flush_read_list error
In Luminous 12.2.1, when running a GET on a large (1 GB) file repeatedly for an hour from RGW, the following error was hit intermittently a number of times. The first error was hit after 45 minutes, and then the error happened frequently for the remainder of the test.

ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

Here is some more context from the rgw log around one of the failures:

2017-10-10 18:20:32.321681 I | rgw: 2017-10-10 18:20:32.321643 7f8929f41700 1 civetweb: 0x55bd25899000: 10.32.0.1 - - [10/Oct/2017:18:19:07 +] "GET /bucket100/testfile.tst HTTP/1.1" 1 0 - aws-sdk-java/1.9.0 Linux/4.4.0-93-generic OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
2017-10-10 18:20:32.383855 I | rgw: 2017-10-10 18:20:32.383786 7f8924736700 1 == starting new request req=0x7f892472f140 =
2017-10-10 18:20:46.605668 I | rgw: 2017-10-10 18:20:46.605576 7f894af83700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5
2017-10-10 18:20:46.605934 I | rgw: 2017-10-10 18:20:46.605914 7f894af83700 1 == req done req=0x7f894af7c140 op status=-5 http_status=200 ==
2017-10-10 18:20:46.606249 I | rgw: 2017-10-10 18:20:46.606225 7f8924736700 0 ERROR: flush_read_list(): d->client_cb->handle_data() returned -5

I don't see anything else standing out in the log. The object store was configured with an erasure-coded data pool with k=2 and m=1.

There are a number of threads around this, but I don't see a resolution. Is there a tracking issue for this?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007756.html
https://www.spinics.net/lists/ceph-users/msg16117.html
https://www.spinics.net/lists/ceph-devel/msg37657.html

Here's our tracking Rook issue.
https://github.com/rook/rook/issues/1067

Thanks,
Travis

On 10/10/17, 3:05 PM, "ceph-users on behalf of Jack" wrote:
>Hi,
>
>I would like some information about the following
>
>Let say I have a running cluster, with 4 OSDs: 2 SSDs, and 2 HDDs
>My single pool has size=3, min_size=2
>
>For a write-only pattern, I thought I would get SSDs performance level,
>because the write would be acked as soon as min_size OSDs acked
>
>But I am right ?
>
>(the same setup could involve some high latency OSDs, in the case of
>country-level cluster)
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] min_size & hybrid OSD latency
Hi,

I would like some information about the following.

Let's say I have a running cluster with 4 OSDs: 2 SSDs and 2 HDDs.
My single pool has size=3, min_size=2.

For a write-only pattern, I thought I would get SSD-level performance, because the write would be acked as soon as min_size OSDs acked.

Am I right?

(The same setup could involve some high-latency OSDs, in the case of a country-level cluster.)
Re: [ceph-users] right way to recover a failed OSD (disk) when using BlueStore ?
Hi,

I see some notes there that didn't exist for Jewel:
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd

In my case, what I'm using right now on that OSD is this:

root@ndc-cl-osd4:~# ls -lsah /var/lib/ceph/osd/ceph-104
total 64K
0 drwxr-xr-x 2 ceph ceph 310 Sep 21 10:56 .
4.0K drwxr-xr-x 25 ceph ceph 4.0K Sep 21 10:56 ..
0 lrwxrwxrwx 1 ceph ceph 58 Sep 21 10:30 block -> /dev/disk/by-partuuid/0ffa3ed7-169f-485c-9170-648ce656e9b1
0 lrwxrwxrwx 1 ceph ceph 58 Sep 21 10:30 block.db -> /dev/disk/by-partuuid/5873e2cb-3c26-4a7d-8ff1-1bc3e2d62e5a
0 lrwxrwxrwx 1 ceph ceph 58 Sep 21 10:30 block.wal -> /dev/disk/by-partuuid/aed9e5e4-c798-46b5-8243-e462e74f6485

block.db and block.wal are on two different NVMe partitions, which are nvme1n1p17 and nvme1n1p18. So, assuming the drive letter after hot-swapping the device is "sdx", what, according to the link above, would be the right command to re-use the two NVMe partitions for block.db and block.wal? I presume that everything else is the same.

best.

On Sat, Sep 30, 2017 at 9:00 PM, David Turner wrote:
> I'm pretty sure that the process is the same as with filestore. The
> cluster doesn't really know if an osd is filestore or bluestore... It's
> just an osd running a daemon.
>
> If there are any differences, they would be in the release notes for
> Luminous as changes from Jewel.
>
> On Sat, Sep 30, 2017, 6:28 PM Alejandro Comisario wrote:
>> Hi all.
>> Independently of the fact that I've deployed a Ceph Luminous cluster with
>> BlueStore using ceph-ansible (https://github.com/ceph/ceph-ansible), what
>> is the right way to replace a disk when using BlueStore?
>>
>> I will try to forget everything I know about how to recover things with
>> filestore and start fresh.
>>
>> Any how-tos? Experiences? I don't seem to find an official way of doing
>> it.
>> best.
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
>> www.nubeliu.com

--
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
www.nubeliu.com
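For what it's worth, with Luminous ceph-disk the re-use of existing db/wal partitions looks roughly like the sketch below. The device names and OSD id are taken from the mail above, and the flags are my reading of the Luminous ceph-disk options, so treat this as a sketch to verify against the linked replacement doc, not a tested procedure:

```shell
# Hedged sketch: replace a failed BlueStore data disk while re-using the
# existing NVMe db/wal partitions (names from this thread; verify first).
ceph osd destroy 104 --yes-i-really-mean-it   # keep the osd id and cephx key
ceph-disk zap /dev/sdx                        # wipe the replacement data disk
ceph-disk prepare --bluestore /dev/sdx \
    --block.db /dev/nvme1n1p17 \
    --block.wal /dev/nvme1n1p18 \
    --osd-id 104
```

The old db/wal partitions would presumably need re-initialising along with the new block device; they cannot carry state for a destroyed OSD.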
Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory
On Tue, Oct 10, 2017 at 5:40 PM, Shawfeng Dong wrote:
> Hi Yoann,
>
> I confirm too that your recipe works!
>
> We run CentOS 7:
> [root@pulpo-admin ~]# uname -r
> 3.10.0-693.2.2.el7.x86_64
>
> Here were the old caps for user 'hydra':
> # ceph auth get client.hydra
> exported keyring for client.hydra
> [client.hydra]
> key = AQ==
> caps mds = "allow rw"
> caps mgr = "allow r"
> caps mon = "allow r"
> caps osd = "allow rw"
>
> Our CephFS name is 'pulpos'. When I tried to restrict CephFS client
> capabilities:
> # ceph fs authorize pulpos client.hydra /hydra rw
> I got this error:
> Error EINVAL: key for client.hydra exists but cap mds does not match
>
> In retrospect, the error means exactly what it says: the user caps and
> CephFS client caps must match! You can't *restrict* (narrow down) user caps
> with 'ceph fs authorize'.
>
> For example, this won't work (I can't give the 'rw' cap to all pools and
> then restrict it):
> # ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw' mds 'allow rw path=/hydra'
> updated caps for client.hydra
> # ceph fs authorize pulpos client.hydra /hydra rw
> Error EINVAL: key for client.hydra exists but cap osd does not match
>
> I find only this works:
> # ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw pool=cephfs_data' mds 'allow rw path=/hydra'
> updated caps for client.hydra
> # ceph fs authorize pulpos client.hydra /hydra rw
> [client.hydra]
> key = AQ==
>
> But I still have 2 lingering questions:
> 1. If the user caps and CephFS client caps must match, why do we need 2
> commands ('ceph auth' & 'ceph fs authorize')? The first one is sufficient.

The "fs authorize" command is a new thing that's meant to make it easier, so that people don't have to know the syntax of the path= stuff, and so that they don't have to manually list out their data pools etc.

> 2. We only give the 'rw' cap to the data pool and it works. Why is it
> unnecessary to give the 'rw' cap to the metadata pool?
Clients never need access to the metadata pool. Was something giving them access?

John

> Best regards,
> Shaw
>
> [...]
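Putting the advice from earlier in the thread ("remove your client.hydra user and try again") into concrete commands, with the user and filesystem names from this thread (sketch only, verify against your own setup):

```shell
# Sketch: let "ceph fs authorize" create the caps itself instead of
# fighting pre-existing ones (client/fs names from this thread).
ceph auth del client.hydra                       # remove the conflicting user
ceph fs authorize pulpos client.hydra /hydra rw  # recreate it with matching caps
ceph auth get client.hydra                       # verify: mds "allow rw path=/hydra",
                                                 # osd restricted to the data pool
```

This sidesteps the "key exists but cap ... does not match" error entirely, since the command never has to reconcile with hand-written caps.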
Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory
Hi Yoann,

I confirm too that your recipe works!

We run CentOS 7:
[root@pulpo-admin ~]# uname -r
3.10.0-693.2.2.el7.x86_64

Here were the old caps for user 'hydra':
# ceph auth get client.hydra
exported keyring for client.hydra
[client.hydra]
key = AQ==
caps mds = "allow rw"
caps mgr = "allow r"
caps mon = "allow r"
caps osd = "allow rw"

Our CephFS name is 'pulpos'. When I tried to restrict CephFS client capabilities:
# ceph fs authorize pulpos client.hydra /hydra rw
I got this error:
Error EINVAL: key for client.hydra exists but cap mds does not match

In retrospect, the error means exactly what it says: the user caps and CephFS client caps must match! You can't *restrict* (narrow down) user caps with 'ceph fs authorize'.

For example, this won't work (I can't give the 'rw' cap to all pools and then restrict it):
# ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw' mds 'allow rw path=/hydra'
updated caps for client.hydra
# ceph fs authorize pulpos client.hydra /hydra rw
Error EINVAL: key for client.hydra exists but cap osd does not match

I find only this works:
# ceph auth caps client.hydra mon 'allow r' mgr 'allow r' osd 'allow rw pool=cephfs_data' mds 'allow rw path=/hydra'
updated caps for client.hydra
# ceph fs authorize pulpos client.hydra /hydra rw
[client.hydra]
key = AQ==

But I still have 2 lingering questions:
1. If the user caps and CephFS client caps must match, why do we need 2 commands ('ceph auth' & 'ceph fs authorize')? The first one is sufficient.
2. We only give the 'rw' cap to the data pool and it works. Why is it unnecessary to give the 'rw' cap to the metadata pool?

Best regards,
Shaw

On Tue, Oct 10, 2017 at 4:20 AM, Yoann Moulin wrote:
> >> I am trying to follow the instructions at:
> >> http://docs.ceph.com/docs/master/cephfs/client-auth/
> >> to restrict a client to a subdirectory of a Ceph filesystem, but I always
> >> get an error.
> >>
> >> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
> >> servers.
> >> The user 'hydra' has the following capabilities:
> >> # ceph auth get client.hydra
> >> exported keyring for client.hydra
> >> [client.hydra]
> >> key = AQ==
> >> caps mds = "allow rw"
> >> caps mgr = "allow r"
> >> caps mon = "allow r"
> >> caps osd = "allow rw"
> >>
> >> When I tried to restrict the client to only mount and work within the
> >> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
> >> # ceph fs authorize pulpos client.hydra /hydra rw
> >> Error EINVAL: key for client.dong exists but cap mds does not match
> >>
> >> I've tried a few combinations of user caps and CephFS client caps; but
> >> always got the same error!
> >
> > The "fs authorize" command isn't smart enough to edit existing
> > capabilities safely, so it is cautious and refuses to overwrite what
> > is already there. If you remove your client.hydra user and try again,
> > it should create it for you with the correct capabilities.
>
> I confirm it works perfectly! It should be added to the documentation. :)
>
> # ceph fs authorize cephfs client.foo1 /foo1 rw
> [client.foo1]
> key = XXX1
> # ceph fs authorize cephfs client.foo2 / r /foo2 rw
> [client.foo2]
> key = XXX2
>
> # ceph auth get client.foo1
> exported keyring for client.foo1
> [client.foo1]
> key = XXX1
> caps mds = "allow rw path=/foo1"
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs_data"
>
> # ceph auth get client.foo2
> exported keyring for client.foo2
> [client.foo2]
> key = XXX2
> caps mds = "allow r, allow rw path=/foo2"
> caps mon = "allow r"
> caps osd = "allow rw pool=cephfs_data"
>
> Best regards,
>
> --
> Yoann Moulin
> EPFL IC-IT
Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?
Probably chooseleaf also, instead of choose.

Konrad Riedel wrote (10 October 2017 17:05:52 CEST):
>Hello Ceph-users,
>
>after switching to luminous I was excited about the great
>crush-device-class feature - now we have 5 servers with 1x2TB NVMe
>based OSDs, 3 of them additionally with 4 HDDs per server. (We have
>only three 400G NVMe disks for block.wal and block.db and therefore
>can't distribute all HDDs evenly on all servers.)
>
>Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
>the same host:
>
>ceph pg map 5.b
>osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
>(on rebooting this host I had 4 stale PGs)
>
>I've written a small perl script to add the hostname after each OSD
>number and got many PGs where ceph placed 2 replicas on the same host...:
>
>5.1e7: 8 - daniel 9 - daniel 11 - udo
>5.1eb: 10 - udo 7 - daniel 9 - daniel
>5.1ec: 10 - udo 11 - udo 7 - daniel
>5.1ed: 13 - felix 16 - felix 5 - udo
>
>Is there any way I can correct this?
>
>Please see crushmap below. Thanks for any help!
>[...]
Re: [ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?
I think the failure domain within your rules is wrong:

step choose firstn 0 type osd

should be:

step choose firstn 0 type host

On 10/10/2017 5:05 PM, Konrad Riedel wrote:
> Hello Ceph-users,
>
> after switching to luminous I was excited about the great
> crush-device-class feature - now we have 5 servers with 1x2TB NVMe
> based OSDs, 3 of them additionally with 4 HDDs per server. (We have
> only three 400G NVMe disks for block.wal and block.db and therefore
> can't distribute all HDDs evenly on all servers.)
>
> Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on
> the same host:
>
> ceph pg map 5.b
> osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8]
>
> (on rebooting this host I had 4 stale PGs)
>
> I've written a small perl script to add the hostname after each OSD
> number and got many PGs where ceph placed 2 replicas on the same host...:
>
> 5.1e7: 8 - daniel 9 - daniel 11 - udo
> 5.1eb: 10 - udo 7 - daniel 9 - daniel
> 5.1ec: 10 - udo 11 - udo 7 - daniel
> 5.1ed: 13 - felix 16 - felix 5 - udo
>
> Is there any way I can correct this?
>
> Please see crushmap below. Thanks for any help!
> [...]
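A toy way to see the failure-domain problem (the OSD-to-host mapping is copied from the crushmap in this thread): with `step choose firstn 0 type osd` there is nothing forcing the chosen OSDs onto distinct hosts, which is exactly what the pg 5.b mapping [9,7,8] shows.

```shell
# Map the OSDs of pg 5.b to their hosts (hosts from the posted crushmap)
# and count distinct hosts; "choose ... type osd" lets this collapse to one.
host_of() {
    case "$1" in
        7|8|9)       echo daniel ;;
        0|13|14|16)  echo felix  ;;
        5|10|11|12)  echo udo    ;;
    esac
}
distinct=$(for osd in 9 7 8; do host_of "$osd"; done | sort -u | wc -l)
echo "pg 5.b spans ${distinct} host(s)"   # all three replicas sit on "daniel"
```

A host-level step (e.g. `step chooseleaf firstn 0 type host` after `step take default class hdd`) makes CRUSH descend distinct host buckets first, so each replica comes from a different host by construction.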
[ceph-users] All replicas of pg 5.b got placed on the same host - how to correct?
Hello Ceph-users, after switching to luminous I was excited about the great crush-device-class feature - now we have 5 servers with 1x2TB NVMe based OSDs, 3 of them additionally with 4 HDDS per server. (we have only three 400G NVMe disks for block.wal and block.db and therefore can't distribute all HDDs evenly on all servers.) Output from "ceph pg dump" shows that some PGs end up on HDD OSDs on the same Host: ceph pg map 5.b osdmap e12912 pg 5.b (5.b) -> up [9,7,8] acting [9,7,8] (on rebooting this host I had 4 stale PGs) I've written a small perl script to add hostname after OSD number and got many PGs where ceph placed 2 replicas on the same host... : 5.1e7: 8 - daniel 9 - daniel 11 - udo 5.1eb: 10 - udo 7 - daniel 9 - daniel 5.1ec: 10 - udo 11 - udo 7 - daniel 5.1ed: 13 - felix 16 - felix 5 - udo Is there any way I can correct this? Please see crushmap below. Thanks for any help! # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 1 tunable chooseleaf_vary_r 1 tunable chooseleaf_stable 1 tunable straw_calc_version 1 tunable allowed_bucket_algs 54 # devices device 0 osd.0 class hdd device 1 device1 device 2 osd.2 class ssd device 3 device3 device 4 device4 device 5 osd.5 class hdd device 6 device6 device 7 osd.7 class hdd device 8 osd.8 class hdd device 9 osd.9 class hdd device 10 osd.10 class hdd device 11 osd.11 class hdd device 12 osd.12 class hdd device 13 osd.13 class hdd device 14 osd.14 class hdd device 15 device15 device 16 osd.16 class hdd device 17 device17 device 18 device18 device 19 device19 device 20 device20 device 21 device21 device 22 device22 device 23 device23 device 24 osd.24 class hdd device 25 device25 device 26 osd.26 class hdd device 27 osd.27 class hdd device 28 osd.28 class hdd device 29 osd.29 class hdd device 30 osd.30 class ssd device 31 osd.31 class ssd device 32 osd.32 class ssd device 33 osd.33 class ssd # types type 0 osd type 1 host 
type 2 rack type 3 row type 4 room type 5 datacenter type 6 root # buckets host daniel { id -4 # do not change unnecessarily id -2 class hdd # do not change unnecessarily id -9 class ssd # do not change unnecessarily # weight 3.459 alg straw2 hash 0 # rjenkins1 item osd.31 weight 1.819 item osd.7 weight 0.547 item osd.8 weight 0.547 item osd.9 weight 0.547 } host felix { id -5 # do not change unnecessarily id -3 class hdd # do not change unnecessarily id -10 class ssd# do not change unnecessarily # weight 3.653 alg straw2 hash 0 # rjenkins1 item osd.33 weight 1.819 item osd.13 weight 0.547 item osd.14 weight 0.467 item osd.16 weight 0.547 item osd.0 weight 0.274 } host udo { id -6 # do not change unnecessarily id -7 class hdd # do not change unnecessarily id -11 class ssd# do not change unnecessarily # weight 4.006 alg straw2 hash 0 # rjenkins1 item osd.32 weight 1.819 item osd.5 weight 0.547 item osd.10 weight 0.547 item osd.11 weight 0.547 item osd.12 weight 0.547 } host moritz { id -13 # do not change unnecessarily id -14 class hdd# do not change unnecessarily id -15 class ssd# do not change unnecessarily # weight 1.819 alg straw2 hash 0 # rjenkins1 item osd.30 weight 1.819 } host bruno { id -16 # do not change unnecessarily id -17 class hdd# do not change unnecessarily id -18 class ssd# do not change unnecessarily # weight 3.183 alg straw2 hash 0 # rjenkins1 item osd.24 weight 0.273 item osd.26 weight 0.273 item osd.27 weight 0.273 item osd.28 weight 0.273 item osd.29 weight 0.273 item osd.2 weight 1.819 } root default { id -1 # do not change unnecessarily id -8 class hdd # do not change unnecessarily id -12 class ssd# do not change unnecessarily # weight 16.121 alg straw2 hash 0 # rjenkins1 item daniel weight 3.459 item felix weight 3.653 item udo weight 4.006 item moritz weight 1.819 item bruno weight 3.183 } # rules rule ssd { id 0 type replicated min_size 1 max_size 10 step take default class ssd step choose firstn 0 type osd step emit } rule hdd { id 1 
type replicated min_size 1 max_size 10 step take default class hdd step choose firstn 0 type osd step emit } # end crush map -- Kind regards
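Both rules above step straight to "type osd" ("step choose firstn 0 type osd"), so CRUSH is free to pick two OSDs from the same host; a host-level failure domain is what prevents that. A hedged sketch of the usual fix using the Luminous device-class rule helper (the rule names and pool name below are placeholders, not from this thread):

```shell
# Create replicated rules whose failure domain is "host", restricted to
# one device class (Luminous syntax: <name> <root> <failure-domain> <class>).
ceph osd crush rule create-replicated hdd-by-host default host hdd
ceph osd crush rule create-replicated ssd-by-host default host ssd

# Point the affected pool at the new rule; PGs will remap and rebalance.
ceph osd pool set <pool-name> crush_rule hdd-by-host
```

Equivalently, editing the decompiled map by hand, "step choose firstn 0 type osd" would become "step chooseleaf firstn 0 type host".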
Re: [ceph-users] Ceph-mgr summarize recovery counters
On Tue, Oct 10, 2017 at 3:50 PM, Benjeman Meekhofwrote: > Hi John, > > Thanks for the guidance! Is pg_status something we should expect to > find in Luminous (12.2.1)? It doesn't seem to exist. We do have a > 'pg_summary' object which contains a list of every PG and current > state (active, etc) but nothing about I/O. > > Calls to self.get('pg_status') in our module log: mgr get_python > Python module requested unknown data 'pg_status' Yes, it's new in master. When modules like influx & prometheus are using those calls in master though, we can backport things like the pg_status implementation at the same time as backporting the modules if we do that. John > > thanks, > Ben > > On Thu, Oct 5, 2017 at 8:42 AM, John Spray wrote: >> On Wed, Oct 4, 2017 at 7:14 PM, Gregory Farnum wrote: >>> On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof wrote: Wondering if anyone can tell me how to summarize recovery bytes/ops/objects from counters available in the ceph-mgr python interface? To put another way, how does the ceph -s command put together that infomation and can I access that information from a counter queryable by the ceph-mgr python module api? I want info like the 'recovery' part of the status output. I have a ceph-mgr module that feeds influxdb but I'm not sure what counters from ceph-mgr to summarize to create this information. OSD have available a recovery_ops counter which is not quite the same. Maybe the various 'subop_..' counters encompass recovery ops? It's not clear to me but I'm hoping it is obvious to someone more familiar with the internals. io: client: 2034 B/s wr, 0 op/s rd, 0 op/s wr recovery: 1173 MB/s, 8 keys/s, 682 objects/s >>> >>> >>> You'll need to run queries against the PGMap. I'm not sure how that >>> works in the python interfaces but I'm led to believe it's possible. >>> Documentation is probably all in the PGMap.h header; you can look at >>> functions like the "recovery_rate_summary" to see what they're doing. 
>> >> Try get("pg_status") from a python module, that should contain the >> recovery/client IO amongst other things. >> >> You may find that the fields only appear when they're nonzero, I would >> be happy to see a change that fixed the underlying functions to always >> output the fields (e.g. in PGMapDigest::recovery_rate_summary) when >> writing to a Formatter. Skipping the irrelevant stuff is only useful >> when doing plain text output. >> >> John >> >>> -Greg >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph-mgr summarize recovery counters
Hi John, Thanks for the guidance! Is pg_status something we should expect to find in Luminous (12.2.1)? It doesn't seem to exist. We do have a 'pg_summary' object which contains a list of every PG and current state (active, etc) but nothing about I/O. Calls to self.get('pg_status') in our module log: mgr get_python Python module requested unknown data 'pg_status' thanks, Ben On Thu, Oct 5, 2017 at 8:42 AM, John Spraywrote: > On Wed, Oct 4, 2017 at 7:14 PM, Gregory Farnum wrote: >> On Wed, Oct 4, 2017 at 9:14 AM, Benjeman Meekhof wrote: >>> Wondering if anyone can tell me how to summarize recovery >>> bytes/ops/objects from counters available in the ceph-mgr python >>> interface? To put another way, how does the ceph -s command put >>> together that infomation and can I access that information from a >>> counter queryable by the ceph-mgr python module api? >>> >>> I want info like the 'recovery' part of the status output. I have a >>> ceph-mgr module that feeds influxdb but I'm not sure what counters >>> from ceph-mgr to summarize to create this information. OSD have >>> available a recovery_ops counter which is not quite the same. Maybe >>> the various 'subop_..' counters encompass recovery ops? It's not >>> clear to me but I'm hoping it is obvious to someone more familiar with >>> the internals. >>> >>> io: >>> client: 2034 B/s wr, 0 op/s rd, 0 op/s wr >>> recovery: 1173 MB/s, 8 keys/s, 682 objects/s >> >> >> You'll need to run queries against the PGMap. I'm not sure how that >> works in the python interfaces but I'm led to believe it's possible. >> Documentation is probably all in the PGMap.h header; you can look at >> functions like the "recovery_rate_summary" to see what they're doing. > > Try get("pg_status") from a python module, that should contain the > recovery/client IO amongst other things. 
> > You may find that the fields only appear when they're nonzero, I would > be happy to see a change that fixed the underlying functions to always > output the fields (e.g. in PGMapDigest::recovery_rate_summary) when > writing to a Formatter. Skipping the irrelevant stuff is only useful > when doing plain text output. > > John > >> -Greg >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
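For anyone wiring this into an influx-style module before pg_status is backported, the summarization step might look like the following minimal sketch. The field names are assumptions (not confirmed in this thread), and, per John's caveat that fields may only appear when nonzero, every lookup defaults to 0:

```python
# Hypothetical sketch: flatten a pg_status-style dict into the three
# recovery rates shown by "ceph -s". The key names are assumed; the
# .get(..., 0) defaults cover zero-rate fields being omitted entirely.
def recovery_summary(pg_status):
    fields = ("recovering_bytes_per_sec",
              "recovering_keys_per_sec",
              "recovering_objects_per_sec")
    return {f: pg_status.get(f, 0) for f in fields}

# Inside a mgr module this input would come from self.get("pg_status").
print(recovery_summary({"recovering_objects_per_sec": 682}))
```

The defaulting is exactly the workaround for the Formatter behaviour John describes; if the underlying C++ is fixed to always emit the fields, the defaults become harmless.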
Re: [ceph-users] rgw resharding operation seemingly won't end
Thanks for the response Yehuda. Staus: [root@objproxy02 UMobjstore]# radosgw-admin reshard status —bucket=$bucket_name [ { "reshard_status": 1, "new_bucket_instance_id": "8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.47370206.1", "num_shards": 4 } ] I cleared the flag using the bucket check —fix command and will keep an eye on that tracker issue. Do you have any insight into why the RGWs ultimately paused/reloaded and failed to come back? I am happy to provide more information that could assist. At the moment we are somewhat nervous to reenable dynamic sharding as it seems to have contributed to this problem. Thanks, Ryan > On Oct 9, 2017, at 5:26 PM, Yehuda Sadeh-Weinraubwrote: > > On Mon, Oct 9, 2017 at 1:59 PM, Ryan Leimenstoll > wrote: >> Hi all, >> >> We recently upgraded to Ceph 12.2.1 (Luminous) from 12.2.0 however are now >> seeing issues running radosgw. Specifically, it appears an automatically >> triggered resharding operation won’t end, despite the jobs being cancelled >> (radosgw-admin reshard cancel). I have also disabled dynamic sharding for >> the time being in the ceph.conf. >> >> >> [root@objproxy02 ~]# radosgw-admin reshard list >> [] >> >> The two buckets were also reported in the `radosgw-admin reshard list` >> before our RGW frontends paused recently (and only came back after a service >> restart). These two buckets cannot currently be written to at this point >> either. 
>> >> 2017-10-06 22:41:19.547260 7f90506e9700 0 block_while_resharding ERROR: >> bucket is still resharding, please retry >> 2017-10-06 22:41:19.547411 7f90506e9700 0 WARNING: set_req_state_err >> err_no=2300 resorting to 500 >> 2017-10-06 22:41:19.547729 7f90506e9700 0 ERROR: >> RESTFUL_IO(s)->complete_header() returned err=Input/output error >> 2017-10-06 22:41:19.548570 7f90506e9700 1 == req done req=0x7f90506e3180 >> op status=-2300 http_status=500 == >> 2017-10-06 22:41:19.548790 7f90506e9700 1 civetweb: 0x55766d111000: >> $MY_IP_HERE$ - - [06/Oct/2017:22:33:47 -0400] "PUT / >> $REDACTED_BUCKET_NAME$/$REDACTED_KEY_NAME$ HTTP/1.1" 1 0 - Boto3/1.4.7 >> Python/2.7.12 Linux/4.9.43-17.3 >> 9.amzn1.x86_64 exec-env/AWS_Lambda_python2.7 Botocore/1.7.2 Resource >> [.. slightly later in the logs..] >> 2017-10-06 22:41:53.516272 7f90406c9700 1 rgw realm reloader: Frontends >> paused >> 2017-10-06 22:41:53.528703 7f907893f700 0 ERROR: failed to clone shard, >> completion_mgr.get_next() returned ret=-125 >> 2017-10-06 22:44:32.049564 7f9074136700 0 ERROR: keystone revocation >> processing returned error r=-22 >> 2017-10-06 22:59:32.059222 7f9074136700 0 ERROR: keystone revocation >> processing returned error r=-22 >> >> Can anyone advise on the best path forward to stop the current sharding >> states and avoid this moving forward? >> > > What does 'radosgw-admin reshard status --bucket=' return? > I think just manually resharding the buckets should clear this flag, > is that not an option? > manual reshard: radosgw-admin bucket reshard --bucket= > --num-shards= > > also, the 'radosgw-admin bucket check --fix' might clear that flag. > > For some reason it seems that the reshard cancellation code is not > clearing that flag on the bucket index header (pretty sure it used to > do it at one point). I'll open a tracker ticket. 
> > Thanks, > Yehuda > >> >> Some other details: >> - 3 rgw instances >> - Ceph Luminous 12.2.1 >> - 584 active OSDs, rgw bucket index is on Intel NVMe OSDs >> >> >> Thanks, >> Ryan Leimenstoll >> rleim...@umiacs.umd.edu >> University of Maryland Institute for Advanced Computer Studies >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
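For reference, the recovery path discussed in this thread, collected as commands (bucket name and shard count are placeholders):

```shell
# Inspect the per-bucket resharding flag.
radosgw-admin reshard status --bucket=<bucket>

# Clear a stale flag left behind by a cancelled reshard (Yehuda's
# suggestion); a manual reshard should also clear it:
radosgw-admin bucket check --fix --bucket=<bucket>
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<n>
```

The interim mitigation Ryan mentions corresponds to setting `rgw_dynamic_resharding = false` in the RGW section of ceph.conf and restarting the gateways; verify the option name against your release.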
Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]
On 10-10-2017 14:21, Alfredo Deza wrote:
> On Tue, Oct 10, 2017 at 8:14 AM, Willem Jan Withagen wrote:
>> On 10-10-2017 13:51, Alfredo Deza wrote:
>>> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer wrote: Hello, (pet peeve alert) On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote: > To put this in context, the goal here is to kill ceph-disk in mimic.
>>
>> Right, that means we need a ceph-volume zfs before things get shot down. Fortunately there is little history to carry over.
>>
>> But then still somebody needs to do the work. ;-| Haven't looked at ceph-volume, but I'll put it on the agenda.
>
> An interesting take on zfs (and anything else we didn't set up from the get-go) is that we envisioned developers might want to craft plugins for ceph-volume and expand its capabilities, without placing the burden of coming up with new device technology to support.
>
> The other nice aspect of this is that a plugin would get to re-use all the tooling in place in ceph-volume. The plugin architecture exists but it isn't fully developed/documented yet.

I was part of the original discussion when ceph-volume said it was going to be pluggable... And I would be a great proponent of the plugins. If only because ceph-disk is rather convoluted to add to. Not that it cannot be done, but the code is rather loaded with linuxisms for its devices. And it takes some care not to upset the old code, even to the point that code for a routine is refactored into 3 new routines: one OS selector, then the old code for Linux, and the new code for FreeBSD. And that starts to look like a poor man's plugin. :)

But still I need to find the time, and sharpen my python skills. Luckily mimic is 9 months away. :)

--WjW
Re: [ceph-users] ceph-volume: migration and disk partition support
On Tue, Oct 10, 2017 at 3:28 AM, Stefan Koomanwrote: > Hi, > > Quoting Alfredo Deza (ad...@redhat.com): >> Hi, >> >> Now that ceph-volume is part of the Luminous release, we've been able >> to provide filestore support for LVM-based OSDs. We are making use of >> LVM's powerful mechanisms to store metadata which allows the process >> to no longer rely on UDEV and GPT labels (unlike ceph-disk). >> >> Bluestore support should be the next step for `ceph-volume lvm`, and >> while that is planned we are thinking of ways to improve the current >> caveats (like OSDs not coming up) for clusters that have deployed OSDs >> with ceph-disk. > > I'm a bit confused after reading this. Just to make things clear. Would > bluestore be put on top of a LVM volume (in an ideal world)? Has > bluestore in Ceph luminious support for LVM? I.e. is there code in > bluestore to support LVM? Or is it _just_ support of `ceph-volume lvm` > for bluestore? There is currently no support in `ceph-volume lvm` for bluestore yet. It is being worked on today and should be ready soon (hopefully in the next Luminous release). And yes, in the case of `ceph-volume lvm` it means that bluestore would be "on top" of LVM volumes. > >> --- New clusters --- >> The `ceph-volume lvm` deployment is straightforward (currently >> supported in ceph-ansible), but there isn't support for plain disks >> (with partitions) currently, like there is with ceph-disk. >> >> Is there a pressing interest in supporting plain disks with >> partitions? Or only supporting LVM-based OSDs fine? > > We're still in a green field situation. Users with an installed base > will have to comment on this. If the assumption that bluestore would be > put on top of LVM is true, it would make things simpler (in our own Ceph > ansible playbook). There is already support in ceph-ansible too, which will mean that when bluestore support is added, it will be added in ceph-ansible at the same time. > > Gr. 
Stefan > > -- > | BIT BV http://www.bit.nl/Kamer van Koophandel 09090351 > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
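As a concrete sketch of the deployment being discussed (filestore today, bluestore once `ceph-volume lvm` gains support for it), assuming a volume group `ceph-vg` with logical volumes you have created yourself; the exact flag spelling should be checked against your release's ceph-volume documentation, since the tool was evolving quickly at this time:

```shell
# Filestore OSD on LVM (prepare + activate in one step). The VG/LV names
# are assumptions for illustration.
ceph-volume lvm create --filestore --data ceph-vg/data-lv --journal ceph-vg/journal-lv
```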
Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]
On Tue, Oct 10, 2017 at 8:14 AM, Willem Jan Withagenwrote: > On 10-10-2017 13:51, Alfredo Deza wrote: >> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer wrote: >>> >>> Hello, >>> >>> (pet peeve alert) >>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote: >>> To put this in context, the goal here is to kill ceph-disk in mimic. > > Right, that means we need a ceph-volume zfs before things get shot down. > Fortunately there is little history to carry over. > > But then still somebody needs to do the work. ;-| > Haven't looked at ceph-volume, but I'll put it on the agenda. An interesting take on zfs (and anything else we didn't set up from the get-go) is that we envisioned developers might want to craft plugins for ceph-volume and expand its capabilities, without placing the burden of coming up with new device technology to support. The other nice aspect of this is that a plugin would get to re-use all the tooling in place in ceph-volume. The plugin architecture exists but it isn't fully developed/documented yet. > > --WjW > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?
On 10/10/2017 02:10 PM, John Spray wrote:
> Yes.

worked, rank 6 is back and cephfs up again. thank you very much.

> Do a final ls to make sure you got all of them -- it is
> dangerous to leave any fragments behind.

will do.

> BTW opened http://tracker.ceph.com/issues/21749 for the underlying bug.

thanks; I've saved all the logs, so I'm happy to provide anything you need.

Regards,
Daniel
Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]
On 10-10-2017 13:51, Alfredo Deza wrote:
> On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> (pet peeve alert)
>> On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote:
>>
>>> To put this in context, the goal here is to kill ceph-disk in mimic.

Right, that means we need a ceph-volume zfs before things get shot down. Fortunately there is little history to carry over.

But then still somebody needs to do the work. ;-| Haven't looked at ceph-volume, but I'll put it on the agenda.

--WjW
Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?
On Tue, Oct 10, 2017 at 12:30 PM, Daniel Baumannwrote: > Hi John, > > thank you very much for your help. > > On 10/10/2017 12:57 PM, John Spray wrote: >> A) Do a "rados -p ls | grep "^506\." or similar, to >> get a list of the objects > > done, gives me these: > > 506. > 506.0017 > 506.001b > 506.0019 > 506.001a > 506.001c > 506.0018 > 506.0016 > 506.001e > 506.001f > 506.001d > >> B) Write a short bash loop to do a "rados -p get" on >> each of those objects into a file. > > done, saved them as the object name as filename, resulting in these 11 > files: > >90 Oct 10 13:17 506. > 4.0M Oct 10 13:17 506.0016 > 4.0M Oct 10 13:17 506.0017 > 4.0M Oct 10 13:17 506.0018 > 4.0M Oct 10 13:17 506.0019 > 4.0M Oct 10 13:17 506.001a > 4.0M Oct 10 13:17 506.001b > 4.0M Oct 10 13:17 506.001c > 4.0M Oct 10 13:17 506.001d > 4.0M Oct 10 13:17 506.001e > 4.0M Oct 10 13:17 506.001f > >> C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20", >> mark the rank repaired, start the MDS again, and then gather the >> resulting log (it should end in the same "Error -22 recovering >> write_pos", but have much much more detail about what came before). > > I've attached the entire log from right before issueing "repaired" until > after the mds drops to standby again. > >> Because you've hit a serious bug, it's really important to gather all >> this and share it, so that we can try to fix it and prevent it >> happening again to you or others. > > absolutely, sure. If you need anything more, I'm happy to share. > >> You have two options, depending on how much downtime you can tolerate: >> - carefully remove all the metadata objects that start with 506. -- > > given the outtage (and people need access to their data), I'd go with > this. Just to be safe: that would go like this? > > rados -p rm 506. > rados -p rm 506.0016 Yes. Do a final ls to make sure you got all of them -- it is dangerous to leave any fragments behind. 
BTW opened http://tracker.ceph.com/issues/21749 for the underlying bug. John > [...] > > Regards, > Daniel > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
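The "short bash loop" from step (B) above might look like the following; the metadata pool name is an assumption, substitute your own:

```shell
# Dump each rank-6 purge queue object into a local file of the same name.
pool=cephfs_metadata   # assumed pool name
for obj in $(rados -p "$pool" ls | grep '^506\.'); do
    rados -p "$pool" get "$obj" "$obj"
done
```

Keeping these copies around is what makes the later "rados rm" cleanup (and any post-mortem on the bug) safe.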
Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]
On Mon, Oct 9, 2017 at 8:50 PM, Christian Balzerwrote: > > Hello, > > (pet peeve alert) > On Mon, 9 Oct 2017 15:09:29 + (UTC) Sage Weil wrote: > >> To put this in context, the goal here is to kill ceph-disk in mimic. >> >> One proposal is to make it so new OSDs can *only* be deployed with LVM, >> and old OSDs with the ceph-disk GPT partitions would be started via >> ceph-volume support that can only start (but not deploy new) OSDs in that >> style. >> >> Is the LVM-only-ness concerning to anyone? >> > If the provision below is met, not really. > >> Looking further forward, NVMe OSDs will probably be handled a bit >> differently, as they'll eventually be using SPDK and kernel-bypass (hence, >> no LVM). For the time being, though, they would use LVM. >> > And so it begins. > LVM does a lot of nice things, but not everything for everybody. > It is also another layer added with all the (minor) reductions in > performance (with normal storage, not NVMe) and of course potential bugs. > ceph-volume was crafted in a way that we wouldn't be forcing anyone to a single backend (e.g. LVM). Initially it went even further, as just being a simple orchestrator for getting devices mounted and starting the OSD with minimal configuration and *regardless* of what type of devices were being used. The current status of the LVM portion is *very* robust, although it is lacking a big chunk of feature parity with ceph-disk. I anticipate potential bugs anyway :) >> >> On Fri, 6 Oct 2017, Alfredo Deza wrote: >> > Now that ceph-volume is part of the Luminous release, we've been able >> > to provide filestore support for LVM-based OSDs. We are making use of >> > LVM's powerful mechanisms to store metadata which allows the process >> > to no longer rely on UDEV and GPT labels (unlike ceph-disk). 
>> > >> > Bluestore support should be the next step for `ceph-volume lvm`, and >> > while that is planned we are thinking of ways to improve the current >> > caveats (like OSDs not coming up) for clusters that have deployed OSDs >> > with ceph-disk. >> > >> > --- New clusters --- >> > The `ceph-volume lvm` deployment is straightforward (currently >> > supported in ceph-ansible), but there isn't support for plain disks >> > (with partitions) currently, like there is with ceph-disk. >> > >> > Is there a pressing interest in supporting plain disks with >> > partitions? Or only supporting LVM-based OSDs fine? >> >> Perhaps the "out" here is to support a "dir" option where the user can >> manually provision and mount an OSD on /var/lib/ceph/osd/*, with 'journal' >> or 'block' symlinks, and ceph-volume will do the last bits that initialize >> the filestore or bluestore OSD from there. Then if someone has a scenario >> that isn't captured by LVM (or whatever else we support) they can always >> do it manually? >> > Basically this. > Since all my old clusters were deployed like this, with no > chance/intention to upgrade to GPT or even LVM. > How would symlinks work with Bluestore, the tiny XFS bit? In this case, we are looking to allow ceph-volume to scan currently deployed OSDs, and get all the information needed and save it as a plain configuration file that will be read at boot time. That is the only other option that is not dependent on udev/ceph-disk that doesn't mean redoing an OSD from scratch. It would be a one-time operation to get out of old deployment's tie into udev/gpt/ceph-disk > >> > --- Existing clusters --- >> > Migration to ceph-volume, even with plain disk support means >> > re-creating the OSD from scratch, which would end up moving data. >> > There is no way to make a GPT/ceph-disk OSD become a ceph-volume one >> > without starting from scratch. 
>> > >> > A temporary workaround would be to provide a way for existing OSDs to >> > be brought up without UDEV and ceph-disk, by creating logic in >> > ceph-volume that could load them with systemd directly. This wouldn't >> > make them lvm-based, nor it would mean there is direct support for >> > them, just a temporary workaround to make them start without UDEV and >> > ceph-disk. >> > >> > I'm interested in what current users might look for here,: is it fine >> > to provide this workaround if the issues are that problematic? Or is >> > it OK to plan a migration towards ceph-volume OSDs? >> >> IMO we can't require any kind of data migration in order to upgrade, which >> means we either have to (1) keep ceph-disk around indefinitely, or (2) >> teach ceph-volume to start existing GPT-style OSDs. Given all of the >> flakiness around udev, I'm partial to #2. The big question for me is >> whether #2 alone is sufficient, or whether ceph-volume should also know >> how to provision new OSDs using partitions and no LVM. Hopefully not? >> > I really disliked the udev/GPT stuff from the get-go and flakiness is > being kind for sometimes completely indeterministic behavior. > Yep, forcing users to always fit one model seemed annoying to me. I understand the
Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?
Hi John, thank you very much for your help. On 10/10/2017 12:57 PM, John Spray wrote: > A) Do a "rados -p ls | grep "^506\." or similar, to > get a list of the objects done, gives me these: 506. 506.0017 506.001b 506.0019 506.001a 506.001c 506.0018 506.0016 506.001e 506.001f 506.001d > B) Write a short bash loop to do a "rados -p get" on > each of those objects into a file. done, saved them as the object name as filename, resulting in these 11 files: 90 Oct 10 13:17 506. 4.0M Oct 10 13:17 506.0016 4.0M Oct 10 13:17 506.0017 4.0M Oct 10 13:17 506.0018 4.0M Oct 10 13:17 506.0019 4.0M Oct 10 13:17 506.001a 4.0M Oct 10 13:17 506.001b 4.0M Oct 10 13:17 506.001c 4.0M Oct 10 13:17 506.001d 4.0M Oct 10 13:17 506.001e 4.0M Oct 10 13:17 506.001f > C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20", > mark the rank repaired, start the MDS again, and then gather the > resulting log (it should end in the same "Error -22 recovering > write_pos", but have much much more detail about what came before). I've attached the entire log from right before issueing "repaired" until after the mds drops to standby again. > Because you've hit a serious bug, it's really important to gather all > this and share it, so that we can try to fix it and prevent it > happening again to you or others. absolutely, sure. If you need anything more, I'm happy to share. > You have two options, depending on how much downtime you can tolerate: > - carefully remove all the metadata objects that start with 506. -- given the outtage (and people need access to their data), I'd go with this. Just to be safe: that would go like this? rados -p rm 506. rados -p rm 506.0016 [...] 
Regards, Daniel 2017-10-10 13:21:55.413752 7f3f3011a700 5 mds.mds9 handle_mds_map epoch 96224 from mon.0 2017-10-10 13:21:55.413836 7f3f3011a700 10 mds.mds9 my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=file layout v2} 2017-10-10 13:21:55.413847 7f3f3011a700 10 mds.mds9 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2} 2017-10-10 13:21:55.413852 7f3f3011a700 10 mds.mds9 map says I am 147.87.226.189:6800/1634944095 mds.6.96224 state up:replay 2017-10-10 13:21:55.414088 7f3f3011a700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap 2017-10-10 13:21:55.414141 7f3f3011a700 10 mds.mds9 handle_mds_map: initializing MDS rank 6 2017-10-10 13:21:55.414410 7f3f3011a700 10 mds.6.0 update_log_config log_to_monitors {default=true} 2017-10-10 13:21:55.414415 7f3f3011a700 10 mds.6.0 create_logger 2017-10-10 13:21:55.414635 7f3f3011a700 7 mds.6.server operator(): full = 0 epoch = 0 2017-10-10 13:21:55.414644 7f3f3011a700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap 2017-10-10 13:21:55.414648 7f3f3011a700 4 mds.6.0 handle_osd_map epoch 0, 0 new blacklist entries 2017-10-10 13:21:55.414660 7f3f3011a700 10 mds.6.server apply_blacklist: killed 0 2017-10-10 13:21:55.414830 7f3f3011a700 10 mds.mds9 handle_mds_map: handling map as rank 6 2017-10-10 13:21:55.414839 7f3f3011a700 1 mds.6.96224 handle_mds_map i am now mds.6.96224 2017-10-10 13:21:55.414843 7f3f3011a700 1 mds.6.96224 handle_mds_map state change up:boot --> up:replay 2017-10-10 13:21:55.414855 7f3f3011a700 10 mds.beacon.mds9 set_want_state: up:standby -> up:replay 2017-10-10 13:21:55.414859 7f3f3011a700 1 mds.6.96224 replay_start 2017-10-10 
13:21:55.414873 7f3f3011a700 7 mds.6.cache set_recovery_set 0,1,2,3,4,5,7,8 2017-10-10 13:21:55.414883 7f3f3011a700 1 mds.6.96224 recovery set is 0,1,2,3,4,5,7,8 2017-10-10 13:21:55.414893 7f3f3011a700 1 mds.6.96224 waiting for osdmap 18607 (which blacklists prior instance) 2017-10-10 13:21:55.414901 7f3f3011a700 4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap 2017-10-10 13:21:55.416011 7f3f3011a700 7 mds.6.server operator(): full = 0 epoch = 18608 2017-10-10 13:21:55.416024 7f3f3011a700 4 mds.6.96224 handle_osd_map epoch 18608, 0 new blacklist entries 2017-10-10 13:21:55.416027 7f3f3011a700 10 mds.6.server apply_blacklist: killed 0 2017-10-10 13:21:55.416076 7f3f2a10e700 10 MDSIOContextBase::complete: 12C_IO_Wrapper 2017-10-10 13:21:55.416095 7f3f2a10e700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart 2017-10-10 13:21:55.416101 7f3f2a10e700 2 mds.6.96224 boot_start 0: opening inotable 2017-10-10 13:21:55.416120 7f3f2a10e700 10 mds.6.inotable: load 2017-10-10 13:21:55.416301 7f3f2a10e700 2 mds.6.96224 boot_start 0: opening sessionmap 2017-10-10 13:21:55.416310
[ceph-users] BlueStore Cache Ratios
I've been reading about BlueStore and came across the BlueStore cache and its ratios, which I couldn't fully understand. Are ratios of .99 for KV, .01 for metadata, and .0 for data right? They seem rather disproportionate. With a .99 KV ratio and a 3GB cache for SSDs, almost the whole 3GB would go to KV, yet there is another option, bluestore_cache_kv_max, which defaults to 512MB. So what is the rest of the cache used for? Nothing? Shouldn't kv_max be larger, or the KV ratio smaller? I know it really depends on the environment (size, number of IOs, files...), but these defaults seem odd and unreasonable to me.

Is there any way I can estimate roughly how much KV and metadata is generated per GB of actual data? Does it make sense to leave some cache for the data itself, or is it better to cache only onodes and metadata?

Another small question: I don't really understand where BlueStore gets its speed from, since it writes data directly to the end device (unlike FileStore, where you had a journal). Shouldn't throughput then be limited by that device's write speed, even with an SSD for RocksDB?

Thanks a lot.

*Jorge Pinilla López* jorp...@unizar.es
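For reference, the knobs being asked about map to these Luminous options (values shown are the defaults described in the mail; treat this as a sketch and check your release's configuration reference):

```ini
[osd]
bluestore_cache_size_ssd = 3221225472   # 3 GB total cache per SSD-backed OSD
bluestore_cache_kv_ratio = 0.99         # share offered to the RocksDB block cache
bluestore_cache_meta_ratio = 0.01       # share for the onode/metadata cache
bluestore_cache_kv_max = 536870912      # KV share is capped at 512 MB; as I
                                        # understand it, the remainder above the
                                        # cap falls back to the metadata cache
```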
Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory
>> I am trying to follow the instructions at: >> http://docs.ceph.com/docs/master/cephfs/client-auth/ >> to restrict a client to a subdirectory of Ceph filesystem, but always get >> an error. >> >> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7 >> servers. The user 'hydra' has the following capabilities: >> # ceph auth get client.hydra >> exported keyring for client.hydra >> [client.hydra] >> key = AQ== >> caps mds = "allow rw" >> caps mgr = "allow r" >> caps mon = "allow r" >> caps osd = "allow rw" >> >> When I tried to restrict the client to only mount and work within the >> directory /hydra of the Ceph filesystem 'pulpos', I got an error: >> # ceph fs authorize pulpos client.hydra /hydra rw >> Error EINVAL: key for client.dong exists but cap mds does not match >> >> I've tried a few combinations of user caps and CephFS client caps; but >> always got the same error! > > The "fs authorize" command isn't smart enough to edit existing > capabilities safely, so it is cautious and refuses to overwrite what > is already there. If you remove your client.hydra user and try again, > it should create it for you with the correct capabilities. I confirm it works perfectly ! it should be added to the documentation. :) # ceph fs authorize cephfs client.foo1 /foo1 rw [client.foo1] key = XXX1 # ceph fs authorize cephfs client.foo2 / r /foo2 rw [client.foo2] key = XXX2 # ceph auth get client.foo1 exported keyring for client.foo1 [client.foo1] key = XXX1 caps mds = "allow rw path=/foo1" caps mon = "allow r" caps osd = "allow rw pool=cephfs_data" # ceph auth get client.foo2 exported keyring for client.foo2 [client.foo2] key = XXX2 caps mds = "allow r, allow rw path=/foo2" caps mon = "allow r" caps osd = "allow rw pool=cephfs_data" Best regards, -- Yoann Moulin EPFL IC-IT ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory
On Tue, Oct 10, 2017 at 2:22 AM, Shawfeng Dongwrote: > Dear all, > > I am trying to follow the instructions at: > http://docs.ceph.com/docs/master/cephfs/client-auth/ > to restrict a client to a subdirectory of Ceph filesystem, but always get > an error. > > We are running the latest stable release of Ceph (v12.2.1) on CentOS 7 > servers. The user 'hydra' has the following capabilities: > # ceph auth get client.hydra > exported keyring for client.hydra > [client.hydra] > key = AQ== > caps mds = "allow rw" > caps mgr = "allow r" > caps mon = "allow r" > caps osd = "allow rw" > > When I tried to restrict the client to only mount and work within the > directory /hydra of the Ceph filesystem 'pulpos', I got an error: > # ceph fs authorize pulpos client.hydra /hydra rw > Error EINVAL: key for client.dong exists but cap mds does not match > > I've tried a few combinations of user caps and CephFS client caps; but > always got the same error! The "fs authorize" command isn't smart enough to edit existing capabilities safely, so it is cautious and refuses to overwrite what is already there. If you remove your client.hydra user and try again, it should create it for you with the correct capabilities. John > > Has anyone able to get this to work? What is your recipe? > > Thanks, > Shaw > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
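John's suggestion, written out as commands using the names from the original mail. Caution: deleting the key breaks any client still mounting with it, so unmount first:

```shell
# Remove the existing user, then let "fs authorize" recreate it with
# consistent caps instead of trying to edit the old ones.
ceph auth del client.hydra
ceph fs authorize pulpos client.hydra /hydra rw
```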
Re: [ceph-users] how to debug (in order to repair) damaged MDS (rank)?
On Tue, Oct 10, 2017 at 10:28 AM, Daniel Baumann wrote:
> Hi all,
>
> unfortunately I'm still struggling to bring cephfs back up after one of
> the MDS has been marked "damaged" (see messages from Monday).
>
> 1. When I mark the rank as "repaired", this is what I get in the monitor
>    log (leaving unrelated leveldb compacting chatter aside):
>
> 2017-10-10 10:51:23.177865 7f3290710700  0 log_channel(audit) log [INF]
> : from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd
> ='[{"prefix": "mds repaired", "rank": "6"}]': finished
> 2017-10-10 10:51:23.177993 7f3290710700  0 log_channel(cluster) log
> [DBG] : fsmap cephfs-9/9/9 up {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
> [...]
>
> 2017-10-10 10:51:23.492040 7f328ab1c700  1 mon.mon1@0(leader).mds e96186
> mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
> [...]
>
> 2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log
> [ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> 2. ...and this is what I get on the mds:
>
> 2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro)
> _decode error from assimilate_prefetch
> 2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover:
> Error -22 recovering write_pos

This is probably the root cause: somehow the PurgeQueue (one of the
on-disk metadata structures) has become corrupt.

The purge queue objects for rank 6 will all have names starting "506."
in the metadata pool. This is probably the result of a bug of some
kind, so to give us a chance of working out what went wrong let's
gather some evidence first:

A) Do a "rados -p <metadata pool> ls | grep '^506\.'" or similar, to
   get a list of the objects
B) Write a short bash loop to do a "rados -p <metadata pool> get" on
   each of those objects into a file.
C) Stop the MDS, set "debug mds = 20" and "debug journaler = 20", mark
   the rank repaired, start the MDS again, and then gather the
   resulting log (it should end in the same "Error -22 recovering
   write_pos", but have much much more detail about what came before).

Because you've hit a serious bug, it's really important to gather all
this and share it, so that we can try to fix it and prevent it
happening again to you or others.

Once you've put all that evidence somewhere safe, you can start
intervening to repair it. The good news is that this is the best part
of your metadata to damage, because all it does is record the list of
deleted files to purge. You have two options, depending on how much
downtime you can tolerate:

- carefully remove all the metadata objects that start with "506." --
  this will cause that MDS rank to completely forget about purging
  anything in its queue. This will leave some orphan data objects in
  the data pool that will never get cleaned up (without doing some
  more offline repair).
- inspect the detailed logs from step C of the evidence gathering, to
  work out exactly how far the journal loading got before hitting
  something corrupt. Then with some finer-grained editing of the
  on-disk objects, we can persuade it to skip over the part that was
  damaged.

John

> (see attachment for the full mds log during the "repair" action)
>
> I'm really stuck here and would greatly appreciate any help. How can I
> see what is actually going on/what the problem is? Running
> ceph-mon/ceph-mds with debug levels logs just "damaged" as quoted
> above, but doesn't tell what is wrong or why it's failing.
>
> Would going back to a single MDS with "ceph fs reset" allow me to
> access the data again?
>
> Regards,
> Daniel
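Steps A and B of the evidence gathering above can be sketched as a short shell loop. This is a sketch only: `cephfs_metadata` is an assumed pool name, so substitute your cluster's actual metadata pool.

```shell
POOL=cephfs_metadata   # assumption: adjust to your metadata pool's name
mkdir -p pq-backup

# A) list the rank-6 purge queue objects (names start with "506.")
rados -p "$POOL" ls | grep '^506\.' > pq-objects.txt

# B) save a copy of every object before any repair is attempted
while read -r obj; do
    rados -p "$POOL" get "$obj" "pq-backup/$obj"
done < pq-objects.txt
```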
[ceph-users] advice on number of objects per OSD
Hi,

Are there any recommendations on the limit at which OSD performance
starts to decline because of a large number of objects? Or perhaps a
procedure for finding this number (on Luminous)?

My understanding is that the recommended object size is 10-100 MB, but
is there any performance hit from a large number of objects? I ran
across a figure of about 1M objects per OSD; is that so?

We do not have a separate SSD for the journal, and we use librados for I/O.

Alexander.
[ceph-users] 1 MDSs behind on trimming (was Re: clients failing to advance oldest client/flush tid)
On Tue, Oct 10, 2017 at 3:48 AM, Nigel Williams wrote:
> On 9 October 2017 at 19:21, Jake Grimmett wrote:
>> HEALTH_WARN 9 clients failing to advance oldest client/flush tid;
>> 1 MDSs report slow requests; 1 MDSs behind on trimming

(This is the less worrying of the original thread's messages, so I've
edited the subject line)

> On a proof-of-concept 12.2.1 cluster (a few random files added, 30
> OSDs, default Ceph settings) I can get the above error by doing this
> from a client:
>
> bonnie++ -s 0 -n 1000 -u 0
>
> This makes 1 million files in a single directory (we wanted to see
> what might break).
>
> This takes a few hours to run but seems to finish without incident.
> Over that time we get this in the logs:

We do sometimes see this on systems that have the metadata pool either
on kinda-slow drives, or on drives that are shared with a very busy
data pool. If either is the case (i.e. if your OSDs are very busy)
then the warning is probably nothing to worry about (it does make me
wonder if we should make the default journal length longer, though).
You can make the system more tolerant of slow metadata writeback by
adjusting mds_log_max_segments upwards (for example, doubling it from
the default of 30 is not a big deal).

If your OSDs are *not* very busy, and you're still seeing this
warning, then you are hitting a bug and it's worth investigating.

John

> root@c0mon-101:/var/log/ceph# zcat ceph-mon.c0mon-101.log.6.gz | fgrep MDS_TRIM
> 2017-10-04 11:14:18.489943 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:22.523117 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on trimming)
> 2017-10-04 11:14:26.589797 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 11:14:34.614567 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on trimming)
> 2017-10-04 20:38:22.812032 7ff914a26700  0 log_channel(cluster) log
> [WRN] : Health check failed: 1 MDSs behind on trimming (MDS_TRIM)
> 2017-10-04 20:41:14.700521 7ff914a26700  0 log_channel(cluster) log
> [INF] : Health check cleared: MDS_TRIM (was: 1 MDSs behind on trimming)
> root@c0mon-101:/var/log/ceph#
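A sketch of the mds_log_max_segments tuning suggested above, assuming the Luminous-era `injectargs` syntax; 60 is merely an example value (double the default of 30):

```shell
# Bump the journal length on all MDS daemons at runtime
ceph tell mds.* injectargs '--mds_log_max_segments=60'

# To persist it across restarts, add to the [mds] section of ceph.conf:
#   mds log max segments = 60
```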
[ceph-users] how to debug (in order to repair) damaged MDS (rank)?
Hi all,

unfortunately I'm still struggling to bring cephfs back up after one of
the MDS has been marked "damaged" (see messages from Monday).

1. When I mark the rank as "repaired", this is what I get in the monitor
   log (leaving unrelated leveldb compacting chatter aside):

2017-10-10 10:51:23.177865 7f3290710700  0 log_channel(audit) log [INF]
: from='client.? 147.87.226.72:0/1658479115' entity='client.admin' cmd
='[{"prefix": "mds repaired", "rank": "6"}]': finished
2017-10-10 10:51:23.177993 7f3290710700  0 log_channel(cluster) log
[DBG] : fsmap cephfs-9/9/9 up {0=mds1=up:resolve,1=mds2=up:resolve,2=mds3=up:resolve,3=mds4=up:resolve,4=mds5=up:resolve,5=mds6=up:resolve,6=mds9=up:replay,7=mds7=up:resolve,8=mds8=up:resolve}
[...]

2017-10-10 10:51:23.492040 7f328ab1c700  1 mon.mon1@0(leader).mds e96186
mds mds.? 147.87.226.189:6800/524543767 can't write to fsmap
compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
[...]

2017-10-10 10:51:24.291827 7f328d321700 -1 log_channel(cluster) log
[ERR] : Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

2. ...and this is what I get on the mds:

2017-10-10 11:21:26.537204 7fcb01702700 -1 mds.6.journaler.pq(ro)
_decode error from assimilate_prefetch
2017-10-10 11:21:26.537223 7fcb01702700 -1 mds.6.purge_queue _recover:
Error -22 recovering write_pos

(see attachment for the full mds log during the "repair" action)

I'm really stuck here and would greatly appreciate any help. How can I
see what is actually going on/what the problem is? Running
ceph-mon/ceph-mds with debug levels logs just "damaged" as quoted
above, but doesn't tell what is wrong or why it's failing.

Would going back to a single MDS with "ceph fs reset" allow me to access
the data again?
Regards,
Daniel

2017-10-10 11:21:26.419394 7fcb0670c700 10 mds.mds9 mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
2017-10-10 11:21:26.419399 7fcb0670c700 10 mds.mds9 map says I am 147.87.226.189:6800/1182896077 mds.6.96195 state up:replay
2017-10-10 11:21:26.419623 7fcb0670c700  4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.419679 7fcb0670c700 10 mds.mds9 handle_mds_map: initializing MDS rank 6
2017-10-10 11:21:26.419916 7fcb0670c700 10 mds.6.0 update_log_config log_to_monitors {default=true}
2017-10-10 11:21:26.419920 7fcb0670c700 10 mds.6.0 create_logger
2017-10-10 11:21:26.420138 7fcb0670c700  7 mds.6.server operator(): full = 0 epoch = 0
2017-10-10 11:21:26.420146 7fcb0670c700  4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.420150 7fcb0670c700  4 mds.6.0 handle_osd_map epoch 0, 0 new blacklist entries
2017-10-10 11:21:26.420159 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.420338 7fcb0670c700 10 mds.mds9 handle_mds_map: handling map as rank 6
2017-10-10 11:21:26.420347 7fcb0670c700  1 mds.6.96195 handle_mds_map i am now mds.6.96195
2017-10-10 11:21:26.420351 7fcb0670c700  1 mds.6.96195 handle_mds_map state change up:boot --> up:replay
2017-10-10 11:21:26.420366 7fcb0670c700 10 mds.beacon.mds9 set_want_state: up:standby -> up:replay
2017-10-10 11:21:26.420370 7fcb0670c700  1 mds.6.96195 replay_start
2017-10-10 11:21:26.420375 7fcb0670c700  7 mds.6.cache set_recovery_set 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420380 7fcb0670c700  1 mds.6.96195 recovery set is 0,1,2,3,4,5,7,8
2017-10-10 11:21:26.420395 7fcb0670c700  1 mds.6.96195 waiting for osdmap 18593 (which blacklists prior instance)
2017-10-10 11:21:26.420401 7fcb0670c700  4 mds.6.purge_queue operator(): data pool 7 not found in OSDMap
2017-10-10 11:21:26.421206 7fcb0670c700  7 mds.6.server operator(): full = 0 epoch = 18593
2017-10-10 11:21:26.421217 7fcb0670c700  4 mds.6.96195 handle_osd_map epoch 18593, 0 new blacklist entries
2017-10-10 11:21:26.421220 7fcb0670c700 10 mds.6.server apply_blacklist: killed 0
2017-10-10 11:21:26.421253 7fcb00700700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2017-10-10 11:21:26.421263 7fcb00700700 10 MDSInternalContextBase::complete: 15C_MDS_BootStart
2017-10-10 11:21:26.421267 7fcb00700700  2 mds.6.96195 boot_start 0: opening inotable
2017-10-10 11:21:26.421285 7fcb00700700 10 mds.6.inotable: load
2017-10-10 11:21:26.421441 7fcb00700700  2 mds.6.96195 boot_start 0: opening sessionmap
2017-10-10 11:21:26.421449 7fcb00700700 10 mds.6.sessionmap load
2017-10-10 11:21:26.421551 7fcb00700700  2 mds.6.96195 boot_start 0: opening mds log
2017-10-10 11:21:26.421558 7fcb00700700  5 mds.6.log open discovering log bounds
2017-10-10 11:21:26.421720 7fcaff6fe700 10 mds.6.log _submit_thread start
2017-10-10 11:21:26.423002 7fcb00700700 10 MDSIOContextBase::complete: N12_GLOBAL__N_112C_IO_SM_LoadE
Re: [ceph-users] ceph-volume: migration and disk partition support
On Fri, Oct 6, 2017 at 6:56 PM, Alfredo Deza wrote:
> Hi,
>
> Now that ceph-volume is part of the Luminous release, we've been able
> to provide filestore support for LVM-based OSDs. We are making use of
> LVM's powerful mechanisms to store metadata, which allows the process
> to no longer rely on UDEV and GPT labels (unlike ceph-disk).
>
> Bluestore support should be the next step for `ceph-volume lvm`, and
> while that is planned we are thinking of ways to improve the current
> caveats (like OSDs not coming up) for clusters that have deployed
> OSDs with ceph-disk.
>
> --- New clusters ---
> The `ceph-volume lvm` deployment is straightforward (currently
> supported in ceph-ansible), but there isn't support for plain disks
> (with partitions) currently, like there is with ceph-disk.
>
> Is there a pressing interest in supporting plain disks with
> partitions? Or is only supporting LVM-based OSDs fine?
>
> --- Existing clusters ---
> Migration to ceph-volume, even with plain disk support, means
> re-creating the OSD from scratch, which would end up moving data.
> There is no way to make a GPT/ceph-disk OSD become a ceph-volume one
> without starting from scratch.
>
> A temporary workaround would be to provide a way for existing OSDs to
> be brought up without UDEV and ceph-disk, by creating logic in
> ceph-volume that could load them with systemd directly. This wouldn't
> make them LVM-based, nor would it mean there is direct support for
> them; just a temporary workaround to make them start without UDEV and
> ceph-disk.
>
> I'm interested in what current users might look for here: is it fine
> to provide this workaround if the issues are that problematic? Or is
> it OK to plan a migration towards ceph-volume OSDs?

Without fully understanding the technical details and plans, it will
be hard to answer this. In general, I wouldn't plan to recreate all
OSDs.

In our case, we don't currently plan to recreate FileStore OSDs as
BlueStore after the Luminous upgrade, as that would be too much work.
*New* OSDs will be created the *new* way (is that ceph-disk bluestore?
ceph-volume lvm bluestore??)

It wouldn't be nice if we created new OSDs today with ceph-disk
bluestore, then had to recreate all of those with ceph-volume
bluestore in a few months. Disks/servers have a ~5 year lifetime, and
we want to format OSDs exactly once. I'd hope those OSDs remain
bootable for the upcoming releases. (ceph-disk activation works
reliably enough here -- just don't remove the existing functionality
and we'll be happy).

-- dan
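For readers unfamiliar with the tool under discussion, the `ceph-volume lvm` filestore workflow looks roughly like this. This is a sketch: the volume group/LV name and journal device are assumptions, and subcommand availability should be checked against your Luminous point release's documentation.

```shell
# Prepare a filestore OSD on an existing logical volume (vg0/osd0 is made up)
ceph-volume lvm prepare --filestore --data vg0/osd0 --journal /dev/sdc1

# Activate it using the OSD id and fsid reported by "prepare"
ceph-volume lvm activate 0 <osd-fsid>

# Or do both steps at once
ceph-volume lvm create --filestore --data vg0/osd0 --journal /dev/sdc1
```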
Re: [ceph-users] ceph-volume: migration and disk partition support
Hi,

Quoting Alfredo Deza (ad...@redhat.com):
> Hi,
>
> Now that ceph-volume is part of the Luminous release, we've been able
> to provide filestore support for LVM-based OSDs. We are making use of
> LVM's powerful mechanisms to store metadata, which allows the process
> to no longer rely on UDEV and GPT labels (unlike ceph-disk).
>
> Bluestore support should be the next step for `ceph-volume lvm`, and
> while that is planned we are thinking of ways to improve the current
> caveats (like OSDs not coming up) for clusters that have deployed
> OSDs with ceph-disk.

I'm a bit confused after reading this, so just to make things clear:
would bluestore be put on top of an LVM volume (in an ideal world)?
Does bluestore in Ceph Luminous have support for LVM, i.e. is there
code in bluestore to support LVM? Or is it _just_ support in
`ceph-volume lvm` for bluestore?

> --- New clusters ---
> The `ceph-volume lvm` deployment is straightforward (currently
> supported in ceph-ansible), but there isn't support for plain disks
> (with partitions) currently, like there is with ceph-disk.
>
> Is there a pressing interest in supporting plain disks with
> partitions? Or is only supporting LVM-based OSDs fine?

We're still in a green-field situation; users with an installed base
will have to comment on this. If the assumption that bluestore would
be put on top of LVM is true, it would make things simpler (in our own
Ceph ansible playbook).

Gr. Stefan

--
| BIT BV  http://www.bit.nl/    Kamer van Koophandel 09090351
| GPG: 0xD14839C6               +31 318 648 688 / i...@bit.nl
Re: [ceph-users] Unable to restrict a CephFS client to a subdirectory
Hello,

> I am trying to follow the instructions at:
> http://docs.ceph.com/docs/master/cephfs/client-auth/
> to restrict a client to a subdirectory of the Ceph filesystem, but
> always get an error.
>
> We are running the latest stable release of Ceph (v12.2.1) on CentOS 7
> servers. The user 'hydra' has the following capabilities:
> # ceph auth get client.hydra
> exported keyring for client.hydra
> [client.hydra]
> key = AQ==
> caps mds = "allow rw"
> caps mgr = "allow r"
> caps mon = "allow r"
> caps osd = "allow rw"
>
> When I tried to restrict the client to only mount and work within the
> directory /hydra of the Ceph filesystem 'pulpos', I got an error:
> # ceph fs authorize pulpos client.hydra /hydra rw
> Error EINVAL: key for client.dong exists but cap mds does not match
>
> I've tried a few combinations of user caps and CephFS client caps, but
> always got the same error!
>
> Has anyone been able to get this to work? What is your recipe?

In the case where the client runs an old kernel (at least 4.4 is old,
4.10 is not), you need to give read access to the entire cephfs fs;
otherwise you won't be able to mount the subdirectory.

1/ Give read access to the mds and rw to the subdirectory:

# ceph auth get-or-create client.foo mon "allow r" osd "allow rw pool=cephfs_data" mds "allow r, allow rw path=/foo"

or, if client.foo already exists:

# ceph auth caps client.foo mon "allow r" osd "allow rw pool=cephfs_data" mds "allow r, allow rw path=/foo"

[client.foo]
key = XXX
caps mds = "allow r, allow rw path=/foo"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

2/ Or give read access to / and rw access to the subdirectory:

# ceph fs authorize cephfs client.foo / r /foo rw

Then you get the secret key and mount:

# ceph --cluster container auth get-key client.foo > foo.secret
# mount.ceph mds1,mds2,mds3:/foo /foo -v -o name=foo,secretfile=/path/to/foo.secret

With an old kernel, you will always be able to mount the root of the
cephfs fs:

# mount.ceph mds1,mds2,mds3:/ /foo -v -o name=foo,secretfile=/path/to/foo.secret

If your client runs a not-so-old kernel, you can do this instead:

1/ Give access to the specific path only:

# ceph auth get-or-create client.bar mon "allow r" osd "allow rw pool=cephfs_data" mds "allow rw path=/bar"

or, if client.bar already exists:

# ceph auth caps client.bar mon "allow r" osd "allow rw pool=cephfs_data" mds "allow rw path=/bar"

[client.bar]
key = XXX
caps mds = "allow rw path=/bar"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"

2/ Or give rw access only on the subdirectory:

# ceph fs authorize cephfs client.bar /bar rw

Then you get the secret key and mount:

# ceph --cluster container auth get-key client.bar > bar.secret
# mount.ceph mds1,mds2,mds3:/bar /bar -v -o name=bar,secretfile=/path/to/bar.secret

If you try to mount the cephfs root, you should get an access denied:

# mount.ceph mds1,mds2,mds3:/ /bar -v -o name=bar,secretfile=/path/to/bar.secret

If you want to increase security further, you might take a look at
namespaces and file layouts:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
I haven't looked into it yet, but it looks really interesting!

> Thanks,
> Shaw

--
Yoann Moulin
EPFL IC-IT
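The namespace/file-layout approach mentioned above can be sketched like this. This is untested: the directory, mount point, and client names are made up; `ceph.dir.layout.pool_namespace` is the vxattr described in the file-layouts documentation linked above.

```shell
# Pin new files under /secure to their own RADOS namespace
setfattr -n ceph.dir.layout.pool_namespace -v secure /mnt/cephfs/secure

# Restrict the client's OSD cap to that namespace only
ceph auth caps client.secure \
    mon "allow r" \
    osd "allow rw pool=cephfs_data namespace=secure" \
    mds "allow rw path=/secure"
```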