Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space
Just chiming in to say that I too had some issues with backfill_toofull PGs, despite no OSDs being in a backfillfull state, albeit there were some nearfull OSDs. I was able to get through it by reweighting down the OSD that was the target reported by "ceph pg dump | grep 'backfill_toofull'". This was on 14.2.2.

Reed

> On Aug 21, 2019, at 2:50 PM, Vladimir Brik wrote:
>
> Hello
>
> After increasing the number of PGs in a pool, ceph status is reporting
> "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't
> understand why, because all OSDs seem to have enough space.
>
> ceph health detail says:
> pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
>
> $ ceph pg map 40.155
> osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
>
> So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way
> around?). According to "osd df", OSD 66's utilization is 71.90%, OSD 79's
> utilization is 58.45%. The OSD with the least free space in the cluster is
> 81.23% full, and it's not any of the ones above.
>
> OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> $ ceph osd dump | grep ratio
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.7
>
> Does anybody know why a PG could be in the backfill_toofull state if no OSD
> is in the backfillfull state?
>
> Vlad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
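The workaround Reed describes can be sketched as a short command sequence against a live cluster. This is a hedged sketch only: the OSD id (66) is taken from the thread for illustration, and the reweight value is an arbitrary starting point, not a recommendation.

```shell
# Find the PG(s) stuck in backfill_toofull and their up/acting OSD sets.
ceph pg dump pgs_brief 2>/dev/null | grep backfill_toofull

# Suppose the backfill target reported is osd.66: nudge its reweight down
# slightly so the backfill reservation can be made elsewhere.
ceph osd reweight 66 0.95

# Watch recovery until the PG leaves backfill_toofull.
ceph -w
```

The reweight can be restored to 1.0 once the PG has backfilled.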
Re: [ceph-users] Multiple CephFS Filesystems Nautilus (14.2.2)
On Wed, Aug 21, 2019 at 2:02 PM wrote:
> How experimental is the multiple CephFS filesystems per cluster feature? We
> plan to use different sets of pools (meta / data) per filesystem.
>
> Are there any known issues?

No. It will likely work fine, but some things may change in a future version in a way that makes upgrading more difficult.

> While we're on the subject, is it possible to assign a different active MDS
> to each filesystem?

The monitors do the assignment. You cannot specify which file system an MDS serves.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
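For reference, creating a second filesystem with its own pools looks roughly like this on Nautilus. The pool names and PG counts below are illustrative placeholders, not values from the thread.

```shell
# Dedicated metadata/data pools for the second filesystem
# (names and PG counts are illustrative).
ceph osd pool create fs2_metadata 32
ceph osd pool create fs2_data 128

# Multiple filesystems must be enabled explicitly.
ceph fs flags set enable_multiple true --yes-i-really-mean-it

# Create the filesystem; the monitors will assign any available standby MDS.
ceph fs new fs2 fs2_metadata fs2_data
```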
[ceph-users] Multiple CephFS Filesystems Nautilus (14.2.2)
All;

How experimental is the multiple CephFS filesystems per cluster feature? We plan to use different sets of pools (meta / data) per filesystem.

Are there any known issues?

While we're on the subject, is it possible to assign a different active MDS to each filesystem?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
> Are you running multisite?

No

> Do you have dynamic bucket resharding turned on?

Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?

I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?

Multiple, but I have not seen all 3 being in this state simultaneously. Currently one gateway has 1 thread using 100% of a CPU core, and another has 5 threads each using 100%.

Here are the fruits of my attempts to capture the call graph using perf and gdbpmp:
https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note that I couldn't get perf not to generate the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples
Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning: Processed 574207 events and lost 4 chunks! Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]

Vlad

On 8/21/19 11:16 AM, J. Eric Ivancich wrote:
> On 8/21/19 10:22 AM, Mark Nelson wrote:
>> Hi Vladimir,
>>
>> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>>> Hello
>>> [much elided]
>>
>> You might want to try grabbing a call graph from perf instead of just
>> running perf top, or using my wallclock profiler, to see if you can
>> drill down and find out where in that method it's spending the most time.
>
> I agree with Mark -- a call graph would be very helpful in tracking down
> what's happening.
>
> There are background tasks that run. Are you running multisite? Do you
> have dynamic bucket resharding turned on? Are you using lifecycle? And
> garbage collection is another background task.
>
> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
>
> But the call graph would be incredibly helpful.
>
> Thank you,
> Eric
[ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space
Hello

After increasing the number of PGs in a pool, ceph status is reporting "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't understand why, because all OSDs seem to have enough space.

ceph health detail says:
pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]

$ ceph pg map 40.155
osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]

So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way around?). According to "osd df", OSD 66's utilization is 71.90%, OSD 79's utilization is 58.45%. The OSD with the least free space in the cluster is 81.23% full, and it's not any of the ones above.

OSD backfillfull_ratio is 90% (is there a better way to determine this?):
$ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.7

Does anybody know why a PG could be in the backfill_toofull state if no OSD is in the backfillfull state?

Vlad
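The sanity check Vlad is doing by hand can be expressed mechanically: compare each OSD's utilization against the backfillfull ratio. The snippet below is a toy sketch using the three utilization figures quoted in the post (osd.X stands in for the unnamed fullest OSD); on a real cluster you would feed it the %USE column of "ceph osd df" instead of the printf.

```shell
# Illustrative check: flag any OSD whose utilization (%) exceeds the 90%
# backfillfull ratio reported by "ceph osd dump | grep ratio".
printf 'osd.66 71.90\nosd.79 58.45\nosd.X 81.23\n' |
  awk '$2 > 90 { print $1, "exceeds backfillfull"; f=1 }
       END { if (!f) print "no OSD over backfillfull ratio" }'
# prints: no OSD over backfillfull ratio
```

Which matches Vlad's observation: by the numbers, nothing should be over the ratio.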
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
On 8/21/19 10:22 AM, Mark Nelson wrote:
> Hi Vladimir,
>
> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>> Hello
>> [much elided]
>
> You might want to try grabbing a call graph from perf instead of just
> running perf top, or using my wallclock profiler, to see if you can drill
> down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down what's happening.

There are background tasks that run. Are you running multisite? Do you have dynamic bucket resharding turned on? Are you using lifecycle? And garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,
Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Correction: the number of threads stuck using 100% of a CPU core varies from 1 to 5 (it's not always 5).

Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (error log file doesn't exist)
>
> Thanks,
>
> Vlad
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log

Probably unrelated to your problem, but running with lots of threads is usually an indicator that the async beast frontend would be a better fit for your setup. (But the code you see in perf should not be related to the frontend.)

Paul

> (error log file doesn't exist)
>
> Thanks,
>
> Vlad
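A beast-based equivalent of the civetweb line quoted above might look like the fragment below. The certificate path is from the post; the private-key path and the rgw instance name are assumptions, and the exact option names should be verified against the Nautilus docs for your minor version before deploying.

```ini
[client.rgw.rgw-3]
rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt ssl_private_key=/etc/ceph/rgw.key
```

Unlike civetweb, beast is asynchronous, so it does not need a num_threads in the thousands to sustain many concurrent connections.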
Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Hi Vladimir,

On 8/21/19 8:54 AM, Vladimir Brik wrote:
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"

I don't normally look at the RGW code, so maybe Matt/Casey/Eric can chime in. That code is in src/rgw/rgw_rados.cc in the get_obj_data struct. The flush method does some sorting/merging and then walks through a list of completed IOs and appears to copy a bufferlist out of each one, then deletes it from the list and passes the BL off to client_cb->handle_data. Looks like it could be pretty CPU intensive, but if you are seeing that much CPU for that long it sounds like something is rather off.

You might want to try grabbing a call graph from perf instead of just running perf top, or using my wallclock profiler, to see if you can drill down and find out where in that method it's spending the most time.

My wallclock profiler is here: https://github.com/markhpc/gdbpmp

Mark

> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log
>
> (error log file doesn't exist)
>
> Thanks,
>
> Vlad
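For completeness, collecting and then inspecting a call graph with the two tools Mark mentions looks roughly like this. The PID is the one from later in the thread, used here for illustration; the gdbpmp display flag is as I recall from its README, so double-check there.

```shell
# perf: sample with frame-pointer call graphs for 10 seconds, then view
# the aggregated call tree.
perf record --call-graph fp -p 73688 -- sleep 10
perf report --stdio -g

# gdbpmp: take 100 wallclock samples of the same process, then print
# the collected call tree from the output file.
./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
./gdbpmp.py -i gdbpmp.data
```

Note that perf only sees on-CPU time, while the gdb-based wallclock profiler also catches time spent blocked, which is why collecting both is useful.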
[ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, the radosgw process on those machines starts consuming 100% of 5 CPU cores for days at a time, even though the machine is not being used for data transfers (nothing in radosgw logs, couple of KB/s of network).

This situation can affect any number of our rados gateways, lasts from a few hours to a few days, and stops if the radosgw process is restarted or on its own.

Does anybody have an idea what might be going on or how to debug it? I don't see anything obvious in the logs. Perf top is saying that CPU is consumed by the radosgw shared object in symbol get_obj_data::flush, which, if I interpret things correctly, is called from a symbol with a long name that contains the substring "boost9intrusive9list_impl"

This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s ssl_certificate=/etc/ceph/rgw.crt error_log_file=/var/log/ceph/civetweb.error.log

(error log file doesn't exist)

Thanks,

Vlad
Re: [ceph-users] Applications slow in VMs running RBD disks
Use a 100% flash setup; avoid rotational disks if you want acceptable performance for Windows. Windows is very sensitive to disk latency, and a high-latency interface sometimes gives the customer a bad experience. You can check in your Grafana dashboards for Ceph: when the average read/write latency seen by Windows goes above 50 ms, the performance is painful.

Regards,
Manuel

From: ceph-users On behalf of Gesiel Galvão Bernardes
Sent: Wednesday, August 21, 2019 14:26
To: ceph-users
Subject: [ceph-users] Applications slow in VMs running RBD disks

Hi,

I'm using Qemu/kvm (Opennebula) with Ceph/RBD for running VMs, and I'm having problems with slowness in applications that often are not consuming much CPU or RAM. This problem affects mostly Windows. Apparently the problem is that applications normally load many small files (e.g. DLLs), and these files take a long time to load, causing the slowness.

I'm using 8 TB disks, with 3x replica (I've tried with erasure and 2x too), and tried with and without an SSD cache, and the problem persists. Using the same disks with NFS, the applications run fine.

I've already tried changing the RBD object size (from 4 MB to 128 KB), using the qemu writeback cache, configuring virtio-scsi queues, using the virtio (virtio-blk) driver, and none of these brought effective improvement for this problem.

Has anyone had a similar problem and/or have any idea how to solve it? Or an idea of where I should look to resolve this?

Thanks in advance,
Gesiel
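To quantify the latency Manuel refers to before blaming the guest, one can benchmark the RBD image directly from a client node. The pool/image names below are illustrative; the small 4K I/O size roughly mimics the many-small-reads pattern of DLL loading.

```shell
# Small random reads against a test image; the output includes ops/sec
# and bytes/sec, which give a feel for small-I/O behavior on this pool.
rbd bench --io-type read --io-size 4K --io-pattern rand --io-total 256M rbd/testimage
```

If this already shows poor small-I/O performance, the problem is below the hypervisor (network, OSD media, or client-side rbd cache settings) rather than in the VM configuration.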
Re: [ceph-users] Applications slow in VMs running RBD disks
Hi Eliza,

On Wed, Aug 21, 2019 at 09:30, Eliza wrote:
> Hi
>
> on 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
> > I`m use a Qemu/kvm(Opennebula) with Ceph/RBD for running VMs, and I
> > having problems with slowness in aplications that many times not
> > consuming very CPU or RAM. This problem affect mostly Windows. Appearly
> > the problem is that normally the application load many short files (ex:
> > DLLs) and these files take a long time to load, generating a slowness.
>
> Did you check/test your network connection?
> Do you have a fast network setup?

I have a bond of two 10GB interfaces, with little use.
Re: [ceph-users] Applications slow in VMs running RBD disks
Hi

on 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
> I`m use a Qemu/kvm(Opennebula) with Ceph/RBD for running VMs, and I
> having problems with slowness in aplications that many times not
> consuming very CPU or RAM. This problem affect mostly Windows. Appearly
> the problem is that normally the application load many short files (ex:
> DLLs) and these files take a long time to load, generating a slowness.

Did you check/test your network connection?
Do you have a fast network setup?

regards.
[ceph-users] Applications slow in VMs running RBD disks
Hi,

I'm using Qemu/kvm (Opennebula) with Ceph/RBD for running VMs, and I'm having problems with slowness in applications that often are not consuming much CPU or RAM. This problem affects mostly Windows. Apparently the problem is that applications normally load many small files (e.g. DLLs), and these files take a long time to load, causing the slowness.

I'm using 8 TB disks, with 3x replica (I've tried with erasure and 2x too), and tried with and without an SSD cache, and the problem persists. Using the same disks with NFS, the applications run fine.

I've already tried changing the RBD object size (from 4 MB to 128 KB), using the qemu writeback cache, configuring virtio-scsi queues, using the virtio (virtio-blk) driver, and none of these brought effective improvement for this problem.

Has anyone had a similar problem and/or have any idea how to solve it? Or an idea of where I should look to resolve this?

Thanks in advance,
Gesiel
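One knob worth isolating from the list above is the guest disk cache mode. A libvirt definition for an RBD-backed virtio disk with writeback cache might look like this sketch; the pool/image name and monitor host are hypothetical placeholders.

```xml
<disk type='network' device='disk'>
  <!-- cache='writeback' relies on the client-side RBD cache;
       monitor host/port and image name are illustrative -->
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='rbd/vm-disk-1'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
```

Comparing the same Windows guest with cache='none' versus cache='writeback' under a DLL-heavy workload would show whether the small-read latency is being absorbed by the client cache at all.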
Re: [ceph-users] mon db change from rocksdb to leveldb
You can't downgrade from Luminous to Kraken, well, officially at least. I guess it maybe could somehow work, but you'd need to re-create all the services. For the mon example: delete a mon, create a new old one, let it sync, etc. Still a bad idea.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 1:37 PM nokia ceph wrote:
>
> Hi Team,
>
> One of our old customers had Kraken and they are going to upgrade to
> Luminous. In the process they are also requesting a downgrade procedure.
> Kraken used leveldb for ceph-mon data; from Luminous it changed to rocksdb.
> The upgrade works without any issues.
>
> When we downgrade, ceph-mon does not start and the mon kv_backend does not
> move away from rocksdb.
>
> After the downgrade, when kv_backend is rocksdb, the following error is
> thrown by ceph-mon, which tries to load data from rocksdb and ends up in
> this error:
>
> 2019-08-21 11:22:45.200188 7f1a0406f7c0 4 rocksdb: Recovered from manifest
> file:/var/lib/ceph/mon/ceph-cn1/store.db/MANIFEST-000716
> succeeded, manifest_file_number is 716, next_file_number is 718,
> last_sequence is 311614, log_number is 0, prev_log_number is 0,
> max_column_family is 0
> 2019-08-21 11:22:45.200198 7f1a0406f7c0 4 rocksdb: Column family [default]
> (ID 0), log number is 715
> 2019-08-21 11:22:45.200247 7f1a0406f7c0 4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1566386565200240, "job": 1, "event": "recovery_started",
> "log_files": [717]}
> 2019-08-21 11:22:45.200252 7f1a0406f7c0 4 rocksdb: Recovering log #717 mode 2
> 2019-08-21 11:22:45.200282 7f1a0406f7c0 4 rocksdb: Creating manifest 719
> 2019-08-21 11:22:45.201222 7f1a0406f7c0 4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1566386565201218, "job": 1, "event": "recovery_finished"}
> 2019-08-21 11:22:45.202582 7f1a0406f7c0 4 rocksdb: DB pointer 0x55d4dacf
> 2019-08-21 11:22:45.202726 7f1a0406f7c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={9=luminous ondisk
> layout}
> 2019-08-21 11:22:45.202735 7f1a0406f7c0 -1 error checking features: (1)
> Operation not permitted
>
> We changed the kv_backend file inside /var/lib/ceph/mon/ceph-cn1 to leveldb
> and ceph-mon failed with the following error:
>
> 2019-08-21 11:24:07.922978 7fc5a25de7c0 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.922983 7fc5a25de7c0 0 set uid:gid to 167:167 (ceph:ceph)
> 2019-08-21 11:24:07.923009 7fc5a25de7c0 0 ceph version 11.2.0
> (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 3509050
> 2019-08-21 11:24:07.923050 7fc5a25de7c0 0 pidfile_write: ignore empty
> --pid-file
> 2019-08-21 11:24:07.944867 7fc5a25de7c0 -1 WARNING: the following dangerous
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.950304 7fc5a25de7c0 0 load: jerasure load: lrc load: isa
> 2019-08-21 11:24:07.950563 7fc5a25de7c0 -1 error opening mon data directory
> at '/var/lib/ceph/mon/ceph-cn1': (22) Invalid argument
>
> Is there any possibility to toggle the ceph-mon db between leveldb and
> rocksdb? We tried to add mon_keyvaluedb = leveldb and
> filestore_omap_backend = leveldb in ceph.conf; that also did not work.
>
> thanks,
> Muthu
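As an aside on the attempt described above: the backend a mon was created with is recorded in a one-line marker file, but editing that file does not convert the on-disk store itself, which is consistent with the errors shown. A sketch of inspecting it (path from the post):

```shell
# Shows which key-value backend this mon's store was created with.
# Editing this file does NOT convert the store; the mon will just fail
# to open a rocksdb store as leveldb (or vice versa).
cat /var/lib/ceph/mon/ceph-cn1/kv_backend
```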
[ceph-users] mon db change from rocksdb to leveldb
Hi Team,

One of our old customers had Kraken and they are going to upgrade to Luminous. In the process they are also requesting a downgrade procedure. Kraken used leveldb for ceph-mon data; from Luminous it changed to rocksdb. The upgrade works without any issues.

When we downgrade, ceph-mon does not start and the mon kv_backend does not move away from rocksdb.

After the downgrade, when kv_backend is rocksdb, the following error is thrown by ceph-mon, which tries to load data from rocksdb and ends up in this error:

2019-08-21 11:22:45.200188 7f1a0406f7c0 4 rocksdb: Recovered from manifest file:/var/lib/ceph/mon/ceph-cn1/store.db/MANIFEST-000716 succeeded, manifest_file_number is 716, next_file_number is 718, last_sequence is 311614, log_number is 0, prev_log_number is 0, max_column_family is 0
2019-08-21 11:22:45.200198 7f1a0406f7c0 4 rocksdb: Column family [default] (ID 0), log number is 715
2019-08-21 11:22:45.200247 7f1a0406f7c0 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1566386565200240, "job": 1, "event": "recovery_started", "log_files": [717]}
2019-08-21 11:22:45.200252 7f1a0406f7c0 4 rocksdb: Recovering log #717 mode 2
2019-08-21 11:22:45.200282 7f1a0406f7c0 4 rocksdb: Creating manifest 719
2019-08-21 11:22:45.201222 7f1a0406f7c0 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1566386565201218, "job": 1, "event": "recovery_finished"}
2019-08-21 11:22:45.202582 7f1a0406f7c0 4 rocksdb: DB pointer 0x55d4dacf
2019-08-21 11:22:45.202726 7f1a0406f7c0 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={9=luminous ondisk layout}
2019-08-21 11:22:45.202735 7f1a0406f7c0 -1 error checking features: (1) Operation not permitted

We changed the kv_backend file inside /var/lib/ceph/mon/ceph-cn1 to leveldb and ceph-mon failed with the following error:

2019-08-21 11:24:07.922978 7fc5a25de7c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2019-08-21 11:24:07.922983 7fc5a25de7c0 0 set uid:gid to 167:167 (ceph:ceph)
2019-08-21 11:24:07.923009 7fc5a25de7c0 0 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 3509050
2019-08-21 11:24:07.923050 7fc5a25de7c0 0 pidfile_write: ignore empty --pid-file
2019-08-21 11:24:07.944867 7fc5a25de7c0 -1 WARNING: the following dangerous and experimental features are enabled: bluestore,rocksdb
2019-08-21 11:24:07.950304 7fc5a25de7c0 0 load: jerasure load: lrc load: isa
2019-08-21 11:24:07.950563 7fc5a25de7c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-cn1': (22) Invalid argument

Is there any possibility to toggle the ceph-mon db between leveldb and rocksdb? We tried to add mon_keyvaluedb = leveldb and filestore_omap_backend = leveldb in ceph.conf; that also did not work.

thanks,
Muthu
Re: [ceph-users] cephfs-snapshots causing mds failover, hangs
hi zheng,

On 8/21/19 4:32 AM, Yan, Zheng wrote:
> Please enable debug mds (debug_mds=10), and try reproducing it again.

we will get back with the logs on monday.

thank you & with kind regards,
t.
Re: [ceph-users] fixing a bad PG per OSD decision with pg-autoscaling?
Hi Nigel,

In Nautilus you can decrease PGs, but it takes weeks; for example, going from 4096 to 2048 took us more than 2 weeks.

First of all, pg-autoscaling can be activated per pool. You're going to get a lot of warnings, but it works. Normally it is recommended to upgrade a cluster in the HEALTH_OK state. It is also recommended to use the upmap method to get a perfect distribution via the balancer module, but it doesn't work while the cluster is in misplaced/degraded error states.

From my point of view, I would try to get healthy first, then upgrade. Remember that you MUST repair all your pre-Nautilus SSDs, because the statistics scheme changed.

Regards,
Manuel

From: ceph-users On behalf of Nigel Williams
Sent: Wednesday, August 21, 2019 0:33
To: ceph-users
Subject: [ceph-users] fixing a bad PG per OSD decision with pg-autoscaling?

Due to a gross miscalculation several years ago I set way too many PGs for our original Hammer cluster. We've lived with it ever since, but now that we are on Luminous, changes result in stuck requests and balancing problems. The cluster currently has 12% misplaced objects and is grinding to re-balance, but is unusable to clients (even with osd_max_pg_per_osd_hard_ratio set to 32, and mon_max_pg_per_osd set to 1000).

Can I safely press on upgrading to Nautilus in this state so I can enable pg-autoscaling to finally fix the problem?

thanks.
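Once on Nautilus, opting pools into the autoscaler looks roughly like this. The pool name is a placeholder; starting in advisory "warn" mode before letting it act is a cautious choice, not a requirement.

```shell
# Enable the autoscaler module and review its recommendations first.
ceph mgr module enable pg_autoscaler
ceph osd pool autoscale-status

# Then opt a pool in: advisory only ("warn"), or fully automatic ("on").
ceph osd pool set <pool-name> pg_autoscale_mode warn
ceph osd pool set <pool-name> pg_autoscale_mode on
```

Note that even with the autoscaler, a large PG-count decrease still happens via the slow merge process Manuel describes, so expect it to take a long time on a big pool.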