Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-21 Thread Reed Dier
Just chiming in to say that I too had some issues with backfill_toofull PGs, 
despite no OSDs being in a backfillfull state, albeit there were some 
nearfull OSDs.

I was able to get through it by reweighting down the OSD that was the target 
reported by ceph pg dump | grep 'backfill_toofull'.
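
For the archives, the rough sequence was something like this (the OSD id and
weight below are only illustrative, not the actual values I used):

$ ceph pg dump | grep backfill_toofull
$ ceph osd reweight 79 0.95    # temporarily lower the override weight of the target OSD

The override weight can be set back to 1.0 once backfill completes.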

This was on 14.2.2.

Reed

> On Aug 21, 2019, at 2:50 PM, Vladimir Brik wrote:
> 
> Hello
> 
> After increasing the number of PGs in a pool, ceph status is reporting "Degraded 
> data redundancy (low space): 1 pg backfill_toofull", but I don't understand 
> why, because all OSDs seem to have enough space.
> 
> ceph health detail says:
> pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> 
> $ ceph pg map 40.155
> osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> 
> So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way around?). 
> According to "osd df", OSD 66's utilization is 71.90% and OSD 79's utilization 
> is 58.45%. The OSD with the least free space in the cluster is 81.23% full, and 
> it's not any of the ones above.
> 
> OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> $ ceph osd dump | grep ratio
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.7
> 
> Does anybody know why a PG could be in the backfill_toofull state if no OSD 
> is in the backfillfull state?
> 
> 
> Vlad





Re: [ceph-users] Multiple CephFS Filesystems Nautilus (14.2.2)

2019-08-21 Thread Patrick Donnelly
On Wed, Aug 21, 2019 at 2:02 PM  wrote:
> How experimental is the multiple CephFS filesystems per cluster feature?  We 
> plan to use different sets of pools (meta / data) per filesystem.
>
> Are there any known issues?

No. It will likely work fine, but some things may change in a future
version in ways that make upgrading more difficult.
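
For reference, creating a second file system looks roughly like this (pool
names and PG counts here are only placeholders):

$ ceph fs flag set enable_multiple true --yes-i-really-mean-it
$ ceph osd pool create cephfs2_metadata 32
$ ceph osd pool create cephfs2_data 128
$ ceph fs new cephfs2 cephfs2_metadata cephfs2_data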

> While we're on the subject, is it possible to assign a different active MDS 
> to each filesystem?

The monitors do the assignment. You cannot specify which file system
an MDS serves.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D


[ceph-users] Multiple CephFS Filesystems Nautilus (14.2.2)

2019-08-21 Thread DHilsbos
All;

How experimental is the multiple CephFS filesystems per cluster feature?  We 
plan to use different sets of pools (meta / data) per filesystem.

Are there any known issues?

While we're on the subject, is it possible to assign a different active MDS to 
each filesystem?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com





Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik

> Are you running multisite?
No

> Do you have dynamic bucket resharding turned on?
Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?
I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
Multiple, but I have not seen all 3 being in this state simultaneously. 
Currently one gateway has 1 thread using 100% of CPU, and another has 5 
threads each using 100% CPU.


Here are the fruits of my attempts to capture the call graph using perf 
and gdbpmp:

https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note that I couldn't stop 
perf from generating the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples ...

Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning:
Processed 574207 events and lost 4 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]





Vlad



On 8/21/19 11:16 AM, J. Eric Ivancich wrote:

On 8/21/19 10:22 AM, Mark Nelson wrote:

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello



[much elided]


You might want to try grabbing a callgraph from perf instead of just
running perf top, or using my wallclock profiler, to see if you can drill
down and find out where in that method it's spending the most time.


I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric




[ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-21 Thread Vladimir Brik

Hello

After increasing the number of PGs in a pool, ceph status is reporting 
"Degraded data redundancy (low space): 1 pg backfill_toofull", but I 
don't understand why, because all OSDs seem to have enough space.


ceph health detail says:
pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]

$ ceph pg map 40.155
osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]

So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way 
around?). According to "osd df", OSD 66's utilization is 71.90% and OSD 
79's utilization is 58.45%. The OSD with the least free space in the cluster 
is 81.23% full, and it's not any of the ones above.


OSD backfillfull_ratio is 90% (is there a better way to determine this?):
$ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.7

Does anybody know why a PG could be in the backfill_toofull state if no 
OSD is in the backfillfull state?



Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread J. Eric Ivancich
On 8/21/19 10:22 AM, Mark Nelson wrote:
> Hi Vladimir,
> 
> 
> On 8/21/19 8:54 AM, Vladimir Brik wrote:
>> Hello
>>

[much elided]

> You might want to try grabbing a callgraph from perf instead of just
> running perf top, or using my wallclock profiler, to see if you can drill
> down and find out where in that method it's spending the most time.

I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric

-- 
J. Eric Ivancich
he/him/his
Red Hat Storage
Ann Arbor, Michigan, USA


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik
Correction: the number of threads stuck using 100% of a CPU core varies 
from 1 to 5 (it's not always 5)


Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
radosgw process on those machines starts consuming 100% of 5 CPU cores 
for days at a time, even though the machine is not being used for data 
transfers (nothing in radosgw logs, couple of KB/s of network).


This situation can affect any number of our rados gateways, lasts from 
few hours to few days and stops if radosgw process is restarted or on 
its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, which, 
if I interpret things correctly, is called from a symbol with a long 
name that contains the substring "boost9intrusive9list_impl"


This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Paul Emmerich
On Wed, Aug 21, 2019 at 3:55 PM Vladimir Brik wrote:
>
> Hello
>
> I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically,
> radosgw process on those machines starts consuming 100% of 5 CPU cores
> for days at a time, even though the machine is not being used for data
> transfers (nothing in radosgw logs, couple of KB/s of network).
>
> This situation can affect any number of our rados gateways, lasts from
> few hours to few days and stops if radosgw process is restarted or on
> its own.
>
> Does anybody have an idea what might be going on or how to debug it? I
> don't see anything obvious in the logs. Perf top is saying that CPU is
> consumed by radosgw shared object in symbol get_obj_data::flush, which,
> if I interpret things correctly, is called from a symbol with a long
> name that contains the substring "boost9intrusive9list_impl"
>
> This is our configuration:
> rgw_frontends = civetweb num_threads=5000 port=443s
> ssl_certificate=/etc/ceph/rgw.crt
> error_log_file=/var/log/ceph/civetweb.error.log

Probably unrelated to your problem, but running with lots of threads
is usually an indicator that the async beast frontend would be a
better fit for your setup.
(But the code you see in perf should not be related to the frontend)
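
If you want to try it, the switch is roughly a one-line change (the ssl option
names here are from memory, double-check them for your release):

rgw_frontends = beast port=80 ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt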


Paul

>
> (error log file doesn't exist)
>
>
> Thanks,
>
> Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Mark Nelson

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. 
Periodically, radosgw process on those machines starts consuming 100% 
of 5 CPU cores for days at a time, even though the machine is not 
being used for data transfers (nothing in radosgw logs, couple of KB/s 
of network).


This situation can affect any number of our rados gateways, lasts from 
few hours to few days and stops if radosgw process is restarted or on 
its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, 
which, if I interpret things correctly, is called from a symbol with a 
long name that contains the substring "boost9intrusive9list_impl"



I don't normally look at the RGW code so maybe Matt/Casey/Eric can chime 
in.  That code is in src/rgw/rgw_rados.cc in the get_obj_data struct.  
The flush method does some sorting/merging and then walks through a 
list of completed IOs and appears to copy a bufferlist out of each 
one, then deletes it from the list and passes the BL off to 
client_cb->handle_data.  Looks like it could be pretty CPU intensive, but 
if you are seeing that much CPU for that long it sounds like something 
is rather off.



You might want to try grabbing a callgraph from perf instead of just 
running perf top, or using my wallclock profiler, to see if you can drill 
down and find out where in that method it's spending the most time.



My wallclock profiler is here:


https://github.com/markhpc/gdbpmp
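
Typical usage is roughly the following (the -i flag to read the profile back
is from memory, check the README; <pid> is your radosgw process):

$ ./gdbpmp.py -p <pid> -n 1000 -o gdbpmp.data
$ ./gdbpmp.py -i gdbpmp.data

The second invocation prints the collected wallclock call tree.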


Mark




This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


[ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
radosgw process on those machines starts consuming 100% of 5 CPU cores 
for days at a time, even though the machine is not being used for data 
transfers (nothing in radosgw logs, couple of KB/s of network).


This situation can affect any number of our rados gateways, lasts from 
few hours to few days and stops if radosgw process is restarted or on 
its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that CPU is 
consumed by radosgw shared object in symbol get_obj_data::flush, which, 
if I interpret things correctly, is called from a symbol with a long 
name that contains the substring "boost9intrusive9list_impl"


This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


Re: [ceph-users] Applications slow in VMs running RBD disks

2019-08-21 Thread EDH - Manuel Rios Fernandez
Use a 100% flash setup and avoid rotational disks if you want to get decent 
performance with Windows.

 

Windows is very sensitive to disk latency, and a laggy interface sometimes gives 
customers a bad impression.

 

You can check your average read/write latency in Grafana for Ceph; when it goes 
above 50 ms or so, performance in Windows is painful.
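
If you don't have Grafana handy, a quick spot check from the CLI (values are
reported in milliseconds):

$ ceph osd perf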

 

Regards

 

Manuel

 

 

From: ceph-users  On behalf of Gesiel Galvão 
Bernardes
Sent: Wednesday, August 21, 2019 14:26
To: ceph-users 
Subject: [ceph-users] Applications slow in VMs running RBD disks
Asunto: [ceph-users] Applications slow in VMs running RBD disks

 

Hi,

I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm having 
problems with slowness in applications that often are not consuming much CPU or 
RAM. This problem affects mostly Windows. Apparently the problem is that the 
application normally loads many small files (e.g. DLLs), and these files take a long 
time to load, causing the slowness.

I'm using 8 TB disks with 3x replication (I've tried erasure coding and 2x too), and 
I've tried with and without an SSD cache, and the problem persists. Using the 
same disks with NFS, the applications run fine.

I've already tried changing the RBD object size (from 4 MB to 128 KB), using the qemu 
writeback cache, configuring virtio-scsi queues, and using the virtio (virtio-blk) driver, 
and none of these brought effective improvement for this problem.

Has anyone had a similar problem and/or any idea how to solve this? Or an idea of 
where I should look to resolve this?

Thanks in advance,

Gesiel

 

 



Re: [ceph-users] Applications slow in VMs running RBD disks

2019-08-21 Thread Gesiel Galvão Bernardes
Hi Eliza,

On Wed, Aug 21, 2019 at 09:30, Eliza wrote:

> Hi
>
> on 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
> > I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm
> > having problems with slowness in applications that often are not
> > consuming much CPU or RAM. This problem affects mostly Windows. Apparently
> > the problem is that the application normally loads many small files (e.g.
> > DLLs), and these files take a long time to load, causing the slowness.
>
> Did you check/test your network connection?
> Do you have a fast network setup?


I have a bond of two 10 GbE interfaces, with little use.

>
> regards.


Re: [ceph-users] Applications slow in VMs running RBD disks

2019-08-21 Thread Eliza

Hi

on 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm 
having problems with slowness in applications that often are not 
consuming much CPU or RAM. This problem affects mostly Windows. Apparently 
the problem is that the application normally loads many small files (e.g. 
DLLs), and these files take a long time to load, causing the slowness.


Did you check/test your network connection?
Do you have a fast network setup?

regards.


[ceph-users] Applications slow in VMs running RBD disks

2019-08-21 Thread Gesiel Galvão Bernardes
Hi,

I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm having
problems with slowness in applications that often are not consuming much
CPU or RAM. This problem affects mostly Windows. Apparently the problem is
that the application normally loads many small files (e.g. DLLs), and these
files take a long time to load, causing the slowness.

I'm using 8 TB disks with 3x replication (I've tried erasure coding and 2x too),
and I've tried with and without an SSD cache, and the problem persists.
Using the same disks with NFS, the applications run fine.

I've already tried changing the RBD object size (from 4 MB to 128 KB), using
the qemu writeback cache, configuring virtio-scsi queues, and using the virtio
(virtio-blk) driver, and none of these brought effective improvement for this
problem.
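
For reference, the writeback test was along these lines (pool/image names are
placeholders, and the rest of the qemu command line is omitted):

-drive format=raw,file=rbd:rbd/vm-disk-1,cache=writeback,if=virtio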

Has anyone had a similar problem and/or any idea how to solve this? Or an
idea of where I should look to resolve this?

Thanks in advance,

Gesiel


Re: [ceph-users] mon db change from rocksdb to leveldb

2019-08-21 Thread Paul Emmerich
You can't downgrade from Luminous to Kraken, at least not officially.

I guess it might somehow work, but you'd need to re-create all
the services. For the mon example: delete a mon, create a new one on the
old version, let it sync, etc.
Still a bad idea.
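
Very roughly, per mon, that dance would look something like this (untested
sketch; the mon id, paths and keyring are placeholders):

# ceph mon remove cn1
# (reinstall the old packages, wipe /var/lib/ceph/mon/ceph-cn1)
# ceph mon getmap -o /tmp/monmap
# ceph-mon -i cn1 --mkfs --monmap /tmp/monmap --keyring /path/to/mon.keyring
# systemctl start ceph-mon@cn1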

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 21, 2019 at 1:37 PM nokia ceph  wrote:
>
> Hi Team,
>
> One of our old customers has Kraken and is going to upgrade to Luminous. As
> part of the process they are also requesting a downgrade procedure.
> Kraken used leveldb for ceph-mon data; from Luminous it changed to rocksdb, and
> the upgrade works without any issues.
>
> When we downgrade, ceph-mon does not start and the mon kv_backend does not
> move away from rocksdb.
>
> After the downgrade, when kv_backend is rocksdb, the following error is thrown
> by ceph-mon while trying to load data from rocksdb:
>
> 2019-08-21 11:22:45.200188 7f1a0406f7c0  4 rocksdb: Recovered from manifest 
> file:/var/lib/ceph/mon/ceph-cn1/store.db/MANIFEST-000716 
> succeeded,manifest_file_number is 716, next_file_number is 718, last_sequence 
> is 311614, log_number is 0,prev_log_number is 0,max_column_family is 0
>
> 2019-08-21 11:22:45.200198 7f1a0406f7c0  4 rocksdb: Column family [default] 
> (ID 0), log number is 715
>
> 2019-08-21 11:22:45.200247 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1566386565200240, "job": 1, "event": "recovery_started", 
> "log_files": [717]}
> 2019-08-21 11:22:45.200252 7f1a0406f7c0  4 rocksdb: Recovering log #717 mode 2
> 2019-08-21 11:22:45.200282 7f1a0406f7c0  4 rocksdb: Creating manifest 719
>
> 2019-08-21 11:22:45.201222 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1 
> {"time_micros": 1566386565201218, "job": 1, "event": "recovery_finished"}
> 2019-08-21 11:22:45.202582 7f1a0406f7c0  4 rocksdb: DB pointer 0x55d4dacf
> 2019-08-21 11:22:45.202726 7f1a0406f7c0 -1 ERROR: on disk data includes 
> unsupported features: compat={},rocompat={},incompat={9=luminous ondisk 
> layout}
> 2019-08-21 11:22:45.202735 7f1a0406f7c0 -1 error checking features: (1) 
> Operation not permitted
>
> We changed the kv_backend file inside /var/lib/ceph/mon/ceph-cn1 to leveldb,
> and ceph-mon failed with the following error:
>
> 2019-08-21 11:24:07.922978 7fc5a25de7c0 -1 WARNING: the following dangerous 
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.922983 7fc5a25de7c0  0 set uid:gid to 167:167 (ceph:ceph)
> 2019-08-21 11:24:07.923009 7fc5a25de7c0  0 ceph version 11.2.0 
> (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 3509050
> 2019-08-21 11:24:07.923050 7fc5a25de7c0  0 pidfile_write: ignore empty 
> --pid-file
> 2019-08-21 11:24:07.944867 7fc5a25de7c0 -1 WARNING: the following dangerous 
> and experimental features are enabled: bluestore,rocksdb
> 2019-08-21 11:24:07.950304 7fc5a25de7c0  0 load: jerasure load: lrc load: isa
> 2019-08-21 11:24:07.950563 7fc5a25de7c0 -1 error opening mon data directory 
> at '/var/lib/ceph/mon/ceph-cn1': (22) Invalid argument
>
> Is there any way to toggle the ceph-mon db between leveldb and rocksdb?
> We tried adding mon_keyvaluedb = leveldb and filestore_omap_backend = leveldb
> to ceph.conf, but that did not work either.
> thanks,
> Muthu


[ceph-users] mon db change from rocksdb to leveldb

2019-08-21 Thread nokia ceph
Hi Team,

One of our old customers has Kraken and is going to upgrade to Luminous.
As part of the process they are also requesting a downgrade procedure.
Kraken used leveldb for ceph-mon data; from Luminous it changed to rocksdb,
and the upgrade works without any issues.

When we downgrade, ceph-mon does not start and the mon kv_backend does not
move away from rocksdb.

After the downgrade, when kv_backend is rocksdb, the following error is
thrown by ceph-mon while trying to load data from rocksdb:

2019-08-21 11:22:45.200188 7f1a0406f7c0  4 rocksdb: Recovered from manifest
file:/var/lib/ceph/mon/ceph-cn1/store.db/MANIFEST-000716
succeeded,manifest_file_number is 716, next_file_number is 718,
last_sequence is 311614, log_number is 0,prev_log_number is
0,max_column_family is 0

2019-08-21 11:22:45.200198 7f1a0406f7c0  4 rocksdb: Column family [default]
(ID 0), log number is 715

2019-08-21 11:22:45.200247 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1566386565200240, "job": 1, "event": "recovery_started",
"log_files": [717]}
2019-08-21 11:22:45.200252 7f1a0406f7c0  4 rocksdb: Recovering log #717
mode 2
2019-08-21 11:22:45.200282 7f1a0406f7c0  4 rocksdb: Creating manifest 719

2019-08-21 11:22:45.201222 7f1a0406f7c0  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1566386565201218, "job": 1, "event": "recovery_finished"}
2019-08-21 11:22:45.202582 7f1a0406f7c0  4 rocksdb: DB pointer
0x55d4dacf
2019-08-21 11:22:45.202726 7f1a0406f7c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={9=luminous ondisk
layout}
2019-08-21 11:22:45.202735 7f1a0406f7c0 -1 error checking features: (1)
Operation not permitted

We changed the kv_backend file inside /var/lib/ceph/mon/ceph-cn1 to leveldb,
and ceph-mon failed with the following error:

2019-08-21 11:24:07.922978 7fc5a25de7c0 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2019-08-21 11:24:07.922983 7fc5a25de7c0  0 set uid:gid to 167:167
(ceph:ceph)
2019-08-21 11:24:07.923009 7fc5a25de7c0  0 ceph version 11.2.0
(f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mon, pid 3509050
2019-08-21 11:24:07.923050 7fc5a25de7c0  0 pidfile_write: ignore empty
--pid-file
2019-08-21 11:24:07.944867 7fc5a25de7c0 -1 WARNING: the following dangerous
and experimental features are enabled: bluestore,rocksdb
2019-08-21 11:24:07.950304 7fc5a25de7c0  0 load: jerasure load: lrc load:
isa
2019-08-21 11:24:07.950563 7fc5a25de7c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-cn1': (22) Invalid argument

Is there any way to toggle the ceph-mon db between leveldb and rocksdb?
We tried adding mon_keyvaluedb = leveldb and filestore_omap_backend = leveldb
to ceph.conf, but that did not work either.
thanks,
Muthu


Re: [ceph-users] cephfs-snapshots causing mds failover, hangs

2019-08-21 Thread thoralf schulze
hi zheng,

On 8/21/19 4:32 AM, Yan, Zheng wrote:
> Please enable debug mds (debug_mds=10), and try reproducing it again.

we will get back with the logs on monday.

thank you & with kind regards,
t.





Re: [ceph-users] fixing a bad PG per OSD decision with pg-autoscaling?

2019-08-21 Thread EDH - Manuel Rios Fernandez
HI Nigel,

 

In Nautilus you can decrease PG counts, but it takes weeks; for example, going 
from 4096 to 2048 took us more than two weeks.

 

First of all, pg-autoscaling can be enabled per pool. You're going to get a lot 
of warnings, but it works.
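
For example (the pool name is a placeholder):

ceph mgr module enable pg_autoscaler
ceph osd pool set <pool> pg_autoscale_mode warn   # or "on" to let it act
ceph osd pool autoscale-status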

 

Normally it is recommended to upgrade a cluster that is in a HEALTH_OK state.

 

It is also recommended to use the upmap mode of the balancer module to get a 
near-perfect distribution, but it doesn't work while there are misplaced/degraded states.

 

From my point of view, I would try to get healthy first, then upgrade.

 

Remember that you MUST repair all your pre-Nautilus OSDs, because the on-disk 
statistics scheme changed.
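
If that refers to the legacy BlueStore statistics warning, the per-OSD repair is
roughly as follows, with the OSD stopped first (the id is a placeholder):

systemctl stop ceph-osd@12
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12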

 

Regards

 

Manuel

 

 

From: ceph-users  On behalf of Nigel Williams
Sent: Wednesday, August 21, 2019 0:33
To: ceph-users 
Subject: [ceph-users] fixing a bad PG per OSD decision with pg-autoscaling?

 

Due to a gross miscalculation several years ago I set way too many PGs for our 
original Hammer cluster. We've lived with it ever since, but now that we are on 
Luminous, changes result in stuck requests and balancing problems. 

 

The cluster currently has 12% misplaced, and is grinding to re-balance but is 
unusable to clients (even with osd_max_pg_per_osd_hard_ratio set to 32, and 
mon_max_pg_per_osd set to 1000).

 

Can I safely press on upgrading to Nautilus in this state so I can enable the 
pg-autoscaling to finally fix the problem?

 

thanks.

 
