[ceph-users] pg_autoscaler is not working

2019-11-26 Thread Thomas Schneider
Hi,

I enabled pg_autoscaler on a specific pool, ssd.
I failed to increase pg_num / pgp_num on pool ssd to 1024:
root@ld3955:~# ceph osd pool autoscale-status
 POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 cephfs_metadata  395.8M               3.0   118.9T        0.                    4.0   8                   off
 hdb_backup       713.2T               3.0   1354T         1.5793                1.0   16384               off
 nvme             0                    2.0   23840G        0.                    1.0   128                 off
 cephfs_data      1068G                3.0   118.9T        0.0263                1.0   32                  off
 hdd              733.9G               3.0   118.9T        0.0181                1.0   2048                off
 ssd              1711G                2.0   27771G        0.1233                1.0   1024                on

The pg_num target for this pool is correctly set to 1024:
root@ld3955:~# ceph osd pool ls detail
pool 11 'hdb_backup' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 16384 pgp_num 10977 pgp_num_target 16384
last_change 344888 lfor 0/0/319352 flags hashpspool,selfmanaged_snaps
stripe_width 0 pg_num_min 8192 application rbd
    removed_snaps [1~3]
pool 59 'hdd' replicated size 3 min_size 2 crush_rule 3 object_hash
rjenkins pg_num 2048 pgp_num 2048 last_change 319283 lfor
307105/317145/317153 flags hashpspool,selfmanaged_snaps stripe_width 0
pg_num_min 1024 application rbd
    removed_snaps [1~3]
pool 60 'ssd' replicated size 2 min_size 2 crush_rule 4 object_hash
rjenkins pg_num 512 pgp_num 512 pg_num_target 1024 pgp_num_target 1024
autoscale_mode on last_change 341736 lfor 305915/305915/305915 flags
hashpspool,selfmanaged_snaps,creating stripe_width 0 pg_num_min 512
application rbd
    removed_snaps [1~3]
pool 62 'cephfs_data' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 last_change 319282 lfor
300310/300310/300310 flags hashpspool stripe_width 0 pg_num_min 32
application cephfs
pool 63 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 8 pgp_num 8 last_change 319280 flags
hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8
recovery_priority 5 application cephfs
pool 65 'nvme' replicated size 2 min_size 2 crush_rule 2 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 319281 flags hashpspool
stripe_width 0 pg_num_min 128 application rbd

However, there is no activity on the cluster for pool ssd, which means the
pg_num is not increasing.
The cluster is working on another pool, hdb_backup, though; the pg_num of
that pool was recently changed to 16384 (to be precise, on Monday).

What makes things worse is that now I cannot increase pg_num (or
pgp_num) manually.
root@ld3955:~# ceph osd pool get ssd pg_num
pg_num: 512
root@ld3955:~# ceph osd pool get ssd pgp_num
pgp_num: 512
root@ld3955:~# ceph osd pool set ssd pg_num 1024
root@ld3955:~# ceph osd pool get ssd pg_num
pg_num: 512

How can I increase pg_num / pgp_num of pool ssd?
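
(One sequence that might be worth trying — an untested sketch, assuming it is
acceptable to take manual control of this pool again — is to disable the
autoscaler for the pool first and then bump pg_num and pgp_num by hand:

ceph osd pool set ssd pg_autoscale_mode off
ceph osd pool set ssd pg_num 1024
ceph osd pool set ssd pgp_num 1024
)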

THX

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cannot increase pg_num / pgp_num on a pool

2019-11-24 Thread Thomas

Hi,
I failed to increase pg_num / pgp_num on pool ssd to 1024:

root@ld3976:~# ceph osd pool get ssd pg_num
pg_num: 512
root@ld3976:~# ceph osd pool get ssd pgp_num
pgp_num: 512
root@ld3976:~# ceph osd pool set ssd pg_num 1024
root@ld3976:~# ceph osd pool get ssd pg_num
pg_num: 512

When I check autoscale-status, it shows PG_NUM=1024 for pool ssd.
root@ld3976:~# ceph osd pool autoscale-status
 POOL             SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 cephfs_metadata  395.8M               3.0   118.9T        0.                    4.0   8                   off
 hdb_backup       725.4T               3.0   1354T         1.6064                1.0   8192                off
 nvme             0                    2.0   23840G        0.                    1.0   128                 off
 cephfs_data      1068G                3.0   118.9T        0.0263                1.0   32                  off
 hdd              733.9G               3.0   118.9T        0.0181                1.0   2048                off
 ssd              1711G                2.0   27771G        0.1233                1.0   1024                off


I disabled the pg_autoscaler just today due to issues with the MGR service
(see my separate thread about the MGR service).


How can I increase pg_num / pgp_num on pool ssd?

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Command ceph osd df hangs

2019-11-21 Thread Thomas Schneider
Hi,

issue solved!

I have stopped active MGR service and waited until standby MGR became
active.
Then I started the (previously stopped) MGR service in order to have 2
standby.
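
For reference, on a systemd deployment this boils down to something like the
following (the hostname is just an example from this cluster):

systemctl stop ceph-mgr@ld3955     # on the node running the active MGR
ceph -s                            # wait until a standby MGR has taken over
systemctl start ceph-mgr@ld3955    # bring the old daemon back as standby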

Thanks Eugen

On 21.11.2019 at 15:23, Eugen Block wrote:
> Hi,
>
> check if the active MGR is hanging.
> I had this when testing pg_autoscaler, after some time every command
> would hang. Restarting the MGR helped for a short period of time, then
> I disabled pg_autoscaler. This is an upgraded cluster, currently on
> Nautilus.
>
> Regards,
> Eugen
>
>
> Quoting Thomas Schneider <74cmo...@gmail.com>:
>
>> Hi,
>> command ceph osd df does not return any output.
>> Based on the strace output there's a timeout.
>> [...]
>> mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
>> 0) = 0x7f53006b9000
>> brk(0x55c2579b6000) = 0x55c2579b6000
>> brk(0x55c2579d7000) = 0x55c2579d7000
>> brk(0x55c2579f9000) = 0x55c2579f9000
>> brk(0x55c257a1a000) = 0x55c257a1a000
>> mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
>> 0) = 0x7f5300679000
>> brk(0x55c257a3b000) = 0x55c257a3b000
>> brk(0x55c257a5c000) = 0x55c257a5c000
>> brk(0x55c257a7d000) = 0x55c257a7d000
>> clone(child_stack=0x7f53095c1fb0,
>> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
>>
>> parent_tidptr=0x7f53095c29d0, tls=0x7f53095c2700,
>> child_tidptr=0x7f53095c29d0) = 3261669
>> futex(0x55c257489940, FUTEX_WAKE_PRIVATE, 1) = 1
>> futex(0x55c2576246e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
>> NULL, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily
>> unavailable)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=16000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=32000}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
>> select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}^Cstrace: Process
>> 3261645 detached
>>  
>> Interrupted
>> Traceback (most recent call last):
>>   File "/usr/bin/ceph", line 1263, in 
>>     retval = main()
>>   File "/usr/bin/ceph", line 1194, in main
>>
>>     verbose)
>>   File "/usr/bin/ceph", line 619, in new_style_command
>>     ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
>> sigdict, inbuf, verbose)
>>   File "/usr/bin/ceph", line 593, in do_command
>>     return ret, '', ''
>> UnboundLocalError: local variable 'ret' referenced before assignment
>>
>>
>> How can I fix this?
>> Do you need the full strace output to analyse this issue?
>>
>> This Ceph health status is reported since hours and I cannot identify
>> any progress. Not sure if this is related to the issue with ceph osd df,
>> though.
>>
>> 2019-11-21 15:00:00.000262 mon.ld5505 [ERR] overall HEALTH_ERR 1
>> filesystem is degraded; 1 filesystem has a failed mds daemon; 1
>> filesystem is offline; insufficient standby MDS daemons available;
>> nodown,noout,noscrub,nodeep-scrub flag(s) set; 81 osds down; Reduced
>> data availability: 1366 pgs inactive, 241 pgs peering; Degraded data
>> redundancy: 6437/190964568 objects degraded (0.003%), 7 pgs degraded, 7
>> pgs undersized; 1 subtrees have overcommitted pool target_size_bytes; 1
>> subtrees have overcommitted pool target_size_ratio
>>
>> THX
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Command ceph osd df hangs

2019-11-21 Thread Thomas Schneider
Hi,
command ceph osd df does not return any output.
Based on the strace output there's a timeout.
[...]
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f53006b9000
brk(0x55c2579b6000) = 0x55c2579b6000
brk(0x55c2579d7000) = 0x55c2579d7000
brk(0x55c2579f9000) = 0x55c2579f9000
brk(0x55c257a1a000) = 0x55c257a1a000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5300679000
brk(0x55c257a3b000) = 0x55c257a3b000
brk(0x55c257a5c000) = 0x55c257a5c000
brk(0x55c257a7d000) = 0x55c257a7d000
clone(child_stack=0x7f53095c1fb0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f53095c29d0, tls=0x7f53095c2700,
child_tidptr=0x7f53095c29d0) = 3261669
futex(0x55c257489940, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c2576246e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0,
NULL, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=5}^Cstrace: Process
3261645 detached
 
Interrupted
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1263, in 
    retval = main()
  File "/usr/bin/ceph", line 1194, in main

    verbose)
  File "/usr/bin/ceph", line 619, in new_style_command
    ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
sigdict, inbuf, verbose)
  File "/usr/bin/ceph", line 593, in do_command
    return ret, '', ''
UnboundLocalError: local variable 'ret' referenced before assignment


How can I fix this?
Do you need the full strace output to analyse this issue?

This Ceph health status has been reported for hours and I cannot identify
any progress. I am not sure if this is related to the issue with ceph osd df,
though.

2019-11-21 15:00:00.000262 mon.ld5505 [ERR] overall HEALTH_ERR 1
filesystem is degraded; 1 filesystem has a failed mds daemon; 1
filesystem is offline; insufficient standby MDS daemons available;
nodown,noout,noscrub,nodeep-scrub flag(s) set; 81 osds down; Reduced
data availability: 1366 pgs inactive, 241 pgs peering; Degraded data
redundancy: 6437/190964568 objects degraded (0.003%), 7 pgs degraded, 7
pgs undersized; 1 subtrees have overcommitted pool target_size_bytes; 1
subtrees have overcommitted pool target_size_ratio

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Thomas Schneider
Update:
Issue is solved.

The output of "ceph osd dump" showed that the required setting was
incorrect, i.e. it still said:
require_osd_release luminous

After executing
ceph osd require-osd-release nautilus
I can enable pg_autoscale_mode on any pool.
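
For anyone hitting the same thing, the check and fix boil down to (sketch):

ceph osd dump | grep require_osd_release
ceph osd require-osd-release nautilus
ceph osd pool set <pool> pg_autoscale_mode on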

THX

On 21.11.2019 at 13:51, Paul Emmerich wrote:
> "ceph osd dump" shows you if the flag is set
>
>
> Paul
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Thomas Schneider
Looks like the flag is not correct.

root@ld3955:~# ceph osd dump | grep nautilus
root@ld3955:~# ceph osd dump | grep require
require_min_compat_client luminous
require_osd_release luminous


On 21.11.2019 at 13:51, Paul Emmerich wrote:
> "ceph osd dump" shows you if the flag is set
>
>
> Paul
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Thomas Schneider
Hello Paul,

I didn't skip this step.

Actually, I'm sure that everything in the cluster is on Nautilus, because I
had issues with SLES 12SP2 clients whose outdated client tools could not
connect to Nautilus.

Would it make sense to execute
ceph osd require-osd-release nautilus
again?

THX

On 21.11.2019 at 12:17, Paul Emmerich wrote:
> ceph osd require-osd-release nautilus

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cannot enable pg_autoscale_mode

2019-11-21 Thread Thomas Schneider
Hi,

I am trying to enable pg_autoscale_mode on a specific pool of my cluster;
however, this returns an error.
root@ld3955:~# ceph osd pool set ssd pg_autoscale_mode on
Error EINVAL: must set require_osd_release to nautilus or later before
setting pg_autoscale_mode

The error message is clear, but my cluster is running Ceph 14.2.4;
please advise how to fix this.
root@ld3955:~# ceph --version
ceph version 14.2.4 (65249672c6e6d843510e7e01f8a4b976dcac3db1) nautilus
(stable)

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error in MGR log: auth: could not find secret_id

2019-11-20 Thread Thomas Schneider
Hi,
my Ceph cluster is in an unhealthy state and busy with recovery.

I'm observing the MGR log, and it shows this error message regularly:
2019-11-20 09:51:45.211 7f7205581700  0 auth: could not find secret_id=4193
2019-11-20 09:51:45.211 7f7205581700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:46.403 7f7205581700  0 auth: could not find secret_id=4193
2019-11-20 09:51:46.403 7f7205581700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:46.543 7f71f3826700  0 log_channel(cluster) log [DBG] :
pgmap v2508: 8432 pgs: 1 active+recovering+remapped, 1
active+remapped+backfilling, 4 active+recovering, 2
undersized+degraded+peered, 3 remapped+peering, 104 peering, 24
activating, 3 creating+peering, 8290 active+clean; 245 TiB data, 732 TiB
used, 791 TiB / 1.5 PiB avail; 67 KiB/s wr, 1 op/s; 8272/191737068
objects degraded (0.004%); 4392/191737068 objects misplaced (0.002%)
2019-11-20 09:51:46.603 7f7205d82700  0 auth: could not find secret_id=4193
2019-11-20 09:51:46.603 7f7205d82700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:46.947 7f7205d82700  0 auth: could not find secret_id=4193
2019-11-20 09:51:46.947 7f7205d82700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:47.015 7f7205d82700  0 auth: could not find secret_id=4193
2019-11-20 09:51:47.015 7f7205d82700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:47.815 7f7205d82700  0 auth: could not find secret_id=4193
2019-11-20 09:51:47.815 7f7205d82700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193
2019-11-20 09:51:48.567 7f71f3826700  0 log_channel(cluster) log [DBG] :
pgmap v2509: 8432 pgs: 1 active+recovering+remapped, 1
active+remapped+backfilling, 4 active+recovering, 2
undersized+degraded+peered, 3 remapped+peering, 104 peering, 24
activating, 3 creating+peering, 8290 active+clean; 245 TiB data, 732 TiB
used, 791 TiB / 1.5 PiB avail; 65 KiB/s wr, 0 op/s; 8272/191737068
objects degraded (0.004%); 4392/191737068 objects misplaced (0.002%)
2019-11-20 09:51:49.447 7f7204d80700  0 auth: could not find secret_id=4193
2019-11-20 09:51:49.447 7f7204d80700  0 cephx: verify_authorizer could
not get service secret for service mgr secret_id=4193

The relevant MON log entries for this timestamp are:
2019-11-20 09:51:41.559 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2019-11-20 09:51:41.559 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.? 10.97.206.97:0/1141066028' entity='client.admin'
cmd=[{"prefix":"df","format":"json"}]: dispatch
2019-11-20 09:51:45.847 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2019-11-20 09:51:45.847 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.? 10.97.206.91:0/1573121305' entity='client.admin'
cmd=[{"format":"json","prefix":"df"}]: dispatch
2019-11-20 09:51:46.307 7f4f2730f700  0 --1-
[v2:10.97.206.93:3300/0,v1:10.97.206.93:6789/0] >>  conn(0x56253e8f5180
0x56253ebc1800 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner
accept peer addr is really - (socket is v1:10.97.206.95:51494/0)
2019-11-20 09:51:46.839 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2019-11-20 09:51:46.839 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.? 10.97.206.99:0/413315398' entity='client.admin'
cmd=[{"format":"json","prefix":"df"}]: dispatch
2019-11-20 09:51:49.579 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2019-11-20 09:51:49.579 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.? 10.97.206.96:0/2753573650' entity='client.admin'
cmd=[{"prefix":"df","format":"json"}]: dispatch
2019-11-20 09:51:49.607 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2019-11-20 09:51:49.607 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.? 10.97.206.98:0/2643276575' entity='client.admin'
cmd=[{"format":"json","prefix":"df"}]: dispatch
^C2019-11-20 09:51:50.703 7f4f2730f700  0 --1-
[v2:10.97.206.93:3300/0,v1:10.97.206.93:6789/0] >>  conn(0x562542ed2400
0x562541a8d000 :6789 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner
accept peer addr is really - (socket is v1:10.97.206.98:52420/0)
2019-11-20 09:51:50.951 7f4f28311700  0 mon.ld5505@0(leader) e9
handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2019-11-20 09:51:50.951 7f4f28311700  0 log_channel(audit) log [DBG] :
from='client.127514502 10.97.206.92:0/3526816880' entity='client.admin'
cmd=[{"format":"json","prefix":"df"}]: dispatch


This auth issue must be fixed soon, because if not the error occurs
every second and 

Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Thomas
Hi,

I'm experiencing the same issue with these settings in ceph.conf:
    osd op queue = wpq
    osd op queue cut off = high

Furthermore I cannot read any old data in the relevant pool that is
serving CephFS.
However, I can write new data and read this new data.
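
For reference, in a plain ceph.conf these settings live in the [osd] section:

[osd]
osd op queue = wpq
osd op queue cut off = high

and, as far as I know, the OSDs need a restart to pick them up.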

Regards
Thomas

On 24.09.2019 at 10:24, Yoann Moulin wrote:
> Hello,
>
>>> I have a Ceph Nautilus Cluster 14.2.1 for cephfs only on 40x 1.8T SAS disk 
>>> (no SSD) in 20 servers.
>>>
>>> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow 
>>> requests, 0 included below; oldest blocked for > 60281.199503 secs"
>>>
>>> After a few investigations, I saw that ALL ceph-osd process eat a lot of 
>>> memory, up to 130GB RSS each. It this value normal? May this related to
>>> slow requests? Is disk only increasing the probability to get slow requests?
>> If you haven't set:
>>
>> osd op queue cut off = high
>>
>> in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
>> help quite a bit with pure HDD clusters.
> OK I'll try this, thanks.
>
> If I want to add this my ceph-ansible playbook parameters, in which files I 
> should add it and what is the best way to do it ?
>
> Add those 3 lines in all.yml or osds.yml ?
>
> ceph_conf_overrides:
>   global:
> osd_op_queue_cut_off: high
>
> Is there another (better?) way to do that?
>
> Thanks for your help.
>
> Best regards,
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help understanding EC object reads

2019-09-16 Thread Thomas Byrne - UKRI STFC
Thanks for responding!

It's good to hear that the primary OSD has some smarts when dealing with 
partial reads, and that seems to line up with what I was seeing, i.e. I would 
have expected drastically worse performance otherwise with our large object 
sizes and tiny block sizes.

I'm still seeing some performance degradation with the small block sizes, 
but I guess that is coming from the inefficiencies of lots of small requests 
(time spent queuing for the PG etc.), rather than anything related to EC.

Cheers,
Tom

> -Original Message-
> From: Gregory Farnum 
> Sent: 09 September 2019 23:25
> To: Byrne, Thomas (STFC,RAL,SC) 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Help understanding EC object reads
> 
> On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC
>  wrote:
> >
> > Hi all,
> >
> > I’m investigating an issue with our (non-Ceph) caching layers of our large 
> > EC
> cluster. It seems to be turning users requests for whole objects into lots of
> small byte range requests reaching the OSDs, but I’m not sure how
> inefficient this behaviour is in reality.
> >
> > My limited understanding of an EC object partial read is that the entire
> object is reconstructed on the primary OSD, and then the requested byte
> range is sent to the client before the primary discards the reconstructed
> object.
> 
> Ah, it's not necessarily the entire object is reconstructed, but that any 
> stripes
> covering the requested range are reconstructed. It's changed a bit over time
> and there are some knobs controlling it, but I believe this is generally
> efficient — if you ask for a byte range which simply lives on the primary, 
> it's
> not going to talk to the other OSDs to provide that data.
> 
> >
> > Assuming this is correct, do multiple reads for different byte ranges of the
> same object at effectively the same time result in the entire object being
> reconstructed once for each request, or does the primary do something
> clever and use the same reconstructed object for multiple requests before
> discarding it?
> 
> I'm pretty sure it's per-request; the EC pool code generally assumes you have
> another cache on top of RADOS that deals with combining these requests.
> There is a small cache in the OSD but IIRC it's just for keeping stuff 
> consistent
> while writes are in progress.
> -Greg
> 
> >
> > If I’m completely off the mark with what is going on under the hood here, a
> nudge in the right direction would be appreciated!
> >
> >
> >
> > Cheers,
> >
> > Tom
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD Down After Reboot

2019-08-29 Thread Thomas Sumpter
Hi Folks,

I have found similar reports of this problem in the past but can't seem to find 
any solution to it.
We have ceph filesystem running mimic version 13.2.5.
OSDs are running on AWS EC2 instances with centos 7. OSD disk is an AWS nvme 
device.

The problem is: sometimes when rebooting an OSD instance, the OSD volume fails to 
mount and the OSD cannot start.

ceph-volume.log repeats the following
[2019-08-28 09:10:42,061][ceph_volume.main][INFO  ] Running command: 
ceph-volume  lvm trigger 0-fcaffe93-4c03-403c-9702-7f1ec694a578
[2019-08-28 09:10:42,063][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/lvs --noheadings --readonly --separator=";" -o 
lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2019-08-28 09:10:42,074][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
in newfunc
   return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/main.py", line 
40, in main
terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
 File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, in 
is_root
return func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/trigger.py", 
line 70, in main
Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", 
line 339, in main
self.activate(args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/lvm/activate.py", 
line 249, in activate
raise RuntimeError('could not find osd.%s with fsid %s' % (osd_id, 
osd_fsid))
RuntimeError: could not find osd.0 with fsid 
fcaffe93-4c03-403c-9702-7f1ec694a578

ceph-volume-systemd.log repeats
[2019-08-28 09:10:41,877][systemd][INFO  ] raw systemd input received: 
lvm-0-fcaffe93-4c03-403c-9702-7f1ec694a578
[2019-08-28 09:10:41,877][systemd][INFO  ] parsed sub-command: lvm, extra data: 
0-fcaffe93-4c03-403c-9702-7f1ec694a578
[2019-08-28 09:10:41,926][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/ceph-volume lvm trigger 0-fcaffe93-4c03-403c-9702-7f1ec694a578
[2019-08-28 09:10:42,077][ceph_volume.process][INFO  ] stderr -->  
RuntimeError: could not find osd.0 with fsid 
fcaffe93-4c03-403c-9702-7f1ec694a578
[2019-08-28 09:10:42,084][systemd][WARNING] command returned non-zero exit 
status: 1
[2019-08-28 09:10:42,084][systemd][WARNING] failed activating OSD, retries 
left: 30

To recover I destroy the OSD, zap the disk and create it again.
# ceph osd destroy 0 --yes-i-really-mean-it
# ceph-volume lvm zap /dev/nvme1n1 --destroy
# ceph-volume lvm create --osd-id 0 --data /dev/nvme1n1
# systemctl start ceph-osd@0

Is there something I need to do so that the OSD can boot without these problems?

Thank you!
Tom


ceph-volume.log
Description: ceph-volume.log


ceph-volume-systemd.log
Description: ceph-volume-systemd.log
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help understanding EC object reads

2019-08-29 Thread Thomas Byrne - UKRI STFC
Hi all,

I'm investigating an issue with our (non-Ceph) caching layers of our large EC 
cluster. It seems to be turning users requests for whole objects into lots of 
small byte range requests reaching the OSDs, but I'm not sure how inefficient 
this behaviour is in reality.

My limited understanding of an EC object partial read is that the entire object 
is reconstructed on the primary OSD, and then the requested byte range is sent 
to the client before the primary discards the reconstructed object.

Assuming this is correct, do multiple reads for different byte ranges of the 
same object at effectively the same time result in the entire object being 
reconstructed once for each request, or does the primary do something clever 
and use the same reconstructed object for multiple requests before discarding 
it?

If I'm completely off the mark with what is going on under the hood here, a 
nudge in the right direction would be appreciated!

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] No files in snapshot

2019-08-26 Thread Thomas Schneider

Hi,

I'm running Debian 10 with btrfs-progs=5.2.1.

Creating snapshots with snapper=0.8.2 works w/o errors.

However, I run into an issue and need to restore various files.

I thought that I could simply take the files from a snapshot created before.
However, the files required don't exist in any snapshot!

Therefore I have created a new snapshot manually to verify if the files 
will be included, but there's nothing.


These files are required:
root@ld5507:/usr/bin# ls -l /var/lib/ceph/osd/ceph-4/
total 56
-rw-r--r-- 1 root root 402 Jun  7 12:18 activate.monmap
-rw-r--r-- 1 ceph ceph   3 Jun  7 12:18 active
lrwxrwxrwx 1 ceph ceph  58 Jun  7 12:18 block -> 
/dev/disk/by-partuuid/3bc0c812-2c6b-4544-bbe7-e0444c3448eb

-rw-r--r-- 1 ceph ceph  37 Jun  7 12:18 block_uuid
-rw-r--r-- 1 ceph ceph   2 Jun  7 12:18 bluefs
-rw-r--r-- 1 ceph ceph  37 Jun  7 12:18 ceph_fsid
-rw-r--r-- 1 ceph ceph  37 Jun  7 12:18 fsid
-rw--- 1 ceph ceph  56 Jun  7 12:18 keyring
-rw-r--r-- 1 ceph ceph   8 Jun  7 12:18 kv_backend
-rw-r--r-- 1 ceph ceph  21 Jun  7 12:18 magic
-rw-r--r-- 1 ceph ceph   4 Jun  7 12:18 mkfs_done
-rw-r--r-- 1 ceph ceph   6 Jun  7 12:18 ready
-rw-r--r-- 1 ceph ceph   3 Aug 23 09:56 require_osd_release
-rw-r--r-- 1 ceph ceph   0 Aug 21 12:41 systemd
-rw-r--r-- 1 ceph ceph  10 Jun  7 12:18 type
-rw-r--r-- 1 ceph ceph   2 Jun  7 12:18 whoami

And there are no files in the latest snapshot:
root@ld5507:/usr/bin# ls -l 
/btrfs/@snapshots/158/snapshot/var/lib/ceph/osd/ceph-4/

total 0


Why is this?

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub start-time and end-time

2019-08-14 Thread Thomas Byrne - UKRI STFC
Hi Torben,

> Is it allowed to have the scrub period cross midnight ? eg have start time at 
> 22:00 and end time 07:00 next morning.

Yes, I think that's the way it is mostly used, primarily to reduce the 
scrub impact during waking/working hours.

> I assume that if you only configure the one of them - it still behaves as if 
> it is unconfigured ??

The begin and end hours default to 0 and 24 respectively, so setting only one 
still has an effect. E.g. setting the end hour to 6 will mean scrubbing runs from 
midnight to 6AM, and setting the start hour to 16 will run scrubs from 4PM to 
midnight.
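
For the crossing-midnight case from your question, that would be something like:

osd scrub begin hour = 22
osd scrub end hour = 7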

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW 4 MiB objects

2019-08-01 Thread Thomas Bennett
Hi Aleksey,

Thanks for the detailed breakdown!

We're currently using replication pools but will be testing ec pools soon
enough and this is a useful set of parameters to look at. Also, I had not
considered the bluestore parameters, thanks for pointing that out.
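
For our own notes, Aleksey's ec.k=5 example (quoted below) would translate into
roughly the following ceph.conf fragment — a sketch using his values, not
something we have tested yet; note that bluestore_min_alloc_size_hdd only
applies to newly created OSDs, as far as I know:

[global]
rgw max chunk size = 20971520                 # 20M
rgw obj stripe size = 20971520                # 20M
osd pool erasure code stripe unit = 262144    # 256k
bluestore min alloc size hdd = 1048576        # 1M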

Kind regards

On Wed, Jul 31, 2019 at 2:36 PM Aleksey Gutikov 
wrote:

> Hi Thomas,
>
> We did some investigations some time before and got several rules how to
> configure rgw and osd for big files stored on erasure-coded pool.
> Hope it will be useful.
> And if I have any mistakes, please let me know.
>
> S3 object saving pipeline:
>
> - S3 object is divided into multipart shards by client.
> - Rgw shards each multipart shard into rados objects of size
> rgw_obj_stripe_size.
> - Primary osd stripes rados object into ec stripes of width ==
> ec.k*profile.stripe_unit, ec code them and send units into secondary
> osds and write into object store (bluestore).
> - Each subobject of rados object has size == (rados object size)/k.
> - Then while writing into disk bluestore can divide rados subobject into
> extents of minimal size == bluestore_min_alloc_size_hdd.
>
> Next rules can save some space and iops:
>
> - rgw_multipart_min_part_size SHOULD be multiple of rgw_obj_stripe_size
> (client can use different value greater than)
> - MUST rgw_obj_stripe_size == rgw_max_chunk_size
> - ec stripe == osd_pool_erasure_code_stripe_unit or profile.stripe_unit
> - rgw_obj_stripe_size SHOULD be multiple of profile.stripe_unit*ec.k
> - bluestore_min_alloc_size_hdd MAY be equal to bluefs_alloc_size (to
> avoid fragmentation)
> - rgw_obj_stripe_size/ec.k SHOULD be multiple of
> bluestore_min_alloc_size_hdd
> - bluestore_min_alloc_size_hdd MAY be multiple of profile.stripe_unit
>
> For example, if ec.k=5:
>
> - rgw_multipart_min_part_size = rgw_obj_stripe_size = rgw_max_chunk_size
> = 20M
> - rados object size == 20M
> - profile.stripe_unit = 256k
> - rados subobject size == 4M, 16 ec stripe units (20M / 5)
> - bluestore_min_alloc_size_hdd = bluefs_alloc_size = 1M
> - rados subobject can be written in 4 extents each containing 4 ec
> stripe units
>
>
>
> On 30.07.19 17:35, Thomas Bennett wrote:
> > Hi,
> >
> > Does anyone out there use bigger than default values for
> > rgw_max_chunk_size and rgw_obj_stripe_size?
> >
> > I'm planning to set rgw_max_chunk_size and rgw_obj_stripe_size  to
> > 20MiB, as it suits our use case and from our testing we can't see any
> > obvious reason not to.
> >
> > Is there some convincing experience that we should stick with 4MiBs?
> >
> > Regards,
> > Tom
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> --
>
> Best regards!
> Aleksei Gutikov | Ceph storage engeneer
> synesis.ru | Minsk. BY
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thomas Bennett

Storage Engineer at SARAO
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to deal with slow requests related to OSD bugs

2019-08-01 Thread Thomas Bennett
Hi Xavier,

We have had OSDs backed with Samsung SSD 960 PRO 512GB nvmes which started
generating slow requests.

After running:

ceph osd tree up | grep nvme | awk '{print $4}' | xargs -P 10 -I _OSD sh -c
'BPS=$(ceph tell _OSD bench | jq -r .bytes_per_sec); MBPS=$(echo "scale=2;
$BPS/100" | bc -l); echo _OSD $MBPS MB/s' | sort -n -k 2 | column -t

I noticed that the data rate had dropped significantly on some of my NVMes
(some were down from ~1000 MB/s to ~300 MB/s). This pointed me to the fact
that the NVMes were not behaving as expected.

I thought it may be worth asking if you are perhaps seeing something similar.

Cheers,
Tom

On Wed, Jul 24, 2019 at 6:39 PM Xavier Trilla 
wrote:

> Hi,
>
>
>
> We had an strange issue while adding a new OSD to our Ceph Luminous 12.2.8
> cluster. Our cluster has > 300 OSDs based on SSDs and NVMe.
>
>
>
> After adding a new OSD to the Ceph cluster one of the already running OSDs
> started to give us slow queries warnings.
>
>
>
> We checked the OSD and it was working properly, nothing strange on the
> logs and also it has disk activity. Looks like it stopped serving requests
> just for one PG.
>
>
>
> Request were just piling up, and the number of slow queries was just
> growing constantly till we restarted the OSD (All our OSDs are running
> bluestore).
>
>
>
> We’ve been checking out everything in our setup, and everything is
> properly configured (This cluster has been running for >5 years and it
> hosts several thousand VMs.)
>
>
>
> Beyond finding the real source of the issue -I guess I’ll have to add more
> OSDs and if it happens again I could just dump the stats of the OSD (ceph
> daemon osd.X dump_historic_slow_ops) – what I would like to find is a way
> to protect the cluster from this kind of issues.
>
>
>
> I mean, in some scenarios OSDs just suicide -actually I fixed the issue
> just restarting the offending OSD- but how can we deal with this kind of
> situation? I’ve been checking around, but I could not find anything
> (Obviously we could set our monitoring software to restart any OSD which
> has more than N slow queries, but I find that a little bit too aggressive).
>
>
>
> Is there anything build in Ceph to deal with these situations? A OSD
> blocking queries in a RBD scenario is a big deal, as plenty of VMs will
> have disk timeouts which can lead to the VM just panicking.
>
>
>
> Thanks!
>
> Xavier
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thomas Bennett

Storage Engineer at SARAO
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW configuration parameters

2019-07-30 Thread Thomas Bennett
Hi Casey,

Thanks for your reply.

Just to make sure I understand correctly: would that only apply if the S3
object size for the put/get is a multiple of your rgw_max_chunk_size?

Kind regards,
Tom

On Tue, 30 Jul 2019 at 16:57, Casey Bodley  wrote:

> Hi Thomas,
>
> I see that you're familiar with rgw_max_chunk_size, which is the most
> object data that radosgw will write in a single osd request. Each PutObj
> and GetObj request will issue multiple osd requests in parallel, up to
> these configured window sizes. Raising these values can potentially
> improve throughput at the cost of increased memory usage.
>
> On 7/30/19 10:36 AM, Thomas Bennett wrote:
> > Does anyone know what these parameters are for. I'm not 100% sure I
> > understand what a window is in context of rgw objects:
> >
> >   * rgw_get_obj_window_size
> >   * rgw_put_obj_min_window_size
> >
> > The code points to throttling I/O. But some more info would be useful.
> >
> > Kind regards,
> > Tom
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
Thomas Bennett

Storage Engineer at SARAO
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW configuration parameters

2019-07-30 Thread Thomas Bennett
Does anyone know what these parameters are for? I'm not 100% sure I
understand what a window is in the context of rgw objects:

   - rgw_get_obj_window_size
   - rgw_put_obj_min_window_size

The code points to throttling I/O. But some more info would be useful.

Kind regards,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW 4 MiB objects

2019-07-30 Thread Thomas Bennett
Hi,

Does anyone out there use bigger than default values for rgw_max_chunk_size
and rgw_obj_stripe_size?

I'm planning to set rgw_max_chunk_size and rgw_obj_stripe_size  to 20MiB,
as it suits our use case and from our testing we can't see any obvious
reason not to.

Is there some convincing experience that we should stick with 4MiBs?

Regards,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Thomas Byrne - UKRI STFC
As a counterpoint, adding large amounts of new hardware gradually (or more 
specifically, in a few steps) has a few benefits IMO.

- Being able to pause the operation and confirm the new hardware (and cluster) 
is operating as expected. You can identify problems with hardware with OSDs at 
10% weight that would be much harder to notice during backfilling, and could 
cause performance issues to the cluster if they ended up with their full 
complement of PGs.

- Breaking up long backfills. For a full cluster with large OSDs, backfills can 
take weeks. I find that letting the mon stores compact, and getting the cluster 
back to health OK is good for my sanity and gives a good stopping point to work 
on other cluster issues. This obviously depends on the cluster fullness and OSD 
size.

I still aim for the smallest amount of steps/work, but an initial crush 
weighting of 10-25% of final weight is a good sanity check of the new hardware, 
and gives a good indication of how to approach the rest of the backfill.

Cheers,
Tom

From: ceph-users  On Behalf Of Paul Emmerich
Sent: 24 July 2019 20:06
To: Reed Dier 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to add 100 new OSDs...

+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really necessary in 
newer releases of Ceph.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:59 PM Reed Dier <reed.d...@focusvq.com> wrote:
Just chiming in to say that this too has been my preferred method for adding 
[large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.
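
In command form, that sequence is roughly (a sketch):

ceph osd set norebalance
ceph osd set nobackfill
# create all the OSDs, verify, check max_backfills / recovery_max_active, wait for peering
ceph osd unset nobackfill
ceph osd unset norebalance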

Reed



That works, but with newer releases I've been doing this:

- Make sure cluster is HEALTH_OK
- Set the 'norebalance' flag (and usually nobackfill)
- Add all the OSDs
- Wait for the PGs to peer. I usually wait a few minutes
- Remove the norebalance and nobackfill flag
- Wait for HEALTH_OK

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-25 Thread Thomas Byrne - UKRI STFC
I hadn't tried manual compaction, but it did the trick. The db shrunk down to 
50MB and the OSD booted instantly. Thanks!
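
For reference, the compaction can be triggered either online via the admin
socket or offline with the kvstore tool; exact availability may depend on the
release, so treat this as a sketch:

ceph daemon osd.<id> compact
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact   # with the OSD stopped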

I'm confused as to why the OSDs weren't doing this themselves, especially as 
the operation only took a few seconds. But for now I'm happy that this is easy 
to rectify if we run into it again.

I've uploaded the log of a slow boot with debug_bluestore turned up [1], and I 
can provide other logs/files if anyone thinks they could be useful.

Cheers,
Tom
 
[1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446

-Original Message-
From: Gregory Farnum  
Sent: 24 June 2019 17:30
To: Byrne, Thomas (STFC,RAL,SC) 
Cc: ceph-users 
Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC  
wrote:
>
> Hi all,
>
>
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing for hardware destined for our 
> production cluster, so have had a number of pools on them with many, many 
> objects in the past. All these pools have since been deleted.
>
>
>
> When booting the OSDs, they spend a few minutes *per PG* in 
> clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> read and all available IOPS consumed. The OSD will finish booting and come up 
> fine, but will then start hammering the disk again and fall over at some 
> point later, causing the cluster to gradually fall apart. I'm guessing 
> something is 'not optimal' in the rocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
> old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
> problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
> 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
> OSD, but it's also the first time I've looked into this so may be normal?
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I see this being 
reported was for FileStore-on-ZFS and it was just very slow at metadata 
scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme OSD 
Boot Time") There has been at least one PR about object listings being slow in 
BlueStore when there are a lot of deleted objects, which would match up with 
your many deleted pools/objects.

If you have any debug logs the BlueStore devs might be interested in them to 
check if the most recent patches will fix it.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Thomas Byrne - UKRI STFC
Hi all,



Some bluestore OSDs in our Luminous test cluster have started becoming 
unresponsive and booting very slowly.



These OSDs have been used for stress testing for hardware destined for our 
production cluster, so have had a number of pools on them with many, many 
objects in the past. All these pools have since been deleted.



When booting the OSDs, they spend a few minutes *per PG* in clear_temp_objects 
function, even for brand new, empty PGs. The OSD is hammering the disk during 
the clear_temp_objects, with a constant ~30MB/s read and all available IOPS 
consumed. The OSD will finish booting and come up fine, but will then start 
hammering the disk again and fall over at some point later, causing the cluster 
to gradually fall apart. I'm guessing something is 'not optimal' in the rocksDB.



Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
quickly and stay up, but creating a pool will cause OSDs that get even a single 
PG to start exhibiting this behaviour again.



These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
OSD, but it's also the first time I've looked into this so may be normal?



Destroying and recreating an OSD resolves the issue for that OSD, which is 
acceptable for this cluster, but I'm a little concerned a similar thing could 
happen on a production cluster. Ideally, I would like to try and understand 
what has happened before recreating the problematic OSDs.



Has anyone got any thoughts on what might have happened, or tips on how to dig 
further into this?


Cheers,
Tom

Tom Byrne
Storage System Administrator
Scientific Computing Department
Science and Technology Facilities Council
Rutherford Appleton Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NFS-Ganesha Mounts as a Read-Only Filesystem

2019-04-06 Thread thomas
Hi all,

 

I have recently set up a Ceph cluster and, on request, am using CephFS (MDS
version: ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988)
mimic (stable)) as a backend for NFS-Ganesha. I have successfully tested a
direct CephFS mount to read/write files; however, I'm perplexed as to why
the NFS mount is read-only despite setting the RW flags.

 

[root@mon02 mnt]# touch cephfs/test.txt
touch: cannot touch ‘cephfs/test.txt’: Read-only file system

Configuration of Ganesha is below:

NFS_CORE_PARAM
{
  Enable_NLM = false;
  Enable_RQUOTA = false;
  Protocols = 4;
}

NFSv4
{
  Delegations = true;
  RecoveryBackend = rados_ng;
  Minor_Versions = 1,2;
}

CACHEINODE {
  Dir_Chunk = 0;
  NParts = 1;
  Cache_Size = 1;
}

EXPORT
{
  Export_ID = 15;
  Path = "/";
  Pseudo = "/cephfs/";
  Access_Type = RW;
  NFS_Protocols = "4";
  Squash = No_Root_Squash;
  Transport_Protocols = TCP;
  SecType = "none";
  Attr_Expiration_Time = 0;
  Delegations = R;

  FSAL {
    Name = CEPH;
    User_Id = "ganesha";
    Filesystem = "cephfs";
    Secret_Access_Key = "";
  }
}

Provided mount parameters:

mount -t nfs -o nfsvers=4.1,proto=tcp,rw,noatime,sync 172.16.32.15:/ /mnt/cephfs

 

I have tried stripping much of the config and altering mount options, but so
far I have been completely unable to decipher the cause. It also seems I'm not
the only one who has been caught out by this:

 

https://www.spinics.net/lists/ceph-devel/msg41201.html
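
One thing that might be worth checking (an assumption on my part, it is not
visible from the config above) is whether the cephx user "ganesha" has write
caps on the CephFS data pool and MDS. On Mimic the caps can be (re)generated
with something like:

ceph fs authorize cephfs client.ganesha / rw

and the resulting key then goes into Secret_Access_Key.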

 

Thanks in advance,

 

Thomas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd unmap fails with error: rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy

2019-02-27 Thread Thomas
Hi,
I have noticed an error when writing to a mapped RBD.
Therefore I unmounted the block device.
Then I tried to unmap it w/o success:
ld2110:~ # rbd unmap /dev/rbd0
rbd: sysfs write failed
rbd: unmap failed: (16) Device or resource busy

The same block device is mapped on another client and there are no issues:
root@ld4257:~# rbd info hdb-backup/ld2110
rbd image 'ld2110':
    size 7.81TiB in 2048000 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.3cda0d6b8b4567
    format: 2
    features: layering
    flags:
    create_timestamp: Fri Feb 15 10:53:50 2019
root@ld4257:~# rados -p hdb-backup  listwatchers rbd_data.3cda0d6b8b4567
error listing watchers hdb-backup/rbd_data.3cda0d6b8b4567: (2) No such
file or directory
root@ld4257:~# rados -p hdb-backup  listwatchers rbd_header.3cda0d6b8b4567
watcher=10.76.177.185:0/1144812735 client.21865052 cookie=1
watcher=10.97.206.97:0/4023931980 client.18484780
cookie=18446462598732841027


Question:
How can I force to unmap the RBD on client ld2110 (= 10.76.177.185)?
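
If nothing on ld2110 still holds the device open — which can be cross-checked
with e.g. "fuser -v /dev/rbd0" — a forced unmap can be attempted with:

rbd unmap -o force /dev/rbd0

assuming the kernel and rbd tooling on that client support the force option.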

THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modify ceph.mon network required

2019-01-25 Thread Thomas
Thanks.

This procedure works very well.



On 25.01.2019 at 14:24, Janne Johansson wrote:
> Den fre 25 jan. 2019 kl 09:52 skrev cmonty14 <74cmo...@gmail.com>:
>> Hi,
>> I have identified a major issue with my cluster setup consisting of 3 nodes:
>> all monitors are connected to cluster network.
>>
>> Question:
>> How can I modify the network configuration of mon?
>>
>> It's not working to simply change the parameters in ceph.conf because
>> then the quorum fails.
>>
> Look at the docs for Adding and Removing Mons and then you remove one,
> and add it at the new address, and repeat for the other mons.
>
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Solved] Creating a block device user with restricted access to image

2019-01-25 Thread Thomas
Update:

I have identified the root cause: the user caps are not correct.
Erroneous caps:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs"

Correct caps:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow rwx pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs"
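
The caps can be updated in place with "ceph auth caps"; a sketch using the
values above:

ceph auth caps client.gbsadm mon 'allow r' osd 'allow rwx pool backup object_prefix rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs'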

The error was caused by a copy & paste mistake on my side; Eugen's
instructions are 100% correct!

Thanks for your great support!!!

Maybe another question related to this topic:
If I write a backup into an RBD, will Ceph use a single IO stream or
multiple IO streams on the storage side?


Regards
Thomas


-


Hi,

unfortunately it's not working, yet.

I have modified user gbsadm:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs"

But mapping fails with same error:
ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 13:19:29.158211 7fc629ffb700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 13:19:29.158476 7fc6297fa700 -1 librbd::ImageState:
0x55b623a91f70 failed to open image: (1) Operation not permitted
rbd: error opening image gbs: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted


Regards
Thomas

On 25.01.2019 at 12:31, Eugen Block wrote:
> You can check all objects of that pool to see if your caps match:
>
> rados -p backup ls | grep rbd_id
>
>
>> Quoting Eugen Block :
>
>>> caps osd = "allow pool backup object_prefix
>>> rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
>>> rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
>>> rbd_id.rbd-image"
>>
>> I think your caps are not entirely correct, the part "[...]
>> object_prefix rbd_id.rbd-image" should contain the
>> actual image name, so in your case it should be "[...] rbd_id.gbs".
>>
>> Regards,
>> Eugen
>>
>>> Quoting Thomas <74cmo...@gmail.com>:
>>
>>> Thanks.
>>>
>>> Unfortunately this is still not working.
>>>
>>> Here's the info of my image:
>>> root@ld4257:/etc/ceph# rbd info backup/gbs
>>> rbd image 'gbs':
>>>     size 500GiB in 128000 objects
>>>     order 22 (4MiB objects)
>>>     block_name_prefix: rbd_data.18102d6b8b4567
>>>     format: 2
>>>     features: layering
>>>     flags:
>>>     create_timestamp: Thu Jan 24 16:01:55 2019
>>>
>>> And here's the user caps ouput:
>>> root@ld4257:/etc/ceph# ceph auth get client.gbsadm
>>> exported keyring for client.gbsadm
>>> [client.gbsadm]
>>>     key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
>>>     caps mon = "allow r"
>>>     caps osd = "allow pool backup object_prefix
>>> rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
>>> rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
>>> rbd_id.rbd-image"
>>>
>>>
>>> Trying to map rbd "backup/gbs" now fails with this error; this
>>> operation
>>> should be permitted:
>>> ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
>>> /etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
>>> rbd: sysfs write failed
>>> 2019-01-25 12:15:19.786724 7fe4357fa700 -1 librbd::image::OpenRequest:
>>> failed to stat v2 image header: (1) Operation not permitted
>>> 2019-01-25 12:15:19.786962 7fe434ff9700 -1 librbd::ImageState:
>>> 0x55b6522177f0 failed to open image: (1) Operation not permitted
>>> rbd: e

Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Thomas
Hi,

unfortunately it's not working, yet.

I have modified user gbsadm:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix rbd_id.gbs"

But mapping fails with same error:
ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 13:19:29.158211 7fc629ffb700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 13:19:29.158476 7fc6297fa700 -1 librbd::ImageState:
0x55b623a91f70 failed to open image: (1) Operation not permitted
rbd: error opening image gbs: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted


Regards
Thomas

Am 25.01.2019 um 12:31 schrieb Eugen Block:
> You can check all objects of that pool to see if your caps match:
>
> rados -p backup ls | grep rbd_id
>
>
> Zitat von Eugen Block :
>
>>> caps osd = "allow pool backup object_prefix
>>> rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
>>> rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
>>> rbd_id.rbd-image"
>>
>> I think your caps are not entirely correct, the part "[...]
>> object_prefix rbd_id.rbd-image" should contain the
>> actual image name, so in your case it should be "[...] rbd_id.gbs".
>>
>> Regards,
>> Eugen
>>
>> Zitat von Thomas <74cmo...@gmail.com>:
>>
>>> Thanks.
>>>
>>> Unfortunately this is still not working.
>>>
>>> Here's the info of my image:
>>> root@ld4257:/etc/ceph# rbd info backup/gbs
>>> rbd image 'gbs':
>>>     size 500GiB in 128000 objects
>>>     order 22 (4MiB objects)
>>>     block_name_prefix: rbd_data.18102d6b8b4567
>>>     format: 2
>>>     features: layering
>>>     flags:
>>>     create_timestamp: Thu Jan 24 16:01:55 2019
>>>
>>> And here's the user caps output:
>>> root@ld4257:/etc/ceph# ceph auth get client.gbsadm
>>> exported keyring for client.gbsadm
>>> [client.gbsadm]
>>>     key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
>>>     caps mon = "allow r"
>>>     caps osd = "allow pool backup object_prefix
>>> rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
>>> rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
>>> rbd_id.rbd-image"
>>>
>>>
>>> Trying to map rbd "backup/gbs" now fails with this error; this
>>> operation
>>> should be permitted:
>>> ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
>>> /etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
>>> rbd: sysfs write failed
>>> 2019-01-25 12:15:19.786724 7fe4357fa700 -1 librbd::image::OpenRequest:
>>> failed to stat v2 image header: (1) Operation not permitted
>>> 2019-01-25 12:15:19.786962 7fe434ff9700 -1 librbd::ImageState:
>>> 0x55b6522177f0 failed to open image: (1) Operation not permitted
>>> rbd: error opening image gbs: (1) Operation not permitted
>>> In some cases useful info is found in syslog - try "dmesg | tail".
>>> rbd: map failed: (1) Operation not permitted
>>>
>>> The same error is shown when I try to map rbd "backup/isa"; this
>>> operation must be prohibited:
>>> ld7581:/etc/ceph # rbd map backup/isa --user gbsadm -k
>>> /etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
>>> rbd: sysfs write failed
>>> 2019-01-25 12:15:04.850151 7f8041ffb700 -1 librbd::image::OpenRequest:
>>> failed to stat v2 image header: (1) Operation not permitted
>>> 2019-01-25 12:15:04.850350 7f80417fa700 -1 librbd::ImageState:
>>> 0x5643668a5700 failed to open image: (1) Operation not permitted
>>> rbd: error opening image isa: (1) Operation not permitted
>>> In some cases useful info is found in syslog - try "dmesg | tail".
>>> rbd: map failed: (1) Operation not permitted
>>>
>>>
>>> Regards
>>> Thomas
>>>
>>> Am 25.01.2019 um 11:52 schrieb Eugen Block:
>>>> osd 'allow rwx
>>>> pool  object_prefix rbd_data.2b36cf238e1f29; allow rwx pool
>>>> 
>>>> object_prefix rbd_header.2b36cf238e1f29
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating a block device user with restricted access to image

2019-01-25 Thread Thomas
Thanks.

Unfortunately this is still not working.

Here's the info of my image:
root@ld4257:/etc/ceph# rbd info backup/gbs
rbd image 'gbs':
    size 500GiB in 128000 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.18102d6b8b4567
    format: 2
    features: layering
    flags:
    create_timestamp: Thu Jan 24 16:01:55 2019

And here's the user caps output:
root@ld4257:/etc/ceph# ceph auth get client.gbsadm
exported keyring for client.gbsadm
[client.gbsadm]
    key = AQBd0klcFknvMRAAwuu30bNG7L7PHk5d8cSVvg==
    caps mon = "allow r"
    caps osd = "allow pool backup object_prefix
rbd_data.18102d6b8b4567; allow rwx pool backup object_prefix
rbd_header.18102d6b8b4567; allow rx pool backup object_prefix
rbd_id.rbd-image"


Trying to map rbd "backup/gbs" now fails with this error; this operation
should be permitted:
ld7581:/etc/ceph # rbd map backup/gbs --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:19.786724 7fe4357fa700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:19.786962 7fe434ff9700 -1 librbd::ImageState:
0x55b6522177f0 failed to open image: (1) Operation not permitted
rbd: error opening image gbs: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted

The same error is shown when I try to map rbd "backup/isa"; this
operation must be prohibited:
ld7581:/etc/ceph # rbd map backup/isa --user gbsadm -k
/etc/ceph/ceph.client.gbsadm.keyring -c /etc/ceph/ceph.conf
rbd: sysfs write failed
2019-01-25 12:15:04.850151 7f8041ffb700 -1 librbd::image::OpenRequest:
failed to stat v2 image header: (1) Operation not permitted
2019-01-25 12:15:04.850350 7f80417fa700 -1 librbd::ImageState:
0x5643668a5700 failed to open image: (1) Operation not permitted
rbd: error opening image isa: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted


Regards
Thomas

Am 25.01.2019 um 11:52 schrieb Eugen Block:
> osd 'allow rwx
> pool  object_prefix rbd_data.2b36cf238e1f29; allow rwx pool 
> object_prefix rbd_header.2b36cf238e1f29


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-21 Thread Thomas
Hi,
 
my use case for Ceph is serving as a central backup storage.
This means I will back up multiple databases in the Ceph storage cluster.
 
This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, meaning one pool per database?
Or should I create a single pool "backup" and use namespaces when writing
data into the pool?
 
This is the security requirement that should be considered:
DB owner A can only modify the files that belong to A; other files
(owned by B, C or D) must not be accessible to A.
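(If namespaces turn out to be the way to go, I imagine the per-client
isolation would look roughly like the following - the client and namespace
names are only placeholders, and the exact cap syntax should be checked
against your release:

ceph auth get-or-create client.db-a mon 'allow r' osd 'allow rw pool=backup namespace=db-a'
ceph auth get-or-create client.db-b mon 'allow r' osd 'allow rw pool=backup namespace=db-b'
)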

And there's another issue:
How can I identify a backup created by client A that I want to restore
on another client Z?
I mean typically client A would write a backup file identified by the
filename.
Would it be possible on client Z to identify this backup file by
filename? If yes, how?
 
 
THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practice creating pools / rbd images

2019-01-15 Thread Thomas
Hi,
 
my use case for Ceph is serving as a central backup storage.
This means I will back up multiple databases in the Ceph storage cluster.
 
This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, meaning one pool per database?
Or should I create a single pool "backup" and multiple rbd images, meaning
one rbd image per database?
Or should I create a single pool "backup" and single rbd image "db"?
 
This is the security requirement that should be considered:
DB owner A can only modify the files that belong to A; other files
(owned by B, C or D) must not be accessible to A.
 
 
THX
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-08 Thread Thomas Byrne - UKRI STFC
For what it's worth, I think the behaviour Pardhiv and Bryan are describing is 
not quite normal, and sounds similar to something we see on our large luminous 
cluster with elderly (created as jewel?) monitors. After large operations which 
result in the mon stores growing to 20GB+, leaving the cluster with all PGs 
active+clean for days/weeks will usually not result in compaction, and the 
store sizes will slowly grow. 

I've played around with restarting monitors with and without 
mon_compact_on_start set, and using 'ceph tell mon.[id] compact'. For this 
cluster, I found the most reliable way to trigger a compaction was to restart 
all monitors daemons, one at a time, *without* compact_on_start set. The stores 
rapidly compact down to ~1GB in a minute or less after the last mon restarts.
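On a systemd deployment that restart is simply, per monitor host (one at a
time, waiting for the mon to rejoin quorum before moving on, and with
mon_compact_on_start left unset):

systemctl restart ceph-mon@$(hostname -s)

with 'ceph tell mon.[id] compact' as the explicit alternative mentioned above.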

It's worth noting that occasionally (1 out of every 10 times, or fewer) the 
stores will compact without prompting after all PGs become active+clean. 

I haven't put much time into this as I am planning on reinstalling the monitors 
to get rocksDB mon stores. If the problem persists with the new monitors I'll 
have another look at it.

Cheers
Tom

> -Original Message-
> From: ceph-users  On Behalf Of Wido
> den Hollander
> Sent: 08 January 2019 08:28
> To: Pardhiv Karri ; Bryan Stillwell
> 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Is it possible to increase Ceph Mon store?
> 
> 
> 
> On 1/7/19 11:15 PM, Pardhiv Karri wrote:
> > Thank you Bryan, for the information. We have 816 OSDs of size 2TB each.
> > The mon store too big popped up when no rebalancing happened in that
> > month. It is slightly above the 15360 threshold around 15900 or 16100
> > and stayed there for more than a week. We ran the "ceph tell mon.[ID]
> > compact" to get it back earlier this week. Currently the mon store is
> > around 12G on each monitor. If it doesn't grow then I won't change the
> > value but if it grows and gives the warning then I will increase it
> > using "mon_data_size_warn".
> >
> 
> This is normal. The MONs will keep a history of OSDMaps if one or more PGs
> are not active+clean
> 
> They will trim after all the PGs are clean again, nothing to worry about.
> 
> You can increase the setting for the warning, but that will not shrink the
> database.
> 
> Just make sure your monitors have enough free space.
> 
> Wido
> 
> > Thanks,
> > Pardhiv Karri
> >
> >
> >
> > On Mon, Jan 7, 2019 at 1:55 PM Bryan Stillwell  > > wrote:
> >
> > I believe the option you're looking for is mon_data_size_warn.  The
> > default is set to 16106127360.
> >
> > I've found that sometimes the mons need a little help getting
> > started with trimming if you just completed a large expansion.
> > Earlier today I had a cluster where the mon's data directory was
> > over 40GB on all the mons.  When I restarted them one at a time with
> > 'mon_compact_on_start = true' set in the '[mon]' section of
> > ceph.conf, they stayed around 40GB in size.   However, when I was
> > about to hit send on an email to the list about this very topic, the
> > warning cleared up and now the data directory is now between 1-3GB
> > on each of the mons.  This was on a cluster with >1900 OSDs.
> >
> > Bryan
> >
> > *From:* ceph-users on behalf of Pardhiv Karri <meher4in...@gmail.com>
> > *Date:* Monday, January 7, 2019 at 11:08 AM
> > *To:* ceph-users
> > *Subject:* [ceph-users] Is it possible to increase Ceph Mon store?
> >
> > Hi,
> >
> > We have a large Ceph cluster (Hammer version). We recently saw its
> > mon store growing too big > 15GB on all 3 monitors without any
> > rebalancing happening for quiet sometime. We have compacted the DB
> > using  "#ceph tell mon.[ID] compact" for now. But is there a way to
> > increase the size of the mon store to 32GB or something to avoid
> > getting the Ceph health to warning state due to Mon store growing
> > too big?
> >
> > --
> > Thanks,
> > *Pardhiv Karri*
> >
> >
> > --
> > *Pardhiv Karri*
> > "Rise and Rise again untilLAMBSbecome LIONS"
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Thomas Byrne - UKRI STFC
>   In previous versions of Ceph, I was able to determine which PGs had
> scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> provided that they were not already being scrubbed. In Luminous, the bad
> PG is not visible in "ceph --status" anywhere. Should I use something like
> "ceph health detail -f json-pretty" instead?

'ceph pg ls inconsistent' lists all inconsistent PGs.
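For a cron-style check, something along these lines should work (the exact
output columns vary a little between releases):

ceph pg ls inconsistent
rados list-inconsistent-obj <pgid> --format=json-pretty
ceph pg repair <pgid>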

>   Also, is it possible to configure Ceph to attempt repairing the bad PGs
> itself, as soon as the scrub fails? I run most of my OSDs on top of a bunch of
> old spinning disks, and a scrub error almost always means that there is a bad
> sector somewhere, which can easily be fixed by rewriting the lost data using
> "ceph pg repair".

I don't know of a good way to repair inconsistencies automatically from within 
Ceph. However, I seem to remember someone saying that with BlueStore OSDs, read 
errors are attempted to be fixed (by rewriting the unreadable replica/shard) 
when they are discovered during client reads. And there was a potential plan to 
do the same if they are discovered during scrubbing. I can't remember the 
details (this was a while ago, at Cephalocon APAC), so I may be completely off 
the mark here. 

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Thomas Byrne - UKRI STFC
I recently spent some time looking at this, I believe the 'summary' and 
'overall_status' sections are now deprecated. The 'status' and 'checks' fields 
are the ones to use now.

The 'status' field gives you the OK/WARN/ERR, but returning the most severe 
error condition from the 'checks' section is less trivial. AFAIK all 
health_warn states are treated as equally severe, and same for health_err. We 
ended up formatting our single line human readable output as something like:

"HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 
large omap objects"

To make it obvious which check is causing which state. We needed to suppress 
specific checks for callouts, so had to look at each check and the resulting 
state. If you're not trying to do something similar there may be a more 
lightweight way to go about it.
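If jq is available, a rough sketch of that (assuming the luminous layout,
where 'checks' maps each check name to its severity and summary) is:

ceph health -f json | jq -r .status
ceph health detail -f json | jq -r '.checks | to_entries[] | "\(.value.severity): \(.value.summary.message)"'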

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of Jan
> Kasprzak
> Sent: 02 January 2019 09:29
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] ceph health JSON format has changed sync?
> 
>   Hello, Ceph users,
> 
> I am afraid the following question is a FAQ, but I still was not able to find 
> the
> answer:
> 
> I use ceph --status --format=json-pretty as a source of CEPH status for my
> Nagios monitoring. After upgrading to Luminous, I see the following in the
> JSON output when the cluster is not healthy:
> 
> "summary": [
> {
> "severity": "HEALTH_WARN",
> "summary": "'ceph health' JSON format has changed in 
> luminous. If
> you see this your monitoring system is scraping the wrong fields. Disable this
> with 'mon health preluminous compat warning = false'"
> }
> ],
> 
> Apart from that, the JSON data seems reasonable. My question is which part
> of JSON structure are the "wrong fields" I have to avoid. Is it just the
> "summary" section, or some other parts as well? Or should I avoid the whole
> ceph --status and use something different instead?
> 
> What I want is a single machine-readable value with OK/WARNING/ERROR
> meaning, and a single human-readable text line, describing the most severe
> error condition which is currently present. What is the preferred way to get
> this data in Luminous?
> 
>   Thanks,
> 
> -Yenya
> 
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google  
> the
> symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2019-01-02 Thread Thomas Byrne - UKRI STFC
Assuming I understand it correctly:

"pg_upmap_items 6.0 [40,20]" refers to replacing (upmapping?) osd.40 with 
osd.20 in the acting set of the placement group '6.0'. Assuming it's a 3 
replica PG, the other two OSDs in the set remain unchanged from the CRUSH 
calculation.

"pg_upmap_items 6.6 [45,46,59,56]" describes two upmap replacements for the PG 
6.6, replacing 45 with 46, and 59 with 56.
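You can cross-check any entry against the actual mapping, e.g. for the PG
above (the second command is only there in case you ever want to drop the
exception again):

ceph pg map 6.6                  # shows the up and acting sets with the upmap applied
ceph osd rm-pg-upmap-items 6.6   # removes that upmap exception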

Hope that helps.

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of
> jes...@krogh.cc
> Sent: 30 December 2018 22:04
> To: Konstantin Shalygin 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> 
> >> I would still like to have a log somewhere to grep and inspect what
> >> balancer/upmap actually does - when in my cluster. Or some ceph
> >> commands that deliveres some monitoring capabilityes .. any
> >> suggestions?
> > Yes, on ceph-mgr log, when log level is DEBUG.
> 
> Tried the docs .. something like:
> 
> ceph tell mds ... does not seem to work.
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
> 
> > You can get your cluster upmap's in via `ceph osd dump | grep upmap`.
> 
> Got it -- but I really need the README .. it shows the map ..
> ...
> pg_upmap_items 6.0 [40,20]
> pg_upmap_items 6.1 [59,57,47,48]
> pg_upmap_items 6.2 [59,55,75,9]
> pg_upmap_items 6.3 [22,13,40,39]
> pg_upmap_items 6.4 [23,9]
> pg_upmap_items 6.5 [25,17]
> pg_upmap_items 6.6 [45,46,59,56]
> pg_upmap_items 6.8 [60,54,16,68]
> pg_upmap_items 6.9 [61,69]
> pg_upmap_items 6.a [51,48]
> pg_upmap_items 6.b [43,71,41,29]
> pg_upmap_items 6.c [22,13]
> 
> ..
> 
> But .. I dont have any pg's that should only have 2 replicas.. neither any 
> with 4
> .. how should this be interpreted?
> 
> Thanks.
> 
> --
> Jesper
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Some pgs stuck unclean in active+remapped state

2018-11-19 Thread Thomas Klute
Hi,

we have a production cluster (3 nodes) stuck unclean after we had to
replace one osd.
Cluster recovered fine except some pgs that are stuck unclean for about
2-3 days now:

[root@ceph1 ~]# ceph health detail
HEALTH_WARN 7 pgs stuck unclean; recovery 8/8565617 objects degraded
(0.000%); recovery 38790/8565617 objects misplaced (0.453%)
pg 3.19 is stuck unclean for 324141.349243, current state
active+remapped, last acting [8,1,12]
pg 3.17f is stuck unclean for 324093.413743, current state
active+remapped, last acting [7,10,14]
pg 3.15e is stuck unclean for 324072.637573, current state
active+remapped, last acting [9,11,12]
pg 3.1cc is stuck unclean for 324141.437666, current state
active+remapped, last acting [6,4,9]
pg 3.47 is stuck unclean for 324014.795713, current state
active+remapped, last acting [4,7,14]
pg 3.1d6 is stuck unclean for 324019.903078, current state
active+remapped, last acting [8,0,4]
pg 3.83 is stuck unclean for 324024.970570, current state
active+remapped, last acting [5,11,13]
recovery 8/8565617 objects degraded (0.000%)
recovery 38790/8565617 objects misplaced (0.453%)

Grep on pg dump shows:
[root@ceph1 ~]# fgrep remapp /tmp/pgdump.txt
3.83    5423    0   0   5423    0   22046870528 3065   
3065    active+remapped 2018-11-16 04:08:22.365825  85711'8469810  
85711:8067280   [5,11]  5   [5,11,13]   5   83827'8450839  
2018-11-14 14:01:20.330322   81079'8422114   2018-11-11 05:10:57.628147
3.47    5487    0   0   5487    0   22364503552 3010   
3010    active+remapped 2018-11-15 18:24:24.047889  85711'9511787  
85711:9975900   [4,7]   4   [4,7,14]    4   84165'9471676  
2018-11-14 23:46:23.149867   80988'9434392   2018-11-11 02:00:23.427834
3.1d6   5567    0   2   5567    0   22652505618 3093   
3093    active+remapped 2018-11-16 23:26:06.136037  85711'6730858  
85711:6042914   [8,0]   8   [8,0,4] 8   83682'6673939  
2018-11-14 09:15:37.810103  80664'6608489    2018-11-09 09:21:00.431783
3.1cc   5656    0   0   5656    0   22988533760 3088   
3088    active+remapped 2018-11-17 09:18:42.263108  85711'9795820  
85711:8040672   [6,4]   6   [6,4,9] 6   80670'9756755  
2018-11-10 13:07:35.097811  80664'9742234    2018-11-09 04:33:10.497507
3.15e   5564    0   6   5564    0   22675107328 3007   
3007    active+remapped 2018-11-17 02:47:44.282884  85711'9000186  
85711:8021053   [9,11]  9   [9,11,12]   9   83502'8957026  
2018-11-14 03:31:18.592781   80664'8920925   2018-11-09 22:15:54.478402
3.17f   5601    0   0   5601    0   22861908480 3077   
3077    active+remapped 2018-11-17 01:16:34.016231  85711'31880220 
85711:30659045  [7,10]  7   [7,10,14]   7   83668'31705772 
2018-11-14 08:35:10.952368   80664'31649045  2018-11-09 04:40:28.644421
3.19    5492    0   0   5492    0   22460691985 3016   
3016    active+remapped 2018-11-15 18:54:32.268758  85711'16782496 
85711:15483621  [8,1]   8   [8,1,12]    8   84542'16774356 
2018-11-15 09:40:41.713627   82163'16760520  2018-11-12 13:13:29.764191

We are running Jewel (10.2.11) on CentOS 7:

rpm -qa |grep ceph
ceph-radosgw-10.2.11-0.el7.x86_64
libcephfs1-10.2.11-0.el7.x86_64
ceph-mds-10.2.11-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-common-10.2.11-0.el7.x86_64
ceph-selinux-10.2.11-0.el7.x86_64
python-cephfs-10.2.11-0.el7.x86_64
ceph-base-10.2.11-0.el7.x86_64
ceph-osd-10.2.11-0.el7.x86_64
ceph-mon-10.2.11-0.el7.x86_64
ceph-deploy-1.5.39-0.noarch
ceph-10.2.11-0.el7.x86_64

Could please someone help how to proceed?

Thanks and kind regards,
Thomas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.9 release

2018-11-07 Thread Thomas White
One of the Ceph clusters my team manages is on 12.2.9 in a Proxmox
environment and seems to be running fine with simple 3x replication and RBD.
It would be interesting to know what issues have been encountered so far. All
our OSDs are simple filestore at present and our path to 12.2.9 was 10.2.7
-> 10.2.10 -> 12.2.9 in the past 2 weeks with no issues.

That said, it is disappointing these packages are making their way into
repositories without the proper announcements for an LTS release, especially
given this is enterprise orientated software.

Thomas

-Original Message-
From: ceph-users  On Behalf Of Simon
Ironside
Sent: 07 November 2018 13:58
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph 12.2.9 release



On 07/11/2018 10:59, Konstantin Shalygin wrote:
>> I wonder if there is any release announcement for ceph 12.2.9 that I
missed.
>> I just found the new packages on download.ceph.com, is this an 
>> official release?
> 
> This is because 12.2.9 have a several bugs. You should avoid to use 
> this release and wait for 12.2.10

Argh! What's it doing in the repos then?? I've just upgraded to it!
What are the bugs? Is there a thread about them?

Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Ceph Meetup Cape Town

2018-10-30 Thread Thomas Bennett
Hi,

SARAO <http://www.ska.ac.za/> is excited to announce that it will be
hosting a Ceph Meetup in Cape Town.

Date: Wednesday 28'th November
Time: 5pm to 8pm
Venue: Workshop 17 <https://goo.gl/maps/UKKsn1aGZ2n> at the V Waterfront

Space is limited, so if you would like to attend, please complete the
following form to register: https://goo.gl/forms/imuP47iCYssNMqHA2

Kind regards,
SARAO storage team

-- 
Thomas Bennett

SARAO
Science Data Processing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] getattr - failed to rdlock waiting

2018-10-02 Thread Thomas Sumpter
Hi Folks,

I am looking for advice on how to troubleshoot some long operations found in 
the MDS. Most of the time performance is fantastic, but occasionally, with no real 
pattern or trend, a getattr op will take up to ~30 seconds to complete in the MDS, 
stuck on "event": "failed to rdlock, waiting".

E.g.
"description": "client_request(client.84183:54794012 getattr pAsLsXsFs 
#0x1038585 2018-10-02 07:56:27.554282 caller_uid=48, caller_gid=48{})",
"duration": 28.987992,
{
"time": "2018-09-25 07:56:27.552511",
"event": "failed to rdlock, waiting"
},
{
"time": "2018-09-25 07:56:56.529748",
"event": "failed to rdlock, waiting"
},
{
"time": "2018-09-25 07:56:56.540386",
"event": "acquired locks"
}

I can find no corresponding long op on any of the OSDs and no other op in MDS 
which this one could be waiting for.
Nearly all configuration will be the default. Currently have a small amount of 
data which is constantly being updated. 1 data pool and 1 metadata pool.
How can I track down what is holding up this op and try to stop it happening?
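So far I have only looked at the op itself; I assume the next step is the MDS
admin socket, to see the in-flight ops and which client sessions hold the
conflicting caps, i.e. something like:

ceph daemon mds.<name> dump_ops_in_flight
ceph daemon mds.<name> dump_historic_ops
ceph daemon mds.<name> session ls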

# rados df
...
total_objects191
total_used   5.7 GiB
total_avail  367 GiB
total_space  373 GiB


Cephfs version 13.2.1 on CentOs 7.5
Kernel: 3.10.0-862.11.6.el7.x86_64
1x Active MDS, 1x Replay Standby MDS
3x MON
4x OSD
Bluestore FS

Ceph kernel client on CentOs 7.4
Kernel: 4.18.7-1.el7.elrepo.x86_64  (almost the latest, should be good?)

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How many objects to expect?

2018-09-26 Thread Thomas Sumpter
Hello,

I have two independent but almost identical systems: on one of them (A) the total 
number of objects stays around 200, while on the other (B) it has been steadily 
increasing and now seems to have levelled off at around 4000 objects.
The total used data remains roughly the same, but this data is continuously 
being updated.
I was wondering what determines how many objects are created to store a given 
amount of data and what is better, more or fewer objects?

Configuration is the same between A and B: 200 PGs, "osd_max_object_size": 
"134217728".
A has about half the PGs with 1-2 objects and the rest with 0.
B has about half the PGs with 20-30 objects and the rest with 0-2.

The main difference between the two setups is that I am using the kernel client on A 
and testing the fuse client on B.
Is that significant?

Cephfs version 13.2.1 on CentOs 7.5
Kernel: 3.10.0-862.11.6.el7.x86_64

Ceph kernel client on CentOs 7.4
Kernel: 4.18.7-1.el7.elrepo.x86_64

Ceph fuse client: 13.2.1 on CentOs 7.4
Kernel: 4.18.7-1.el7.elrepo.x86_64

A
# rados df
...
total_objects191
total_used   5.7 GiB
total_avail  367 GiB
total_space  373 GiB

B
# rados df
...
total_objects3901
total_used   9.8 GiB
total_avail  363 GiB
total_space  373 GiB

Side note, just in case it could be related: I came to this question while 
trying to hunt down a problem on system A (the one with fewer objects) 
where "getattr pAsLsXsFs" client requests are often taking a long time to 
complete (20-30 seconds), delayed by "failed to rdlock, waiting".

Any pointers or advice is much appreciated!

Many Thanks for your time!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] All shards of PG missing object and inconsistent

2018-09-21 Thread Thomas White
Hi all,

 

I have recently performed a few tasks, namely purging several buckets from
our RGWs and adding additional hosts into Ceph, causing some data movement for
a rebalance. As this is now almost complete, I kicked off some deep scrubs
and one PG is now returning the following information:

 

2018-09-21 23:17:59.717286 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 shard 313 missing
14:18daa344:::default.162489536.28__shadow_24TB%2f24TB%2fDESIGNTEAM%2fPROJEC
TS%2fDLA Piper%2f_DLA_4033_ Global Thought Leadership%2fFilm%2f04
Assets%2fFootage %ef%80%a2 Audio
Sync%2fDLA_Thought_Leadership_NYC%2fCam_1%2fA13I1483.MOV.2~5v3nJDNLONBYszy54
ZXZZQgos1D4Ywp.359_6:head

2018-09-21 23:17:59.717292 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 shard 665 missing
14:18daa344:::default.162489536.28__shadow_24TB%2f24TB%2fDESIGNTEAM%2fPROJEC
TS%2fDLA Piper%2f_DLA_4033_ Global Thought Leadership%2fFilm%2f04
Assets%2fFootage %ef%80%a2 Audio
Sync%2fDLA_Thought_Leadership_NYC%2fCam_1%2fA13I1483.MOV.2~5v3nJDNLONBYszy54
ZXZZQgos1D4Ywp.359_6:head

2018-09-21 23:17:59.885884 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 shard 385 missing
14:18daa344:::default.162489536.28__shadow_24TB%2f24TB%2fDESIGNTEAM%2fPROJEC
TS%2fDLA Piper%2f_DLA_4033_ Global Thought Leadership%2fFilm%2f04
Assets%2fFootage %ef%80%a2 Audio
Sync%2fDLA_Thought_Leadership_NYC%2fCam_1%2fA13I1483.MOV.2~5v3nJDNLONBYszy54
ZXZZQgos1D4Ywp.359_6:head

2018-09-21 23:20:24.954402 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 scrub stat mismatch, got 44026/44025 objects, 0/0 clones,
44026/44025 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts,
45423386817/45419192513 bytes, 0/0 hit_set_archive bytes.

2018-09-21 23:20:24.954418 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 scrub 1 missing, 0 inconsistent objects

2018-09-21 23:20:24.954421 7f2f16796700 -1 log_channel(cluster) log [ERR] :
14.1b18 scrub 4 errors

 

I recognise the object by name as belonging to a bucket purged earlier in
the day, so it is meant to be deleted. What would be the best means to resolve
this inconsistency when the object is supposed to be absent?
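The only next step I can see is to confirm the inconsistency and repair the
PG, i.e. roughly:

rados list-inconsistent-obj 14.1b18 --format=json-pretty
ceph pg repair 14.1b18

but I wanted to check first, given the object is meant to be gone anyway.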

 

Kind Regards,

 

Thomas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delay Between Writing Data and that Data being available for reading?

2018-09-20 Thread Thomas Sumpter
In case you or anyone else reading is interested, I tried using the latest fuse 
client instead of the kernel client and my problem seems to be gone.
I think our kernel is recent enough that it should include the bug fix you 
mentioned, so maybe something else is going on there...
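For reference, the fuse mount I am testing is nothing special, along the lines of:

ceph-fuse -m mon-1:6789,mon-2:6789,mon-3:6789 --id admin /mnt/ceph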

Regards,
Tom

From: Thomas Sumpter
Sent: Wednesday, September 19, 2018 4:31 PM
To: 'Gregory Farnum' 
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

Linux version 4.18.4-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 
4.8.5 20150623 (Red Hat 4.8.5-28) (GCC))
CentOS 7

From: Gregory Farnum mailto:gfar...@redhat.com>>
Sent: Wednesday, September 19, 2018 4:27 PM
To: Thomas Sumpter mailto:thomas.sump...@irdeto.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

Okay, so you’re using the kernel client. What kernel version is it? I think 
this was one of a few known bugs there a while ago that have since been fixed.
On Wed, Sep 19, 2018 at 7:24 AM Thomas Sumpter 
mailto:thomas.sump...@irdeto.com>> wrote:
Hi Gregory,

Thanks for your reply.

Yes, the file is stored on CephFS.
Accessed using ceph client
Everything is a basic install following the ceph-deploy guide

Not sure what details would be helpful…
The file is written to by a webserver (apache)
The file is accessed by the webserver on request of some specific data within 
the file. The webserver will fetch the file from the local OS i.e. 
file://some_file, which in turn is fetched from the cephfs.
An observation, if I tell the webserver to make a web request to itself to 
fetch the file (http://localhost/etc), then I don’t have the problem.

# mount | grep ceph
mon-1:6789,mon-2:6789,mon-3:6789:/ on /mnt/ceph type ceph 
(rw,noatime,name=admin,secret=,acl)

# rpm -qa | grep ceph
libcephfs2-13.2.1-0.el7.x86_64
python-cephfs-13.2.1-0.el7.x86_64
ceph-common-13.2.1-0.el7.x86_64

# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

Regards,
Tom

From: Gregory Farnum mailto:gfar...@redhat.com>>
Sent: Wednesday, September 19, 2018 4:04 PM
To: Thomas Sumpter mailto:thomas.sump...@irdeto.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

You're going to need to tell us *exactly* what you're doing. I presume this 
uses CephFS somehow? Are you accessing via NFS or something? Using what client 
versions?

CephFS certainly isn't supposed to allow this, and I don't think there are any 
currently known bugs which could leak it. But there are lots of things you can 
stack on top of it which won't provide the same guarantees.

On Wed, Sep 19, 2018 at 6:45 AM Thomas Sumpter 
mailto:thomas.sump...@irdeto.com>> wrote:
Hello,

We have Mimic version 13.2.1 using Bluestore. OSDs are using NVMe disks for 
data storage (in AWS).
Four OSDs are active in replicated mode.
Further information on request, since there are so many config options I am not 
sure where to focus my attention yet. Assume we have default options.

We have a scenario where one file is continuously being written to and read from.
Very occasionally the write operation is completed but then the subsequent read 
op on that file does not contain this new data for a brief period.
Does anyone know a reason for the delay between write operations being 
completed and the new data that was written to be present in the file?
It is not very easily reproducible, but possible with fast scripted attempts. 
We do not have this problem when using other filesystems.

Note: there is one client writing data and 2 clients reading data to/from this 
file.

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delay Between Writing Data and that Data being available for reading?

2018-09-19 Thread Thomas Sumpter
Linux version 4.18.4-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 
4.8.5 20150623 (Red Hat 4.8.5-28) (GCC))
CentOS 7

From: Gregory Farnum 
Sent: Wednesday, September 19, 2018 4:27 PM
To: Thomas Sumpter 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

Okay, so you’re using the kernel client. What kernel version is it? I think 
this was one of a few known bugs there a while ago that have since been fixed.
On Wed, Sep 19, 2018 at 7:24 AM Thomas Sumpter 
mailto:thomas.sump...@irdeto.com>> wrote:
Hi Gregory,

Thanks for your reply.

Yes, the file is stored on CephFS.
Accessed using ceph client
Everything is a basic install following the ceph-deploy guide

Not sure what details would be helpful…
The file is written to by a webserver (apache)
The file is accessed by the webserver on request of some specific data within 
the file. The webserver will fetch the file from the local OS i.e. 
file://some_file, which in turn is fetched from the cephfs.
An observation, if I tell the webserver to make a web request to itself to 
fetch the file (http://localhost/etc), then I don’t have the problem.

# mount | grep ceph
mon-1:6789,mon-2:6789,mon-3:6789:/ on /mnt/ceph type ceph 
(rw,noatime,name=admin,secret=,acl)

# rpm -qa | grep ceph
libcephfs2-13.2.1-0.el7.x86_64
python-cephfs-13.2.1-0.el7.x86_64
ceph-common-13.2.1-0.el7.x86_64

# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

Regards,
Tom

From: Gregory Farnum mailto:gfar...@redhat.com>>
Sent: Wednesday, September 19, 2018 4:04 PM
To: Thomas Sumpter mailto:thomas.sump...@irdeto.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

You're going to need to tell us *exactly* what you're doing. I presume this 
uses CephFS somehow? Are you accessing via NFS or something? Using what client 
versions?

CephFS certainly isn't supposed to allow this, and I don't think there are any 
currently known bugs which could leak it. But there are lots of things you can 
stack on top of it which won't provide the same guarantees.

On Wed, Sep 19, 2018 at 6:45 AM Thomas Sumpter 
mailto:thomas.sump...@irdeto.com>> wrote:
Hello,

We have Mimic version 13.2.1 using Bluestore. OSDs are using NVMe disks for 
data storage (in AWS).
Four OSDs are active in replicated mode.
Further information on request, since there are so many config options I am not 
sure where to focus my attention yet. Assume we have default options.

We have a scenario where one file is continuously being written to and read from.
Very occasionally the write operation is completed but then the subsequent read 
op on that file does not contain this new data for a brief period.
Does anyone know a reason for the delay between write operations being 
completed and the new data that was written to be present in the file?
It is not very easily reproducible, but possible with fast scripted attempts. 
We do not have this problem when using other filesystems.

Note: there is one client writing data and 2 clients reading data to/from this 
file.

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delay Between Writing Data and that Data being available for reading?

2018-09-19 Thread Thomas Sumpter
Hi Gregory,

Thanks for your reply.

Yes, the file is stored on CephFS.
Accessed using ceph client
Everything is a basic install following the ceph-deploy guide

Not sure what details would be helpful…
The file is written to by a webserver (apache)
The file is accessed by the webserver on request of some specific data within 
the file. The webserver will fetch the file from the local OS i.e. 
file://some_file, which in turn is fetched from the cephfs.
An observation, if I tell the webserver to make a web request to itself to 
fetch the file (http://localhost/etc), then I don’t have the problem.

# mount | grep ceph
mon-1:6789,mon-2:6789,mon-3:6789:/ on /mnt/ceph type ceph 
(rw,noatime,name=admin,secret=,acl)

# rpm -qa | grep ceph
libcephfs2-13.2.1-0.el7.x86_64
python-cephfs-13.2.1-0.el7.x86_64
ceph-common-13.2.1-0.el7.x86_64

# ceph --version
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)

Regards,
Tom

From: Gregory Farnum 
Sent: Wednesday, September 19, 2018 4:04 PM
To: Thomas Sumpter 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Delay Between Writing Data and that Data being 
available for reading?

You're going to need to tell us *exactly* what you're doing. I presume this 
uses CephFS somehow? Are you accessing via NFS or something? Using what client 
versions?

CephFS certainly isn't supposed to allow this, and I don't think there are any 
currently known bugs which could leak it. But there are lots of things you can 
stack on top of it which won't provide the same guarantees.

On Wed, Sep 19, 2018 at 6:45 AM Thomas Sumpter 
mailto:thomas.sump...@irdeto.com>> wrote:
Hello,

We have Mimic version 13.2.1 using Bluestore. OSDs are using NVMe disks for 
data storage (in AWS).
Four OSDs are active in replicated mode.
Further information on request, since there are so many config options I am not 
sure where to focus my attention yet. Assume we have default options.

We have a scenario where one file is continuously being written to and read from.
Very occasionally the write operation is completed but then the subsequent read 
op on that file does not contain this new data for a brief period.
Does anyone know a reason for the delay between write operations being 
completed and the new data that was written to be present in the file?
It is not very easily reproducible, but possible with fast scripted attempts. 
We do not have this problem when using other filesystems.

Note: there is one client writing data and 2 clients reading data to/from this 
file.

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Delay Between Writing Data and that Data being available for reading?

2018-09-19 Thread Thomas Sumpter
Hello,

We have Mimic version 13.2.1 using Bluestore. OSDs are using NVMe disks for 
data storage (in AWS).
Four OSDs are active in replicated mode.
Further information on request, since there are so many config options I am not 
sure where to focus my attention yet. Assume we have default options.

We have a scenario where one file is continuously being written to and read from.
Very occasionally the write operation is completed but then the subsequent read 
op on that file does not contain this new data for a brief period.
Does anyone know a reason for the delay between write operations being 
completed and the new data that was written to be present in the file?
It is not very easily reproducible, but possible with fast scripted attempts. 
We do not have this problem when using other filesystems.

Note: there is one client writing data and 2 clients reading data to/from this 
file.

Many Thanks!
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Installing ceph 12.2.4 via Ubuntu apt

2018-08-29 Thread Thomas Bennett
Hi David,

Thanks for your reply. That's how I'm currently handling it.
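For anyone finding this in the archives, the rough shape of that local-repo
approach (the directory name and pinned version string below are just
examples; dpkg-scanpackages comes from the dpkg-dev package):

mkdir -p /opt/ceph-12.2.4 && cd /opt/ceph-12.2.4
# download the 12.2.4 .deb files from download.ceph.com/debian-luminous/ into this directory
dpkg-scanpackages . /dev/null | gzip -9c > Packages.gz
echo "deb [trusted=yes] file:/opt/ceph-12.2.4 ./" > /etc/apt/sources.list.d/ceph-local.list
apt update && apt install ceph=12.2.4-1xenial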

Kind regards,
Tom

On Tue, Aug 28, 2018 at 4:36 PM David Turner  wrote:

> That is the expected behavior of the ceph repo. In the past when I needed
> a specific version I would download the packages for the version to a
> folder and you can create a repo file that reads from a local directory.
> That's how I would re-install my test lab after testing an upgrade
> procedure to try it over again.
>
> On Tue, Aug 28, 2018, 1:01 AM Thomas Bennett  wrote:
>
>> Hi,
>>
>> I'm wanting to pin to an older version of Ceph Luminous (12.2.4) and I've
>> noticed that https://download.ceph.com/debian-luminous/ does not support
>> this via apt install:
>> apt install ceph works for 12.2.7 but
>> apt install ceph=12.2.4-1xenial does not work
>>
>> The deb files are there; they're just not included in the package
>> index. Is this the desired behaviour or a misconfiguration?
>>
>> Cheers,
>> Tom
>>
>> --
>> Thomas Bennett
>>
>> SARAO
>> Science Data Processing
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>

-- 
Thomas Bennett

SARAO
Science Data Processing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SAN or DAS for Production ceph

2018-08-28 Thread Thomas White
Hi James,

 

I can see where some of the confusion has arisen, hopefully I can put at least 
some of it to rest. In the Tumblr post from Yahoo, the keyword to look out for 
is “nodes”, which is distinct from individual hard drives which in Ceph is an 
OSD in most cases. So you would have multiple OSDs per node.

 

My quick napkin math would suggest that they are using 54 storage nodes, each 
holding 16 drives/OSDs (this doesn’t count the OS drives which aren’t specified 
in the post), as with the below math:

 

54 storage nodes providing 3.2PB of raw store requires ~59.25TB of storage per 
node

59.25TB / 12 = 4.94TB per OSD

59.25TB / 14 = 4.32TB per OSD

59.25TB / 16 = 3.70TB per OSD

 

Total OSDs per cluster = 864

EC Calculation: 8 / (8+3) = 72.73%

 

As they are using an 8/3 erasure coding configuration, that would provide an 
efficiency of 72.73% (see EC Calculation), so the usable capacity per storage 
cluster is around 2.33PB.

 

I haven’t included the calculation for anything below 12 as while it is 
possible, I find the 16 drive configuration most probable. As Ceph crush weight 
is shown using TiB, but most hard drives are marketed in TB due to the higher 
value, that would mean that 4TB drives are in use providing 3.63TiB of usable 
space on the drive. The math isn’t perfect here as you can see, but I’d think 
it is a safe assumption that they have at least a few higher capacity drives in 
there, or a wider mix of such standard commodity drive sizes with 4TB simply 
being a decent average.

 

For object storage clusters, particularly in use cases of high volumes of small 
objects, a standard OSD/node density is preferable which hovers between 10 and 
16 OSDs per server depending who you ask (some reading on the subject courtesy 
of RedHat 
https://www.redhat.com/cms/managed-files/st-ceph-storage-qct-object-storage-reference-architecture-f7901-201706-v2-en.pdf).
As Yahoo’s workload notes consistency and latency as important 
metrics, they are also likely to use this density profile rather than something 
higher – this has the added benefit of quicker recovery times in the event of 
an individual OSD/host failure, which is a parameter they tuned quite 
extensively.

 

For hashing algorithms and load balancing, I am not quite sure I understand 
your question, but RGW which implements object storage in Ceph has the ability 
to configure multiple zones/groups/regions, it might be best to have a read 
through the docs first:

http://docs.ceph.com/docs/luminous/radosgw/multisite/

 

Ceph is quite different from a SAN or DAS, and gives a great deal more 
flexibility too. If you are unsure on getting started and you need to hit the 
ground running strongly (ie a multi-PB production system), I’d really recommend 
getting a reliable consultant or taking out professional support services for 
it. Ceph is a piece of cake to manage when everything is working well, and very 
often this will be the case for a long time, but you will really value good 
planning and experience when you hit those rough patches.

 

Hope that helps,

 

Tom

 

 

From: ceph-users  On Behalf Of James Watson
Sent: 28 August 2018 21:05
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SAN or DAS for Production ceph

 

Dear cephers, 

 

I am new to the storage domain. 

Trying to get my head around the enterprise - production-ready setup. 

 

The following article helps a lot here: (Yahoo ceph implementation)

https://yahooeng.tumblr.com/tagged/object-storage

 

But a couple of questions:

 

What HDD would they have used here? NVMe / SATA / SAS etc. (with just 52 storage 
nodes they got 3.2 PB of capacity!!)

I tried to calculate a similar setup with HGST Ultrastar He12 (12TB and more 
recent) and would need 86 HDDs, which adds up to only 1 PB!!

 

How is the HDD drive attached is it DAS or a SAN (using Fibre Channel Switches, 
Host Bus Adapters etc)?

 

Do we need a proprietary hashing algorithm to implement multi-cluster based 
setup of ceph to contain CPU/Memory usage within the cluster when rebuilding 
happens during device failure?

 

If proprietary hashing algorithm is required to setup multi-cluster ceph using 
load balancer - then what could be the alternative setup we can deploy to 
address the same issue?

 

The aim is to design a similar architecture but with upgraded products and 
higher performance. - Any suggestions or thoughts are welcome 

 

 

 

Thanks in advance

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Installing ceph 12.2.4 via Ubuntu apt

2018-08-28 Thread Thomas Bennett
Hi,

I'm wanting to pin to an older version of Ceph Luminous (12.2.4) and I've
noticed that https://download.ceph.com/debian-luminous/ does not support
this via apt install:
apt install ceph works for 12.2.7 but
apt install ceph=12.2.4-1xenial does not work

The deb files are there; they're just not included in the package
index. Is this the desired behaviour or a misconfiguration?

Cheers,
Tom

--
Thomas Bennett

SARAO
Science Data Processing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PG could not be repaired

2018-08-14 Thread Thomas White
Hi Arvydas,

 

The error seems to suggest this is not an issue with your object data, but the 
expected object digest data. I am unable to access where I stored my very hacky 
diagnosis process for this, but our eventual fix was to locate the bucket or 
files affected and then rename an object within it, forcing a recalculation of 
the digest. Depending on the size of the pool perhaps it would be possible to 
randomly rename a few files to cause this recalculation to occur to see if this 
remedies it?
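If you go via S3, the rename itself can be as simple as a move-and-move-back
of a couple of objects in the affected bucket, e.g. with s3cmd (the bucket and
key names here are placeholders):

s3cmd mv s3://affected-bucket/some-object s3://affected-bucket/some-object.tmp
s3cmd mv s3://affected-bucket/some-object.tmp s3://affected-bucket/some-object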

 

Kind Regards,

 

Tom

 

From: ceph-users  On Behalf Of Arvydas 
Opulskis
Sent: 14 August 2018 12:33
To: Brent Kennedy 
Cc: Ceph Users 
Subject: Re: [ceph-users] Inconsistent PG could not be repaired

 

Thanks for the suggestion about restarting OSDs, but this doesn't work either.

 

Anyway, I managed to fix the second unrepairable PG by getting the object from the OSD and 
saving it again via rados, but still no luck with the first one. 

I think I found the main problem why this doesn't work. It seems the object is not 
overwritten, even though the rados command returns no errors. I tried to delete the object, 
but it still stays in the pool untouched. Here is an example of what I see:

 

# rados -p .rgw.buckets ls | grep -i 
"sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets get 
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
 testfile
error getting 
.rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d:
 (2) No such file or directory

# rados -p .rgw.buckets rm 
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets ls | grep -i 
"sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

 

I've never seen this in our Ceph clusters before. Should I report a bug about 
it? If any of you guys need more diagnostic info - let me know.

 

Thanks,

Arvydas

 

On Tue, Aug 7, 2018 at 5:49 PM, Brent Kennedy mailto:bkenn...@cfl.rr.com> > wrote:

Last time I had an inconsistent PG that could not be repaired using the repair 
command, I looked at which OSDs hosted the PG, then restarted them one by 
one (usually stopping, waiting a few seconds, then starting them back up).  You 
could also stop them, flush the journal, then start them back up.
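(On a systemd / FileStore setup that sequence is roughly, per OSD and one at a
time:

systemctl stop ceph-osd@<id>
ceph-osd -i <id> --flush-journal
systemctl start ceph-osd@<id>
)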

 

If that didn’t work, it meant there was data loss and I had to use the 
ceph-objectstore-tool repair tool to export the objects from a location that 
had the latest data and import into the one that had no data.  The 
ceph-objectstore-tool is not a simple thing though and should not be used 
lightly.  When I say data loss, I mean that ceph thinks the last place written 
has the data, that place being the OSD that doesn’t actually have the 
data(meaning it failed to write there).

 

If you want to go that route, let me know, I wrote a how to on it.  Should be 
the last resort though.  I also don’t know your setup, so I would hate to 
recommend something so drastic.

 

-Brent

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
 ] On Behalf Of Arvydas Opulskis
Sent: Monday, August 6, 2018 4:12 AM
To: ceph-users@lists.ceph.com  
Subject: Re: [ceph-users] Inconsistent PG could not be repaired

 

Hi again,

 

after two weeks I've got another inconsistent PG in same cluster. OSD's are 
different from first PG, object can not be GET as well: 


# rados list-inconsistent-obj 26.821 --format=json-pretty

{

"epoch": 178472,

"inconsistents": [

{

"object": {

"name": 
"default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",

"nspace": "",

"locator": "",

"snap": "head",

"version": 118920

},

"errors": [],

"union_shard_errors": [

"data_digest_mismatch_oi"

],

"selected_object_info": 
"26:8411bae4:::default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920
 client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv 118920 
dd cd142aaa od  alloc_hint [0 0])",

"shards": [

{

"osd": 20,

"errors": [


Re: [ceph-users] Ceph upgrade Jewel to Luminous

2018-08-14 Thread Thomas White
Hi Jaime,

 

Upgrading directly should not be a problem. It is usually recommended to go to 
the latest minor release before upgrading major versions, but my own migration 
from 10.2.10 to 12.2.5 went seamlessly and I can’t see any technical 
limitation which would hinder or prevent this process.
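The usual order still applies, of course - mons (plus a mgr, which is new in
Luminous), then OSDs, then MDS/RGW - and once every daemon reports 12.2.x you
finish with something like:

ceph versions
ceph osd require-osd-release luminous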

 

Kind Regards,

 

Tom

 

From: ceph-users  On Behalf Of Jaime Ibar
Sent: 14 August 2018 10:00
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph upgrade Jewel to Luminous

 

Hi all,

we're running Ceph Jewel 10.2.10 in our cluster and we plan to upgrade to the latest Luminous
release (12.2.7). Jewel 10.2.11 was released one month ago and our plan was to upgrade to
this release first and then upgrade to Luminous, but as someone reported OSD crashes after
upgrading to Jewel 10.2.11, we wonder if it would be possible to skip this Jewel release and
upgrade directly to Luminous 12.2.7.

Thanks

Jaime


Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie   | ja...@tchpc.tcd.ie 
 
Tel: +353-1-896-3725 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I ran OS on SD card ?

2018-08-13 Thread Thomas White
Hi Steven,

 

Just to somewhat clarify my previous post: I mention OSDs in the sense that the 
OS is installed on the OSD server using the SD card; I would absolutely 
recommend against using SD cards as the actual OSD media. This of course misses 
another point, which is that for the mons or other such services I would advise 
against an SD card setup even more strongly, because of the mon's store.db, 
which can grow dramatically while tunables are being altered and rapidly incur 
excessive wear on an SD card.
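
If anyone wants a feel for the numbers on their own cluster, the mon store size is 
easy to check, and the store can be compacted by hand; a quick sketch (the mon id 
is a placeholder):

du -sh /var/lib/ceph/mon/*/store.db
ceph tell mon.ceph-mon1 compact     # compact that mon's store on demand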

 

Kind Regards,

 

Tom

 

From: ceph-users  On Behalf Of 
tho...@thomaswhite.se
Sent: 13 August 2018 20:27
To: 'Steven Vacaroaia' ; 'ceph-users' 

Subject: Re: [ceph-users] limited disk slots - should I ran OS on SD card ?

 

Hi Steven,

 

If you are running OSDs on the SD card, there would be nothing technically 
stopping this setup, but the main factors against it would be the endurance and 
performance limitations of SD cards and the potential fallout when they inevitably 
fail. If you factor time and maintenance as a cost just as much as the hardware, 
you would be far better off getting a small NVMe drive as mentioned. With regard 
to booting from the NVMe, I don't know the answer to that unfortunately, as we 
don't use R620s (our preference is the Dell 730XD), but I'll check with some of 
our DC team to see if they have that info to hand and let you know.

 

In my view, keep OS activity entirely separate from OSD operations too. Mixing 
the two can introduce a host of other problems which will cause a lot of pain 
along the way, such as receiving hung-task messages from the kernel when it tries 
to perform OS writes while the OSD process is in a full recovery state. Always 
account for your failure redundancy in the worst-case scenario and not the ideal 
one; I found this out personally on a very large cluster and it is not a fun 
experience.

 

Kind Regards,

 

Tom

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steven Vacaroaia
Sent: 13 August 2018 19:53
To: ceph-users <ceph-users@lists.ceph.com>
Subject: [ceph-users] limited disk slots - should I ran OS on SD card ?

 

Hi,

 

I am in the process of deploying Ceph Mimic on a few Dell R620/630 servers.

These servers have only 8 disk slots and the largest disk they accept is 2 TB.

Should I share an SSD drive between the OS and WAL/DB, or run the OS on internal 
SD cards and dedicate the SSD to DB/WAL only?

Another option is to use NVMe (I can still get a DC P3700), but I'm not sure if 
the Dell R620 allows booting from it.

 

Any advice would be greatly appreciated 

 

I am still waiting for budget approval to buy disks

In my small test environment  I did not notice any performance hit when OS was 
on SD card

 

Many thanks

Steven

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] limited disk slots - should I ran OS on SD card ?

2018-08-13 Thread thomas
Hi Steven,

 

If you are running OSDs on the SD card, there would be nothing technically 
stopping this setup, but the main factors against it would be the endurance and 
performance limitations of SD cards and the potential fallout when they inevitably 
fail. If you factor time and maintenance as a cost just as much as the hardware, 
you would be far better off getting a small NVMe drive as mentioned. With regard 
to booting from the NVMe, I don't know the answer to that unfortunately, as we 
don't use R620s (our preference is the Dell 730XD), but I'll check with some of 
our DC team to see if they have that info to hand and let you know.

 

In my view, keep OS activity entirely separate from OSD operations too. Mixing 
the two can introduce a host of other problems which will cause a lot of pain 
along the way, such as receiving hung-task messages from the kernel when it tries 
to perform OS writes while the OSD process is in a full recovery state. Always 
account for your failure redundancy in the worst-case scenario and not the ideal 
one; I found this out personally on a very large cluster and it is not a fun 
experience.

 

Kind Regards,

 

Tom

 

From: ceph-users  On Behalf Of Steven 
Vacaroaia
Sent: 13 August 2018 19:53
To: ceph-users 
Subject: [ceph-users] limited disk slots - should I ran OS on SD card ?

 

Hi,

 

I am in the process of deploying Ceph Mimic on a few Dell R620/630 servers.

These servers have only 8 disk slots and the largest disk they accept is 2 TB.

Should I share an SSD drive between the OS and WAL/DB, or run the OS on internal 
SD cards and dedicate the SSD to DB/WAL only?

Another option is to use NVMe (I can still get a DC P3700), but I'm not sure if 
the Dell R620 allows booting from it.

 

Any advice would be greatly appreciated 

 

I am still waiting for budget approval to buy disks

In my small test environment  I did not notice any performance hit when OS was 
on SD card

 

Many thanks

Steven

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore OSD Segfaults (12.2.5/12.2.7)

2018-08-07 Thread Thomas White
Hi all,

We have recently begun switching over to Bluestore on our Ceph cluster, 
currently on 12.2.7. We first began encountering segfaults on Bluestore during 
12.2.5, but strangely these segfaults apply exclusively to our SSD pools and 
not the PCIe/HDD disks. We upgraded to 12.2.7 last week to get clear of the 
issues known within 12.2.6, hoping it might also address our Bluestore issues, 
but to no avail, and upgrading to Mimic is not feasible for us right away as this 
is a production environment.

I have attached the log of one of the OSDs experiencing the segfault, as well 
as the output of the recommended command for interpreting the debug information. 
Unfortunately, at present I am unable to open a bug tracker issue for this due 
to a 403 error.

OSD Log: https://transfer.sh/AYQ8Y/ceph-osd.123.log
OSD Binary debug: https://transfer.sh/FOiLv/ceph-osd-123-binary.txt.tar.gz

The disks in use are Intel DC S3710s 800G.

These OSDs were previously filestore and fully operational, and the procedure 
for migrating them was the usual one: mark them out, await recovery, zap and 
redeploy. We further used dd to ensure the disk was fully wiped and ran 
smartctl tests to rule out disk errors, but were unable to find any faults.
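
For completeness, the zap-and-redeploy we mean is roughly of this shape on 12.2.x 
(the OSD id and device are placeholders, not our exact runbook):

ceph osd out 123
# wait for backfill to finish and the cluster to return to active+clean, then:
systemctl stop ceph-osd@123
ceph osd purge 123 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX --destroy
ceph-volume lvm create --bluestore --data /dev/sdX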

What may be unusual is that only some of the SSDs are encountering this segfault 
so far. On one host where we have 8 OSDs, only 2 of them are hitting the 
segfaults. However, we have noticed the new OSDs are considerably more prone to 
being marked down despite minimal load.

Any advice anyone could offer on this would be great.

Kind Regards,

Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reset Object ACLs in RGW

2018-08-02 Thread thomas
Hi Casey,

Thanks for the tip in the right direction. I originally tried creating an
admin user to accomplish this and didn't realise the difference between an
admin and a system user. Using a system user I was able to iterate over the
contents of the buckets and reset the object ownership back to the bucket
owner. Below is a very, very ugly bash script I used to achieve this, which I
don't recommend anyone use, but it's here for reference for anyone else in a
similar predicament:

IFS=$'\n'
for i in $(aws s3api --endpoint-url https://ceph-rgw-endpoint-here \
        list-objects --bucket "bucketname" --output json | jq -r '.Contents[] | (.Key)'); do
    echo "restoring ownership on $i"
    aws s3api --endpoint-url https://ceph-rgw-endpoint-here put-object-acl \
        --grant-full-control id=idhere --bucket "bucketname" --key "$i"
done
unset IFS

You'll need to install the aws toolkit and jq of course and configure them.
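
For completeness, the system user itself is just a radosgw-admin one-liner; 
something like this (the uid and display name are arbitrary):

radosgw-admin user create --uid=acl-fixer --display-name="ACL fixer" --system
radosgw-admin user info --uid=acl-fixer    # grab the access/secret key for the aws cli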

Thanks again,

Tom


-Original Message-
From: ceph-users  On Behalf Of Casey
Bodley
Sent: 02 August 2018 17:08
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Reset Object ACLs in RGW


On 08/02/2018 07:35 AM, Thomas White wrote:
> Hi all,
>
> At present I have a cluster with a user on the RGW who has lost access to
many of his files. The bucket has the correct ACL to be accessed by the
account and so with their access and secret key many items can be listed,
but are unable to be downloaded.
>
> Is there a way of using the radosgw-admin tool to reset (or set) ACLs on
individual files or recursively across bucket objects to restore access for
them?
>
> Kind Regards,
>
> Tom
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi Tom,

I don't think radosgw-admin can do this. But you can create a system user
(radosgw-admin user create --system ...) which overrides permission checks,
and use it to issue s3 operations to manipulate the acls.

Casey
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Reset Object ACLs in RGW

2018-08-02 Thread Thomas White
Hi all,

At present I have a cluster with a user on the RGW who has lost access to many 
of his files. The bucket has the correct ACL to be accessed by the account and 
so with their access and secret key many items can be listed, but are unable to 
be downloaded.

Is there a way of using the radosgw-admin tool to reset (or set) ACLs on 
individual files or recursively across bucket objects to restore access for 
them?

Kind Regards,

Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client.bootstrap-osd authentication error - which keyrin

2018-07-09 Thread Thomas Roth
Thanks, but doesn't work.

It is always the subcommand
/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json

(also 'ceph ... osd tree -i - osd new NEWID')

which fails with client.bootstrap-osd authentication error

Of course, 'ceph osd tree' works just fine.


It must be something I have missed when upgrading from Jewel. In fact, there were 
no bootstrap-xxx keyrings anywhere; I just have my /etc/ceph/ceph.mon.keyring, 
which seems to have managed the magic before.
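
In case it helps anyone else who upgraded from a Jewel-era manual install, the 
bootstrap key can presumably be (re)generated from a node that has a working 
admin keyring and dropped into place, along these lines:

ceph auth get-or-create client.bootstrap-osd mon 'allow profile bootstrap-osd' \
    -o /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph auth get client.bootstrap-osd    # check the key and caps match what Paul describes below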

Cheers,
Thomas

On 07/06/2018 09:36 PM, Paul Emmerich wrote:
> Hi,
> 
> both ceph-disk and ceph-volume need a keyring in the file
> 
> /var/lib/ceph/bootstrap-osd/ceph.keyring
> 
> The key should look like this:
> 
> [client.bootstrap-osd]
> key = 
> caps mon = "allow profile bootstrap-osd"
> 
> 
> Paul
> 
> 
> 2018-07-06 16:47 GMT+02:00 Thomas Roth :
> 
>> Hi all,
>>
>> I wonder which is the correct key to create/recreate an additional OSD
>> with 12.2.5.
>>
>> Following
>> http://docs.ceph.com/docs/master/rados/operations/
>> bluestore-migration/#convert-existing-osds, I took
>> one of my old OSD out of the cluster, but failed subsequently recreating
>> it as a BlueStor OSD.
>>
>> I tried "ceph-volume" at first, now got one step further using "ceph-disk"
>> with
>> "ceph-disk prepare --bluestore /dev/sdh", which completed, I assume
>> successfully.
>>
>> However, "ceph-disk activate" fails with basically the same error as
>> "ceph-volume" before,
>>
>>
>> ~# ceph-disk activate /dev/sdh1
>> command_with_stdin: 2018-07-06 16:23:18.677429 7f905de45700  0 librados:
>> client.bootstrap-osd
>> authentication error (1) Operation not permitted
>> [errno 1] error connecting to the cluster
>>
>>
>>
>> Now this test cluster was created under Jewel, where I created OSDs by
>> "ceph-osd -i $ID --mkfs --mkkey --osd-uuid $UUID"
>> and
>> "ceph auth add osd.#{ID} osd 'allow *' mon 'allow profile osd' -i
>> /var/lib/ceph/osd/ceph-#{ID}/keyring"
>>
>> This did not produce any "/var/lib/ceph/bootstrap-osd/ceph.keyring", but
>> I found them on my mon hosts.
>> "ceph-volume" and "ceph-disk" go looking for that file, so I put it there,
>> to no avail.
>>
>>
>>
>> Btw, the target server has still several "up" and "in" OSDs running, so
>> this is not a question of
>> network or general authentication issues.
>>
>> Cheers,
>> Thomas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 

-- 

Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Ursula Weyrich
Professor Dr. Paolo Giubellino
Jörg Blaurock

Vorsitzende des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] client.bootstrap-osd authentication error - which keyrin

2018-07-06 Thread Thomas Roth
Hi all,

I wonder which is the correct key to create/recreate an additional OSD with 
12.2.5.

Following
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds,
 I took
one of my old OSD out of the cluster, but failed subsequently recreating it as 
a BlueStor OSD.

I tried "ceph-volume" at first, now got one step further using "ceph-disk" with
"ceph-disk prepare --bluestore /dev/sdh", which completed, I assume 
successfully.

However, "ceph-disk activate" fails with basically the same error as 
"ceph-volume" before,


~# ceph-disk activate /dev/sdh1
command_with_stdin: 2018-07-06 16:23:18.677429 7f905de45700  0 librados: 
client.bootstrap-osd
authentication error (1) Operation not permitted
[errno 1] error connecting to the cluster



Now this test cluster was created under Jewel, where I created OSDs by
"ceph-osd -i $ID --mkfs --mkkey --osd-uuid $UUID"
and
"ceph auth add osd.#{ID} osd 'allow *' mon 'allow profile osd' -i 
/var/lib/ceph/osd/ceph-#{ID}/keyring"

This did not produce any "/var/lib/ceph/bootstrap-osd/ceph.keyring", but I 
found them on my mon hosts.
"ceph-volume" and "ceph-disk" go looking for that file, so I put it there, to 
no avail.



Btw, the target server has still several "up" and "in" OSDs running, so this is 
not a question of
network or general authentication issues.

Cheers,
Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pre-sharding s3 buckets

2018-06-27 Thread Thomas Bennett
Hi Matthew,

Thanks for your reply, much appreciated.

Sorry, I meant to say that we're running on Luminous, so I'm aware of
dynamic resharding - however, I'm worried that this does not suit our
particular use case.

What I also forgot to mention is that we could be resharding a bucket 30
times in 8 hours as we will write ~3 million objects in ~8 hours.

Hence the idea that we should preshard to avoid any undesirable workloads.
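
For anyone searching the archives later: the pre-shard default Matthew mentions 
below is just a ceph.conf option on the RGW side, something along these lines 
(the client section name depends on how your rgw instances are named, and 
disabling dynamic resharding is optional):

[client.rgw.gateway1]
rgw override bucket index max shards = 16
rgw dynamic resharding = false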

Cheers,
Tom

On Wed, Jun 27, 2018 at 3:16 PM, Matthew Vernon  wrote:

> Hi,
>
> On 27/06/18 11:18, Thomas Bennett wrote:
>
> > We have a particular use case that we know that we're going to be
> > writing lots of objects (up to 3 million) into a bucket. To take
> > advantage of sharding, I'm wanting to shard buckets, without the
> > performance hit of resharding.
>
> I assume you're running Jewel (Luminous has dynamic resharding); you can
> set rgw_override_bucket_index_max_shards = X in your ceph.conf, which
> will cause all new buckets to have X shards for the indexes.
>
> HTH,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>



-- 
Thomas Bennett

SRAO
Storage Engineer - Science Data Processing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pre-sharding s3 buckets

2018-06-27 Thread Thomas Bennett
Hi,

We have a particular use case where we know we're going to be writing
lots of objects (up to 3 million) into a bucket. To take advantage of
sharding, I want to pre-shard buckets without the performance hit of
resharding.

So far I've created an empty bucket and then used the radosgw-admin:

radosgw-admin bucket reshard --bucket test --num-shards 10

Is there another way to do this?

Or is there an S3/Ceph configuration option that will set the number of
shards when a bucket is created, accepting that we might have buckets
with multiple empty shards, a sacrifice I'm willing to make for the
convenience of having it preconfigured?

Cheers,
Tom

-- 
Thomas Bennett

SRAO
Storage Engineer - Science Data Processing
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph_vms performance

2018-05-23 Thread Thomas Bennett
Hi,

I'm testing out ceph_vms vs a cephfs mount with a cifs export.

I currently have 3 active ceph mds servers to maximise throughput, and
when I have configured a cephfs mount with a cifs export, I'm getting
reasonable benchmark results.

However, when I tried some benchmarking with the ceph_vms module, I
only got a third of the comparable write throughput.

I'm just wondering if this is expected, or if there is an obvious
configuration setup that I'm missing?

Configuration:

I've compiled git branch samba 4_8_test.

I'm using ceph 12.2.5

Kind regards,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-21 Thread Thomas Byrne - UKRI STFC
mon_compact_on_start was not changed from default (false). From the logs, it 
looks like the monitor with the excessive resource usage (mon1) was up and 
winning the majority of elections throughout the period of unresponsiveness, 
with other monitors occasionally winning an election without mon1 participating 
(I’m guessing as it failed to respond).

That’s interesting about the false map updates. We had a short networking blip 
(caused by me) on some monitors shortly before the trouble started, which 
caused some monitors to start calling frequent (every few seconds) elections. 
Could this rapid creation of new monmaps have had the same effect as updating 
pool settings, causing the monitor to try to clean up in one go and producing 
the observed resource usage and unresponsiveness?

I’ve been bringing in the storage as you described; I’m in the process of 
adding 6PB of new storage to a ~10PB (raw) cluster (with ~8PB raw utilisation), 
so I’m feeling around for the largest backfills we can safely do. I had been 
weighting up storage in steps that take ~5 days to finish, but I have been 
starting the next reweight as we get to the tail end of the previous one, so not 
giving the mons time to compact their stores. Although it’s far from ideal 
(in terms of total time to get the new storage weighted up), I’ll be letting the 
mons compact between every backfill until I have a better idea of what went on 
last week.
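
For the record, the two workarounds mentioned here look roughly like this (the 
pool name and mon id are placeholders; setting a pool option to the value it 
already has is just a way of forcing a new map epoch):

ceph osd pool set rbd min_size 2      # same value it already has, nudges the maps forward
ceph tell mon.mon1 compact            # compact a monitor's store by hand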

From: David Turner <drakonst...@gmail.com>
Sent: 17 May 2018 18:57
To: Byrne, Thomas (STFC,RAL,SC) <tom.by...@stfc.ac.uk>
Cc: Wido den Hollander <w...@42on.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors holding 
onto cluster maps

Generally they clean up slowly by deleting 30 maps every time the maps update.  
You can speed that up by creating false map updates with something like 
updating a pool setting to what it already is.  What it sounds like happened to 
you is that your mon crashed and restarted.  If it crashed and has the setting 
to compact the mon store on start, then it would cause it to forcibly go 
through and clean everything up in 1 go.

I generally plan my backfilling to not take longer than a week.  Any longer 
than that is pretty rough on the mons.  You can achieve that by bringing in new 
storage with a weight of 0.0 and increase it appropriately as opposed to just 
adding it with it's full weight and having everything move at once.

On Thu, May 17, 2018 at 12:56 PM Thomas Byrne - UKRI STFC <tom.by...@stfc.ac.uk> wrote:
That seems like a sane way to do it, thanks for the clarification Wido.

As a follow-up, do you have any feeling as to whether the trimming a 
particularly intensive task? We just had a fun afternoon where the monitors 
became unresponsive (no ceph status etc) for several hours, seemingly due to 
the leaders monitor process consuming all available ram+swap (64GB+32GB) on 
that monitor. This was then followed by the actual trimming of the stores 
(26GB->11GB), which took a few minutes and happened simultaneously across the 
monitors.

If this is something to be expected, it'll be a good reason to plan our long 
backfills much more carefully in the future!

> -Original Message-
> From: ceph-users 
> <ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
> On Behalf Of Wido
> den Hollander
> Sent: 17 May 2018 15:40
> To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors
> holding onto cluster maps
>
>
>
> On 05/17/2018 04:37 PM, Thomas Byrne - UKRI STFC wrote:
> > Hi all,
> >
> >
> >
> > As far as I understand, the monitor stores will grow while not
> > HEALTH_OK as they hold onto all cluster maps. Is this true for all
> > HEALTH_WARN reasons? Our cluster recently went into HEALTH_WARN
> due to
> > a few weeks of backfilling onto new hardware pushing the monitors data
> > stores over the default 15GB threshold. Are they now prevented from
> > shrinking till I increase the threshold above their current size?
> >
>
> No, monitors will trim their data store with all PGs are active+clean, not 
> when
> they are HEALTH_OK.
>
> So a 'noout' flag triggers a WARN, but that doesn't prevent the MONs from
> trimming for example.
>
> Wido
>
> >
> >
> > Cheers
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list

Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
That seems like a sane way to do it, thanks for the clarification Wido.

As a follow-up, do you have any feeling as to whether the trimming is a 
particularly intensive task? We just had a fun afternoon where the monitors 
became unresponsive (no ceph status etc.) for several hours, seemingly due to 
the leader's monitor process consuming all available RAM+swap (64GB+32GB) on 
that monitor. This was then followed by the actual trimming of the stores 
(26GB->11GB), which took a few minutes and happened simultaneously across the 
monitors.

If this is something to be expected, it'll be a good reason to plan our long 
backfills much more carefully in the future!

> -Original Message-
> From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Wido
> den Hollander
> Sent: 17 May 2018 15:40
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors
> holding onto cluster maps
> 
> 
> 
> On 05/17/2018 04:37 PM, Thomas Byrne - UKRI STFC wrote:
> > Hi all,
> >
> >
> >
> > As far as I understand, the monitor stores will grow while not
> > HEALTH_OK as they hold onto all cluster maps. Is this true for all
> > HEALTH_WARN reasons? Our cluster recently went into HEALTH_WARN
> due to
> > a few weeks of backfilling onto new hardware pushing the monitors data
> > stores over the default 15GB threshold. Are they now prevented from
> > shrinking till I increase the threshold above their current size?
> >
> 
> No, monitors will trim their data store with all PGs are active+clean, not 
> when
> they are HEALTH_OK.
> 
> So a 'noout' flag triggers a WARN, but that doesn't prevent the MONs from
> trimming for example.
> 
> Wido
> 
> >
> >
> > Cheers
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
Hi all,

As far as I understand, the monitor stores will grow while the cluster is not 
HEALTH_OK, as they hold onto all cluster maps. Is this true for all HEALTH_WARN 
reasons? Our cluster recently went into HEALTH_WARN due to a few weeks of 
backfilling onto new hardware pushing the monitors' data stores over the default 
15GB threshold. Are they now prevented from shrinking until I increase the 
threshold above their current size?

Cheers
Tom


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Too many active mds servers

2018-05-15 Thread Thomas Bennett
Hi Patrick,

Thanks! Much appreciated.
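
For anyone else who lands here: on Luminous, lowering max_mds is not enough on 
its own; the extra rank also has to be deactivated explicitly, something like:

ceph fs set cephfs max_mds 2
ceph mds deactivate cephfs:2    # rank 2 is the one to stop
ceph fs status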

On Tue, 15 May 2018 at 14:52, Patrick Donnelly <pdonn...@redhat.com> wrote:

> Hello Thomas,
>
> On Tue, May 15, 2018 at 2:35 PM, Thomas Bennett <tho...@ska.ac.za> wrote:
> > Hi,
> >
> > I'm running Luminous 12.2.5 and I'm testing cephfs.
> >
> > However, I seem to have too many active mds servers on my test cluster.
> >
> > How do I set one of my mds servers to become standby?
> >
> > I've run ceph fs set cephfs max_mds 2 which set the max_mds from 3 to 2
> but
> > has no effect on my running configuration.
>
>
> http://docs.ceph.com/docs/luminous/cephfs/multimds/#decreasing-the-number-of-ranks
>
> Note: the behavior is changing in Mimic to be automatic after reducing
> max_mds.
>
> --
> Patrick Donnelly
>
-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Too many active mds servers

2018-05-15 Thread Thomas Bennett
Hi,

I'm running Luminous 12.2.5 and I'm testing cephfs.

However, I seem to have too many active mds servers on my test cluster.

How do I set one of my mds servers to become standby?

I've run ceph fs set cephfs max_mds 2, which set max_mds from 3 to 2, but it
has had no effect on my running configuration.

$ ceph status
  cluster:
id:
health: HEALTH_WARN
*insufficient standby MDS daemons available*

  services:
mon: 3 daemons, quorum mon1-c2-vm,mon2-c2-vm,mon3-c2-vm
mgr: mon2-c2-vm(active), standbys: mon1-c2-vm
mds: *cephfs-3/3/2 up
{0=mon1-c2-vm=up:active,1=mon3-c2-vm=up:active,2=mon2-c2-vm=up:active}*
osd: 250 osds: 250 up, 250 in
rgw: 2 daemons active

  data:
pools:   4 pools, 8456 pgs
objects: 13492 objects, 53703 MB
usage:   427 GB used, 1750 TB / 1751 TB avail
pgs: 8456 active+clean

$ ceph fs get cephfs
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 187
flags c
created 2018-05-03 10:25:21.733597
modified 2018-05-03 10:25:21.733597
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 0
last_failure_osd_epoch 1369
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
table,9=file layout v2}
*max_mds 2*
*in 0,1,2*
*up {0=43808,1=43955,2=27318}*
failed
damaged
stopped
data_pools [1,11]
metadata_pool 2
inline_data disabled
balancer
standby_count_wanted 1
43808: xx.xx.xx.xx:6800/3009065437 'mon1-c2-vm' mds.0.171 up:active seq 45
43955: xx.xx.xx.xx:6800/2947700655 'mon2-c2-vm' mds.1.174 up:active seq 28
27318: xx.xx.xx.xx:6800/652878628 'mon3-c2-vm' mds.2.177 up:active seq 8

Thanks,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What do you use to benchmark your rgw?

2018-04-05 Thread Thomas Bennett
Hi Mathew,

We approached the problem by first running swift-bench for performance
tuning and configuration, since it was the easiest way to get up and running
and test the gateway.

Then we wrote a Python script using boto and futures to model our use case
and test S3.

We found the most effective performance improvement was to place the
*.rgw.buckets.index pool on NVMe-backed OSDs.
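
If anyone wants to reproduce that layout on Luminous with device classes, it is 
roughly a class-restricted CRUSH rule plus repointing the index pool (the rule 
name below is illustrative):

ceph osd crush rule create-replicated rgw-index-nvme default host nvme
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-nvme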

Cheers,
Tom

On Wed, 28 Mar 2018 at 11:18, Matthew Vernon <m...@sanger.ac.uk> wrote:

> Hi,
>
> What are people here using to benchmark their S3 service (i.e. the rgw)?
> rados bench is great for some things, but doesn't tell me about what
> performance I can get from my rgws.
>
> It seems that there used to be rest-bench, but that isn't in Jewel
> AFAICT; I had a bit of a look at cosbench but it looks fiddly to set up
> and a bit under-maintained (the most recent version doesn't work out of
> the box, and the PR to fix that has been languishing for a while).
>
> This doesn't seem like an unusual thing to want to do, so I'd like to
> know what other ceph folk are using (and, if you like, the numbers you
> get from the benchmarkers)...?
>
> Thanks,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE
> <https://maps.google.com/?q=215+Euston+Road,+London,+NW1+2BE=gmail=g>
> .
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW default.rgw.meta pool

2018-02-05 Thread Thomas Bennett
Hi Orit,

Thanks for the reply, much appreciated.

 You cannot see the omap size using rados ls but need to use rados omap
> commands.

You can use this script to calculate the bucket index size:
> https://github.com/mkogan1/ceph-utils/blob/master/
> scripts/get_omap_kv_size.sh


Great. I had not even thought of that - thanks for the script!


> you probably meant default.rgw.meta.
> It is a namespace not a pool try using:
> rados ls -p default.rgw.meta --all
>

I see that the '--all' switch does the trick. (Sorry, I meant
default.rgw.meta.)

I now see the 'POOL NAMESPACES' section in the rgw docs and I get what it's
trying to describe. It's all starting to fall into place.

Thanks for the info :)

Regards,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW default.rgw.meta pool

2018-02-05 Thread Thomas Bennett
Hi,

In trying to understand RGW pool usage I've noticed that the pool called
*default.rgw.meta* has a large number of objects in it. Suspiciously, about
twice as many objects as in my *default.rgw.buckets.index* pool.

As I delete and add buckets, the number of objects in both pools decreases
and increases proportionally.

However, when I try to list the objects in the default.rgw.meta pool, it
returns nothing.

I.e. '*rados -p default.rgw.buckets.index** ls*' returns nothing.

Is this expected behaviour for this pool?

What are all those objects and why can I not list them?

From my understanding, *default.rgw.buckets.index* should contain things
like: domain_root, user_keys_pool, user_email_pool, user_swift_pool,
user_uid_pool.

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-31 Thread Thomas Bennett
Hi Peter,

Relooking at your problem, you might want to keep track of this issue:
http://tracker.ceph.com/issues/22440

Regards,
Tom

On Wed, Jan 31, 2018 at 11:37 AM, Thomas Bennett <tho...@ska.ac.za> wrote:

> Hi Peter,
>
> From your reply, I see that:
>
>1. pg 3.12c is part of pool 3.
>2. The osd's in the "up"  for pg 3.12c  are: 6, 0, 12.
>
>
> I suggest to check on this 'activating' issue do the following:
>
>1. What is the rule that pool 3 should follow, 'hybrid', 'nvme' or
>'hdd'? (Use the *ceph osd pool ls detail* command and look at pool 3's
>crush rule)
>2. Then check are osds 6, 0, 12 backed by nvme's or hdd's? (Use *ceph
>osd tree | grep nvme *command to find your nvme backed osds.)
>
>
> If your problem is similar to mine, you will have osds that are nvme
> backed in a pool that should only be backed by hdds, which was causing a pg
> to go into 'activating' state and staying there.
>
> Cheers,
> Tom
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-31 Thread Thomas Bennett
Hi Peter,

From your reply, I see that:

   1. pg 3.12c is part of pool 3.
   2. The osd's in the "up"  for pg 3.12c  are: 6, 0, 12.


I suggest to check on this 'activating' issue do the following:

   1. What is the rule that pool 3 should follow, 'hybrid', 'nvme' or
   'hdd'? (Use the *ceph osd pool ls detail* command and look at pool 3's
   crush rule)
   2. Then check are osds 6, 0, 12 backed by nvme's or hdd's? (Use *ceph
   osd tree | grep nvme *command to find your nvme backed osds.)


If your problem is similar to mine, you will have OSDs that are NVMe-backed
in a pool that should only be backed by HDDs, which causes a pg to go into
the 'activating' state and stay there.

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-26 Thread Thomas Bennett
Hi Peter,

Just to check if your problem is similar to mine:

   - Do you have any pools that follow a crush rule to only use osds that
   are backed by hdds (i.e not nvmes)?
   - Do these pools obey that rule? i.e do they maybe have pgs that are on
   nvmes?

Regards,
Tom

On Fri, Jan 26, 2018 at 11:48 AM, Peter Linder <peter.lin...@fiberdirekt.se>
wrote:

> Hi Thomas,
>
> No, we haven't gotten any closer to resolving this, in fact we had another
> issue again when we added a new nvme drive to our nvme servers (storage11,
> storage12 and storage13) that had weight 1.7 instead of the usual 0.728
> size. This (see below) is what a nvme and hdd server pair at a site looks
> like, and it broke when adding osd.10 (adding the nvme drive to storage12
> and storage13 worked, it failed when adding the last one to storage11).
> Changing osd.10's weight to 1.0 instead and recompiling crushmap allowed
> all PGs to activate.
>
> Unfortunately this is a production cluster that we were hoping to expand
> as needed, so if there is a problem we quickly have to revert to the last
> working crushmap, so no time to debug :(
>
> We are currently building a copy of the environment though virtualized and
> I hope that we will be able to re-create the issue there as we will be able
> to break it at will :)
>
>
> host storage11 {
> id -5   # do not change unnecessarily
> id -6 class nvme# do not change unnecessarily
> id -10 class hdd# do not change unnecessarily
> # weight 4.612
> alg straw2
> hash 0  # rjenkins1
> item osd.0 weight 0.728
> item osd.3 weight 0.728
> item osd.6 weight 0.728
> item osd.7 weight 0.728
> item osd.10 weight 1.700
> }
> host storage21 {
> id -13  # do not change unnecessarily
> id -14 class nvme   # do not change unnecessarily
> id -15 class hdd# do not change unnecessarily
> # weight 65.496
> alg straw2
> hash 0  # rjenkins1
> item osd.12 weight 5.458
> item osd.13 weight 5.458
> item osd.14 weight 5.458
> item osd.15 weight 5.458
> item osd.16 weight 5.458
> item osd.17 weight 5.458
> item osd.18 weight 5.458
> item osd.19 weight 5.458
> item osd.20 weight 5.458
> item osd.21 weight 5.458
> item osd.22 weight 5.458
> item osd.23 weight 5.458
> }
>
>
> Den 2018-01-26 kl. 08:45, skrev Thomas Bennett:
>
> Hi Peter,
>
> Not sure if you have got to the bottom of your problem,  but I seem to
> have found what might be a similar problem. I recommend reading below,  as
> there could be a potential hidden problem.
>
> Yesterday our cluster went into *HEALTH_WARN* state and I noticed that
> one of my pg's was listed as '*activating*' and marked as '*inactive*'
> and '*unclean*'.
>
> We also have a mixed OSD system - 768 HDDs and 16 NVMEs with three crush
> rules for object placement: the default *replicated_rule* (I never
> deleted it) and then two new ones for *replicate_rule_hdd* and
> *replicate_rule_nvme.*
>
> Running a query on the pg (in my case pg 15.792) did not yield anything
> out of place, except for it telling me that that it's state was '
> *activating*' (that's not even a pg state: pg states
> <http://docs.ceph.com/docs/master/rados/operations/pg-states/>) and made
> me slightly alarmed.
>
> The bits of information that alerted me to the issue where:
>
> 1. Running 'ceph pg dump' and finding the 'activating' pg showed the
> following information:
>
> 15.792 activating [4,724,242] #for pool 15 pg there are osds 4,724,242
>
>
> 2. Running 'ceph osd tree | grep 'osd.4 ' and getting the following
> information:
>
> 4 nvme osd.4
>
> 3. Now checking what pool 15 is by running 'ceph osd pool ls detail':
>
> pool 15 'default.rgw.data' replicated size 3 min_size 2 crush_rule 1
>
>
> These three bits of information made me realise what was going on:
>
>- OSD 4,724,242 are all nvmes
>- Pool 15 should obey crush_rule 1 (*replicate_rule_hdd)*
>- Pool 15 has pgs that use nvmes!
>
> I found the following really useful tool online which showed me the depth
> of the problem: Get the Number of Placement Groups Per Osd
> <http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd>
>
> So it turns out in my case pool 15 has osds in all the nvmes!
>
> To test a fix to mimic the problem again - I executed the following
> command: 'ceph osd pg-upmap-items 15.792 4 22 724 67 76 242'
>
It remapped the osds used by the 'activating

Re: [ceph-users] Weird issues related to (large/small) weights in mixed nvme/hdd pool

2018-01-25 Thread Thomas Bennett
c3 weight 97.196
> }
>
> # rules
> rule hybrid {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take ldc
> step choose firstn 1 type datacenter
> step chooseleaf firstn 0 type hostgroup
> step emit
> }
>
>
> Ok, so there are 9 hostgroups (i changed "type 2"). Each hostgroup
> currently holds 1 server, but may in the future hold more. These are
> grouped in 3, and called a "datacenter" even though the set is spread out
> onto 3 physical data centers. These are then put in a separate root called
> "ldc".
>
> The "hybrid" rule then proceeds to select 1 datacenter, and then 3 osds
> from that datacenter. The end result is that 3 OSDs from different physical
> datacenters are selected, with 1 nvme and 2 hdd (hdds have reduced primary
> affinity to 0.00099, and yes this might be a problem?). If one datacenter
> is lost, only 1/3'rd of the nvmes are in fact offline so capacity loss is
> manageable compared to having all nvme's in one datacenter.
>
> Because nvmes are much smaller, after adding one the "datacenter" looks
> like this:
>
> item hg1-1 weight 2.911
> item hg1-2 weight 65.789
> item hg1-3 weight 65.789
>
> This causes PGs to go into "active+clean+remapped" state forever. If I
> manually change the weights so that they are all almost the same, the
> problem goes away! I would have though that the weights does not matter,
> since we have to choose 3 of these anyways. So I'm really confused over
> this.
>
> Today I also had to change
>
> item ldc1 weight 197.489
> item ldc2 weight 197.196
> item ldc3 weight 197.196
> to
> item ldc1 weight 97.489
> item ldc2 weight 97.196
> item ldc3 weight 97.196
>
> or some PGs wouldn't activate at all! I'm really not aware how the
> hashing/selection process works though, it does somehow seem that if the
> values are too far apart, things seem to break. crushtool --test seems to
> correctly calculate my PGs.
>
> Basically when this happens I just randomly change some weights and most
> of the time it starts working. Why?
>
> Regards,
> Peter
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341 <021%20506%207341>
Mobile: +27 79 5237105 <079%20523%207105>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to remove deactivated cephFS

2018-01-25 Thread Thomas Bennett
Hey Eugen,

Pleasure. Glad your problem resolved itself.

Regards,
Tom

On Wed, Jan 24, 2018 at 5:29 PM, Eugen Block <ebl...@nde.ag> wrote:

> Hi Tom,
>
> thanks for the detailed steps.
>
> I think our problem literally vanished. A couple of days after my email I
> noticed that the web interface suddenly  listed only one cephFS. Also the
> command "ceph fs status" doesn't return an error anymore but shows the
> corret output.
> I guess Ceph is indeed a self-healing storage solution! :-)
>
> Regards,
> Eugen
>
>
> Zitat von Thomas Bennett <tho...@ska.ac.za>:
>
> Hi Eugen,
>>
>> From my experiences, to truely delete and recreate  the Ceph FS *cephfs*
>>
>> file system I've done the following:
>>
>> 1. Remove the file system:
>> ceph fs rm cephfs --yes-i-really-mean-it
>> ceph fs rm_data_pool cephfs_data
>> ceph fs rm_data_pool cephfs cephfs_data
>>
>> 2. Remove the associated pools:
>> ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it
>> ceph osd pool delete cephfs_metadata
>> cephfs_metadata --yes-i-really-really-mean-it
>>
>> 3. Create a new default ceph file system:
>> ceph osd pool create cephfs_data <int[0-]> {<int[0-]>} {}
>> ceph osd pool create cephfs_metadata <int[0-]> {<int[0-]>} {}
>> ceph fs new cephfs cephfs_metadata cephfs_data
>> ceph fs set_default cephfs
>>
>> Not sure if this helps, as you may need to repeat the whole process from
>> the start.
>>
>> Regards,
>> Tom
>>
>> On Mon, Jan 8, 2018 at 2:19 PM, Eugen Block <ebl...@nde.ag> wrote:
>>
>> Hi list,
>>>
>>> all this is on Ceph 12.2.2.
>>>
>>> An existing cephFS (named "cephfs") was backed up as a tar ball, then
>>> "removed" ("ceph fs rm cephfs --yes-i-really-mean-it"), a new one created
>>> ("ceph fs new cephfs cephfs-metadata cephfs-data") and the content
>>> restored
>>> from the tar ball. According to the output of "ceph fs rm",  the old
>>> cephFS
>>> has only been deactivated, not deleted.  Looking at the Ceph manager's
>>> web
>>> interface, it now lists two entries "cephfs", one with id 0 (the "old"
>>> FS)
>>> and id "1" (the currently active FS).
>>>
>>> When we try to run "ceph fs status", we get an error with a traceback:
>>>
>>> ---cut here---
>>> ceph3:~ # ceph fs status
>>> Error EINVAL: Traceback (most recent call last):
>>>   File "/usr/lib64/ceph/mgr/status/module.py", line 301, in
>>> handle_command
>>> return self.handle_fs_status(cmd)
>>>   File "/usr/lib64/ceph/mgr/status/module.py", line 219, in
>>> handle_fs_status
>>> stats = pool_stats[pool_id]
>>> KeyError: (29L,)
>>> ---cut here---
>>>
>>> while this works:
>>>
>>> ---cut here---
>>> ceph3:~ # ceph fs ls
>>> name: cephfs, metadata pool: cephfs-metadata, data pools: [cephfs-data ]
>>> ---cut here---
>>>
>>> We see the new id 1 when we run
>>>
>>> ---cut here---
>>> ceph3:~ #  ceph fs get cephfs
>>> Filesystem 'cephfs' (1)
>>> fs_name cephfs
>>> [...]
>>> data_pools  [35]
>>> metadata_pool   36
>>> inline_data disabled
>>> balancer
>>> standby_count_wanted1
>>> [...]
>>> ---cut here---
>>>
>>> The new FS seems to work properly and can be mounted from the clients,
>>> just like before removing and rebuilding it. I'm not sure which other
>>> commands would fail with this traceback, for now "ceph fs status" is the
>>> only one.
>>>
>>> So it seems that having one deactivated cephFS has an impact on some of
>>> the functions/commands. Is there any way to remove it properly? Most of
>>> the
>>> commands work with the name, not the id of the FS, so it's difficult to
>>> access the data from the old FS. Has anyone some insights on how to clean
>>> this up?
>>>
>>> Regards,
>>> Eugen
>>>
>>> --
>>> Eugen Block voice   : +49-40-559 51 75
>>> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
>>> Postfach 61 03 15
>>> D-22423 Hamburg     e-mail  : ebl...@nde.ag
>>>
>>> Vorsitzende des Aufsi

Re: [ceph-users] How to remove deactivated cephFS

2018-01-24 Thread Thomas Bennett
Hi Eugen,

From my experience, to truly delete and recreate the Ceph FS *cephfs*
file system I've done the following:

1. Remove the file system:
ceph fs rm cephfs --yes-i-really-mean-it
ceph fs rm_data_pool cephfs cephfs_data

2. Remove the associated pools:
ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it
ceph osd pool delete cephfs_metadata
cephfs_metadata --yes-i-really-really-mean-it

3. Create a new default ceph file system:
ceph osd pool create cephfs_data <int[0-]> {<int[0-]>} {}
ceph osd pool create cephfs_metadata <int[0-]> {<int[0-]>} {}
ceph fs new cephfs cephfs_metadata cephfs_data
ceph fs set_default cephfs
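
As a concrete example of step 3 with made-up PG counts (size these for your own
cluster):

ceph osd pool create cephfs_data 128 128
ceph osd pool create cephfs_metadata 32 32
ceph fs new cephfs cephfs_metadata cephfs_data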

Not sure if this helps, as you may need to repeat the whole process from
the start.

Regards,
Tom

On Mon, Jan 8, 2018 at 2:19 PM, Eugen Block <ebl...@nde.ag> wrote:

> Hi list,
>
> all this is on Ceph 12.2.2.
>
> An existing cephFS (named "cephfs") was backed up as a tar ball, then
> "removed" ("ceph fs rm cephfs --yes-i-really-mean-it"), a new one created
> ("ceph fs new cephfs cephfs-metadata cephfs-data") and the content restored
> from the tar ball. According to the output of "ceph fs rm",  the old cephFS
> has only been deactivated, not deleted.  Looking at the Ceph manager's web
> interface, it now lists two entries "cephfs", one with id 0 (the "old" FS)
> and id "1" (the currently active FS).
>
> When we try to run "ceph fs status", we get an error with a traceback:
>
> ---cut here---
> ceph3:~ # ceph fs status
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/lib64/ceph/mgr/status/module.py", line 301, in handle_command
> return self.handle_fs_status(cmd)
>   File "/usr/lib64/ceph/mgr/status/module.py", line 219, in
> handle_fs_status
> stats = pool_stats[pool_id]
> KeyError: (29L,)
> ---cut here---
>
> while this works:
>
> ---cut here---
> ceph3:~ # ceph fs ls
> name: cephfs, metadata pool: cephfs-metadata, data pools: [cephfs-data ]
> ---cut here---
>
> We see the new id 1 when we run
>
> ---cut here---
> ceph3:~ #  ceph fs get cephfs
> Filesystem 'cephfs' (1)
> fs_name cephfs
> [...]
> data_pools  [35]
> metadata_pool   36
> inline_data disabled
> balancer
> standby_count_wanted1
> [...]
> ---cut here---
>
> The new FS seems to work properly and can be mounted from the clients,
> just like before removing and rebuilding it. I'm not sure which other
> commands would fail with this traceback, for now "ceph fs status" is the
> only one.
>
> So it seems that having one deactivated cephFS has an impact on some of
> the functions/commands. Is there any way to remove it properly? Most of the
> commands work with the name, not the id of the FS, so it's difficult to
> access the data from the old FS. Has anyone some insights on how to clean
> this up?
>
> Regards,
> Eugen
>
> --
> Eugen Block voice   : +49-40-559 51 75
> NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
> Postfach 61 03 15
> D-22423 Hamburg e-mail  : ebl...@nde.ag
>
> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>   Sitz und Registergericht: Hamburg, HRB 90934
>   Vorstand: Jens-U. Mozdzen
>USt-IdNr. DE 814 013 983
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unable to join additional mon servers (luminous)

2018-01-11 Thread Thomas Gebhardt
Hello,

I'm running a ceph-12.2.2 cluster on debian/stretch with three mon
servers and am unsuccessfully trying to add another (or two additional) mon
servers. While the new mon server stays in the "synchronizing" state, the
old mon servers fall out of quorum, endlessly changing state from "peon"
to "electing" or "probing", and eventually back to "peon" or "leader".

On a small test cluster everything works as expected and the new mons
painlessly join the cluster. But on my production cluster I always run
into trouble, both with ceph-deploy and with manual intervention. Probably
I'm missing some fundamental factor. Maybe someone can give me a hint?
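
Is there anything else I should be checking beyond the joining mon's admin
socket state, e.g.

ceph daemon mon.my-ceph-mon-1 mon_status    # run on the new mon host

and would lowering mon sync max payload size (e.g. to 4096 in the [mon]
section) help with the repeated elections?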

These are the existing mons:

my-ceph-mon-3: IP AAA.BBB.CCC.23
my-ceph-mon-4: IP AAA.BBB.CCC.24
my-ceph-mon-5: IP AAA.BBB.CCC.25

Trying to add

my-ceph-mon-1: IP AAA.BBB.CCC.31

Here is a (hopefully) relevant and representative part of the logs on
my-ceph-mon-5 when my-ceph-mon-1 tries to join:

2018-01-11 15:16:08.340741 7f69ba8db700  0
mon.my-ceph-mon-5@2(peon).data_health(6128) update_stats avail 57% total
19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:16:16.830566 7f69b48cf700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1
existing_state=STATE_STANDBY
2018-01-11 15:16:16.830582 7f69b48cf700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept peer reset, then tried to connect to us,
replacing
2018-01-11 15:16:16.831864 7f69b80d6700  1 mon.my-ceph-mon-5@2(peon) e15
 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:16.833701 7f69b50d0700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1
existing_state=STATE_STANDBY
2018-01-11 15:16:16.833713 7f69b50d0700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept peer reset, then tried to connect to us,
replacing
2018-01-11 15:16:16.834843 7f69b80d6700  1 mon.my-ceph-mon-5@2(peon) e15
 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:35.907962 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos active c 9653210..9653763)
lease_timeout -- calling new election
2018-01-11 15:16:35.908589 7f69b80d6700  0 mon.my-ceph-mon-5@2(probing)
e15 handle_command mon_command({"prefix": "status"} v 0) v1
2018-01-11 15:16:35.908630 7f69b80d6700  0 log_channel(audit) log [DBG]
: from='client.? 172.25.24.15:0/1078983440' entity='client.admin'
cmd=[{"prefix": "status"}]: dispatch
2018-01-11 15:16:35.909124 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:16:35.909284 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6128) init, last seen epoch 6128
2018-01-11 15:16:50.132414 7f69ba8db700  1
mon.my-ceph-mon-5@2(electing).elector(6129) init, last seen epoch 6129,
mid-election, bumping
2018-01-11 15:16:55.209177 7f69b80d6700 -1
mon.my-ceph-mon-5@2(peon).paxos(paxos recovering c 9653210..9653777)
lease_expire from mon.0 AAA.BBB.CCC.23:6789/0 is 0.032801 seconds in the
past; mons are probably laggy (or possibly clocks are too skewed)
2018-01-11 15:17:09.316472 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653778)
lease_timeout -- calling new election
2018-01-11 15:17:09.316597 7f69ba8db700  0
mon.my-ceph-mon-5@2(probing).data_health(6134) update_stats avail 57%
total 19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:17:09.317414 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:09.317517 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6134) init, last seen epoch 6134
2018-01-11 15:17:22.059573 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653779)
lease_timeout -- calling new election
2018-01-11 15:17:22.060021 7f69b80d6700  1
mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not
in quorum -- drop message
2018-01-11 15:17:22.060279 7f69b80d6700  1
mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not
in quorum -- drop message
2018-01-11 15:17:22.060499 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:22.060612 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6138) init, last seen epoch 6138
...

As far as I can see clock skew is not a problem (tested with "ntpq -p").

Any idea what might go wrong?

Thanks, Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD crashing on 12.2.0. Bluestore / EC pool / rbd

2017-09-06 Thread Thomas Coelho
Hi,

I have the same problem. A bug [1] is reported since months, but
unfortunately this is not fixed yet. I hope, if more people are having
this problem the developers can reproduce and fix it.

I was using Kernel-RBD with a Cache Tier.

so long
Thomas Coelho

[1] http://tracker.ceph.com/issues/20222


Am 06.09.2017 um 15:41 schrieb Henrik Korkuc:
> On 17-09-06 16:24, Jean-Francois Nadeau wrote:
>> Hi, 
>>
>> On a 4 node / 48 OSDs Luminous cluster Im giving a try at RBD on EC
>> pools + Bluestore.  
>>
>> Setup went fine but after a few bench runs several OSD are failing and
>> many wont even restart.
>>
>> ceph osd erasure-code-profile set myprofile \
>>k=2\
>>m=1 \
>>crush-failure-domain=host
>> ceph osd pool create mypool 1024 1024 erasure myprofile   
>> ceph osd pool set mypool allow_ec_overwrites true
>> rbd pool init mypool
>> ceph -s
>> ceph health detail
>> ceph osd pool create metapool 1024 1024 replicated
>> rbd create --size 1024G --data-pool mypool --image metapool/test1
>> rbd bench -p metapool test1 --io-type write --io-size 8192
>> --io-pattern rand --io-total 10G
>> ...
>>
>>
>> One of many OSD failing logs
>>
>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs systemd[1]:
>> Started Ceph object storage daemon osd.12.
>> Sep 05 17:02:54 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> starting osd.12 at - osd_data /var/lib/ceph/osd/ceph-12
>> /var/lib/ceph/osd/ceph-12/journal
>> Sep 05 17:02:56 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> 2017-09-05 17:02:56.627301 7fe1a2e42d00 -1 osd.12 2219 log_to_monitors
>> {default=true}
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> 2017-09-05 17:02:58.686723 7fe1871ac700 -1
>> bluestore(/var/lib/ceph/osd/ceph-12) _txc_add_transac
>> tion error (2) No such file or directory not handled on operation 15
>> (op 0, counting from 0)
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> 2017-09-05 17:02:58.686742 7fe1871ac700 -1
>> bluestore(/var/lib/ceph/osd/ceph-12) unexpected error
>>  code
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/
>> centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc:
>> In function 'void BlueStore::_txc_add_transaction(Blu
>> eStore::TransContext*, ObjectStore::Transaction*)' thread 7fe1871ac700
>> time 2017-09-05 17:02:58.686821
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/
>> centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.0/rpm/el7/BUILD/ceph-12.2.0/src/os/bluestore/BlueStore.cc:
>> 9282: FAILED assert(0 == "unexpected error")
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]:
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
>> luminous (rc)
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 1:
>> (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x110) [0x7fe1a38bf510]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 2:
>> (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>> ObjectStore::Transaction*)+0x1487)
>>  [0x7fe1a3796057]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 3:
>> (BlueStore::queue_transactions(ObjectStore::Sequencer*,
>> std::vector<ObjectStore::Transaction,
>>  std::allocator >&,
>> boost::intrusive_ptr, ThreadPool::TPHandle*)+0x3a0)
>> [0x7fe1a37970a0]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 4:
>> (PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, 
>> std::allocator> Store::Transaction> >&, boost::intrusive_ptr)+0x65)
>> [0x7fe1a3508745]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 5:
>> (ECBackend::handle_sub_write(pg_shard_t,
>> boost::intrusive_ptr, ECSubWrite&, ZTrace
>> r::Trace const&, Context*)+0x631) [0x7fe1a3628711]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 6:
>> (ECBackend::_handle_message(boost::intrusive_ptr)+0x327)
>> [0x7fe1a36392b7]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.cloudsys.tmcs ceph-osd[4775]: 7:
>> (PGBackend::handle_message(boost::intrusive_ptr)+0x50)
>> [0x7fe1a353da10]
>> Sep 05 17:02:58 r72-k7-06-01.k8s.ash1.clou

Re: [ceph-users] osd heartbeat protocol issue on upgrade v12.1.0 ->v12.2.0

2017-09-01 Thread Thomas Gebhardt
Hello,

thank you very much for the hint, you are right!

Kind regards, Thomas

Marc Roos wrote on 30.08.2017 at 14:26:
>  
> I had this also once. If you update all nodes and then systemctl restart 
> 'ceph-osd@*' on all nodes, you should be fine. But first the monitors of 
> course
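For the archive, the sequence that resolved it here would look roughly like this (a sketch, assuming systemd-managed daemons and that the v12.2.0 packages are already installed on all nodes):

# on every monitor node first
systemctl restart ceph-mon.target
# then on every OSD node
systemctl restart 'ceph-osd@*'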
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd heartbeat protocol issue on upgrade v12.1.0 ->v12.2.0

2017-08-30 Thread Thomas Gebhardt
Hello,

when I upgraded a single osd node (only one so far) from v12.1.0 -> v12.2.0, its osds
started flapping and finally all got marked as down.

As far as I can see, this is due to an incompatibility of the osd
heartbeat protocol between the two versions:

v12.2.0 node:
7f4f7b6e6700 -1 osd.X 3879 heartbeat_check: no reply from x.x.x.x:
osd.Y ever on either front or back, first ping sent ...

v12.1.0 node:
7fd854ebf700 -1 failed to decode message of type 70 v4:
buffer::malformed_input: void
osd_peer_stat_t::decode(ceph::buffer::list::iterator&) no longer
understand old encoding version 1 < struct_compat

( it is puzzling that the *older* v12.1.0 node complains about the *old*
encoding version of the *newer* v12.2.0 node.)

Any idea how I can go ahead?

Kind regards, Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reaching aio-max-nr on Ubuntu 16.04 with Luminous

2017-08-30 Thread Thomas Bennett
Hi Dan,

Great! Thanks for the feedback. Much appreciated.

For completeness, here is my ansible role:

- name: Increase aio-max-nr for bluestore
  sysctl:
    name: fs.aio-max-nr
    value: 1048576
    sysctl_file: /etc/sysctl.d/ceph-tuning.conf
    sysctl_set: yes

Cheers,
Tom

On Wed, Aug 30, 2017 at 10:49 AM, Dan van der Ster <d...@vanderster.com>
wrote:

> Hi Thomas,
>
> Yes we set it to a million.
> From our puppet manifest:
>
>   # need to increase aio-max-nr to allow many bluestore devs
>   sysctl { 'fs.aio-max-nr': val  => '1048576' }
>
> Cheers, Dan
>
>
> On Aug 30, 2017 9:53 AM, "Thomas Bennett" <tho...@ska.ac.za> wrote:
> >
> > Hi,
> >
> > I've been testing out Luminous and I've noticed that at some point the
> number of osds per node was limited by aio-max-nr. By default it's set to
> 65536 in Ubuntu 16.04.
> >
> > Has anyone else experienced this issue?
> >
> > fs.aio-nr currently sitting at 196608 with 48 osds.
> >
> > I have 48 OSDs per node so I've added fs.aio-max-nr = 262144 in my
> /etc/sysctl.d/ceph-tuning.conf file.
> >
> > Cheers,
> > Tom
> >
> > _______
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Reaching aio-max-nr on Ubuntu 16.04 with Luminous

2017-08-30 Thread Thomas Bennett
Hi,

I've been testing out Luminous and I've noticed that at some point the
number of osds per node was limited by aio-max-nr. By default it's set to
65536 in Ubuntu 16.04.

Has anyone else experienced this issue?

fs.aio-nr currently sitting at 196608 with 48 osds.

I have 48 OSDs per node so I've added fs.aio-max-nr = 262144 in my
/etc/sysctl.d/ceph-tuning.conf file.
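A minimal sketch of checking and raising the limit by hand, using the same file name and value as above:

# current usage vs. limit
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr
# persist the higher limit and apply it immediately
echo 'fs.aio-max-nr = 262144' > /etc/sysctl.d/ceph-tuning.conf
sysctl -p /etc/sysctl.d/ceph-tuning.conf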

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MON daemons fail after creating bluestore osd with block.db partition (luminous 12.1.0-1~bpo90+1 )

2017-07-10 Thread Thomas Gebhardt
Hello,

Thomas Gebhardt wrote on 07.07.2017 at 17:21:
> ( e.g.,
> ceph-deploy osd create --bluestore --block-db=/dev/nvme0bnp1 node1:/dev/sdi
> )

just noticed that there was a typo in the block-db device name
(/dev/nvme0bnp1 -> /dev/nvme0n1p1). After fixing that misspelling, my
cookbook worked fine and the mons are running.
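For completeness, the corrected command from the original post would then read:

ceph-deploy osd create --bluestore --block-db=/dev/nvme0n1p1 node1:/dev/sdi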

Kind regards, Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MON daemons fail after creating bluestore osd with block.db partition (luminous 12.1.0-1~bpo90+1 )

2017-07-07 Thread Thomas Gebhardt
Hello,

just testing the latest luminous rc packages on debian stretch with
bluestore OSDs.

OSDs without a separate block.db partition do fine.

But when I try to create an OSD with a separate block.db partition:

( e.g.,
ceph-deploy osd create --bluestore --block-db=/dev/nvme0bnp1 node1:/dev/sdi
)

then all MON daemons fail and the cluster stops running
(cf. appended journalctl logs)

Do you have any idea how to narrow down the problem? (objdump?)

(please note that I faked /etc/debian_version to jessie, since
ceph-deploy 1.5.38 from
https://download.ceph.com/debian-luminous/dists/stretch/
does not yet support stretch - but I suppose that's not related to my
problem).

Kind regards, Thomas
Jul 07 09:58:54 node1 systemd[1]: Started Ceph cluster monitor daemon.
Jul 07 09:58:54 node1 ceph-mon[550]: starting mon.node1 rank 0 at 
x.x.x.x:6789/0 mon_data /var/lib/ceph/mon/ceph-node1 fsid 
1e50b861-c10f-4356-9af6-3a90441ee694
Jul 07 16:38:44 node1 ceph-mon[550]: /build/ceph-12.1.0/src/mon/OSDMonitor.cc: 
In function 'void OSDMonitor::check_pg_creates_subs()' thread 7f678fe61700 time 
2017-07-07 16:38:44.576052
Jul 07 16:38:44 node1 ceph-mon[550]: /build/ceph-12.1.0/src/mon/OSDMonitor.cc: 
2977: FAILED assert(osdmap.get_up_osd_features() & 
CEPH_FEATURE_MON_STATEFUL_SUB)
Jul 07 16:38:44 node1 ceph-mon[550]:  ceph version 12.1.0 
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
Jul 07 16:38:44 node1 ceph-mon[550]:  1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x102) [0x55fc324c8802]
Jul 07 16:38:44 node1 ceph-mon[550]:  2: (()+0x474ed0) [0x55fc323c8ed0]
Jul 07 16:38:44 node1 ceph-mon[550]:  3: 
(OSDMonitor::update_from_paxos(bool*)+0x1a4d) [0x55fc323f168d]
Jul 07 16:38:44 node1 ceph-mon[550]:  4: (PaxosService::refresh(bool*)+0x3ff) 
[0x55fc323b673f]
Jul 07 16:38:44 node1 ceph-mon[550]:  5: 
(Monitor::refresh_from_paxos(bool*)+0x1a3) [0x55fc322767e3]
Jul 07 16:38:44 node1 ceph-mon[550]:  6: (Paxos::do_refresh()+0x47) 
[0x55fc323a0227]
Jul 07 16:38:44 node1 ceph-mon[550]:  7: (Paxos::commit_finish()+0x703) 
[0x55fc323b1ad3]
Jul 07 16:38:44 node1 ceph-mon[550]:  8: (C_Committed::finish(int)+0x2b) 
[0x55fc323b55bb]
Jul 07 16:38:44 node1 ceph-mon[550]:  9: (Context::complete(int)+0x9) 
[0x55fc322b2c19]
Jul 07 16:38:44 node1 ceph-mon[550]:  10: 
(MonitorDBStore::C_DoTransaction::finish(int)+0xa0) [0x55fc323b33c0]
Jul 07 16:38:44 node1 ceph-mon[550]:  11: (Context::complete(int)+0x9) 
[0x55fc322b2c19]
Jul 07 16:38:44 node1 ceph-mon[550]:  12: 
(Finisher::finisher_thread_entry()+0x4c0) [0x55fc324c7950]
Jul 07 16:38:44 node1 ceph-mon[550]:  13: (()+0x7494) [0x7f679dd95494]
Jul 07 16:38:44 node1 ceph-mon[550]:  14: (clone()+0x3f) [0x7f679b27baff]
Jul 07 16:38:44 node1 ceph-mon[550]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jul 07 16:38:44 node1 ceph-mon[550]: 2017-07-07 16:38:44.581790 7f678fe61700 -1 
/build/ceph-12.1.0/src/mon/OSDMonitor.cc: In function 'void 
OSDMonitor::check_pg_creates_subs()' thread 7f678f
Jul 07 16:38:44 node1 ceph-mon[550]: /build/ceph-12.1.0/src/mon/OSDMonitor.cc: 
2977: FAILED assert(osdmap.get_up_osd_features() & 
CEPH_FEATURE_MON_STATEFUL_SUB)
Jul 07 16:38:44 node1 ceph-mon[550]:  ceph version 12.1.0 
(262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
Jul 07 16:38:44 node1 ceph-mon[550]:  1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x102) [0x55fc324c8802]
Jul 07 16:38:44 node1 ceph-mon[550]:  2: (()+0x474ed0) [0x55fc323c8ed0]
Jul 07 16:38:44 node1 ceph-mon[550]:  3: 
(OSDMonitor::update_from_paxos(bool*)+0x1a4d) [0x55fc323f168d]
Jul 07 16:38:44 node1 ceph-mon[550]:  4: (PaxosService::refresh(bool*)+0x3ff) 
[0x55fc323b673f]
Jul 07 16:38:44 node1 ceph-mon[550]:  5: 
(Monitor::refresh_from_paxos(bool*)+0x1a3) [0x55fc322767e3]
Jul 07 16:38:44 node1 ceph-mon[550]:  6: (Paxos::do_refresh()+0x47) 
[0x55fc323a0227]
Jul 07 16:38:44 node1 ceph-mon[550]:  7: (Paxos::commit_finish()+0x703) 
[0x55fc323b1ad3]
Jul 07 16:38:44 node1 ceph-mon[550]:  8: (C_Committed::finish(int)+0x2b) 
[0x55fc323b55bb]
Jul 07 16:38:44 node1 ceph-mon[550]:  9: (Context::complete(int)+0x9) 
[0x55fc322b2c19]
Jul 07 16:38:44 node1 ceph-mon[550]:  10: 
(MonitorDBStore::C_DoTransaction::finish(int)+0xa0) [0x55fc323b33c0]
Jul 07 16:38:44 node1 ceph-mon[550]:  11: (Context::complete(int)+0x9) 
[0x55fc322b2c19]
Jul 07 16:38:44 node1 ceph-mon[550]:  12: 
(Finisher::finisher_thread_entry()+0x4c0) [0x55fc324c7950]
Jul 07 16:38:44 node1 ceph-mon[550]:  13: (()+0x7494) [0x7f679dd95494]
Jul 07 16:38:44 node1 ceph-mon[550]:  14: (clone()+0x3f) [0x7f679b27baff]
Jul 07 16:38:44 node1 ceph-mon[550]:  NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jul 07 16:38:44 node1 ceph-mon[550]:  0> 2017-07-07 16:38:44.581790 
7f678fe61700 -1 /build/ceph-12.1.0/src/mon/OSDMonitor.cc: In function 'void 
OSDMonitor::check_pg_creates_subs()' threa
Jul 07 16:38:44 node1 ceph-mon[550]: /build/ceph-12.

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-12-04 Thread Thomas Danan
Hi Nick,

We have recently increased osd op threads from 2 (the default value) to 16 because
CPU usage on the data nodes was very low.
We have the impression it has increased overall ceph cluster performance and
reduced the occurrence of blocked ops.

I don't think this is the end of our issue, but it seems to have helped limit its
impact.

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 23 novembre 2016 14:09
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

I’m afraid I can’t offer any more advice; there isn’t anything that I can see
which could be the trigger. I know we spoke about downgrading the kernel, did
you manage to try that?

Nick

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 23 November 2016 11:29
To: n...@fisk.me.uk; 'Peter Maloney' <peter.malo...@brockmann-consult.de>
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi all,

Still not able to find any explanation for this issue.

I recently tested the network and I am seeing some retransmits in the
output of iperf, but overall the bandwidth during a 10-second test is around 7 to
8 Gbps.
I was not sure whether it was the test itself that was overloading the
network or whether my network switches were having an issue.
The switches have been checked and they are showing no congestion issues or other
errors.

I really don’t know what to check or test; any idea is more than welcome …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in the IST TZ and I have filtered the logs to keep
only the last 2 hours during which we can observe the issue.

In that particular case, the issue is illustrated with the following OSDs:

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my
environment I have no snapshots (they were actually all deleted). Also, I was
told those messages were not necessarily related to object replication to a
snapshot image.
Each time I have a slow request message it is formatted as described, with a 4MB
object and a snapc context.

Rados df is showing me that I have 4 cloned objects; I do not understand why.

15 minutes later, it seems the ops are unblocked after an 'initiating reconnect' message:

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t c

Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?

2016-12-02 Thread Thomas Bennett
Hi Steve and Kate,

Thanks again for the great suggestions.

Increasing the allocsize did not help us in the situation my current testing
relates to (poor read performance). However, allocsize is a great parameter
for overall performance tuning and I intend to use it. :)

After discussion with colleagues and reading this article on the ubuntu drive io
scheduler
<http://askubuntu.com/questions/784442/why-does-ubuntu-16-04-set-all-drive-io-schedulers-to-deadline>,
I decided to try out the cfq io scheduler - ubuntu now defaults to deadline.

This made a significant difference - it actually doubled the overall read
performance.

I suggest anyone using ubuntu 14.04 or higher and high density osd nodes
(we have 48 osds per osd node) might like to test out cfq. It's also a
pretty easy test to perform :) and can be done on the fly.
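A sketch of trying it on the fly for a single drive (the device name is a placeholder; making the change persistent needs a udev rule or kernel boot parameter):

# the current scheduler is shown in brackets, e.g. noop [deadline] cfq
cat /sys/block/sdX/queue/scheduler
echo cfq > /sys/block/sdX/queue/scheduler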

Cheers,
Tom

On Wed, Nov 30, 2016 at 5:50 PM, Steve Taylor <steve.tay...@storagecraft.com
> wrote:

> We’re using Ubuntu 14.04 on x86_64. We just added ‘osd mount options xfs =
> rw,noatime,inode64,allocsize=1m’ to the [osd] section of our ceph.conf so
> XFS allocates 1M blocks for new files. That only affected new files, so
> manual defragmentation was still necessary to clean up older data, but once
> that was done everything got better and stayed better.
>
>
>
> You can use the xfs_db command to check fragmentation on an XFS volume and
> xfs_fsr to perform a defragmentation. The defragmentation can run on a
> mounted filesystem too, so you don’t even have to rely on Ceph to avoid
> downtime. I probably wouldn’t run it everywhere at once though for
> performance reasons. A single OSD at a time would be ideal, but that’s a
> matter of preference.
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Thomas Bennett
> *Sent:* Wednesday, November 30, 2016 5:58 AM
>
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Is there a setting on Ceph that we can use to
> fix the minimum read size?
>
>
>
> Hi Kate and Steve,
>
>
>
> Thanks for the replies. Always good to hear back from a community :)
>
>
>
> I'm using Linux on x86_64 architecture and the block size is limited to
> the page size which is 4k. So it looks like I'm hitting hard limits in any
> changes. to increase the block size.
>
>
>
> I found this out by running the following command:
>
>
>
> $ mkfs.xfs -f -b size=8192 /dev/sda1
>
>
>
> $ mount -v /dev/sda1 /tmp/disk/
>
> mount: Function not implemented #huh???
>
>
>
> Checking out the man page:
>
>
>
> $ man mkfs.xfs
>
>  -b block_size_options
>
>   ... XFS  on  Linux  currently  only  supports pagesize or smaller
> blocks.
>
>
>
> I'm hesitant to implement btrfs as its still experimental and ext4 seems
> to have the same current limitation.
>
>
>
> Our current approach is to exclude the hard drive that we're getting the
> poor read rates from our procurement process, but it would still be nice to
> find out how much control we have over how ceph-osd  daemons read from the
> drives. I may attempts a strace on an osd daemon as we read to see what the
> actual read request size is being asked to the kernel.
>
>
>
> Cheers,
>
> Tom
>
>
>
>
>
> On Tue, Nov 29, 2016 at 11:53 PM, Steve Taylor <
> steve.tay...@storagecraft.com> wrote:
>
> We configured XFS on our OSDs to use 1M blocks (our use case is RBDs with
> 1M blocks) due to massive fragmentation in our filestores a while back. We
> were having to defrag all the time and cluster performance was noticeably
> degraded. We also create and delete lots of RBD snapshots on a daily basis,
> so that likely contributed to the fragmentation as well. It’s been MUCH
> better since we switched XFS to use 1M allocations. Virtually no
> fragmentation and performance is consistently good.
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?

2016-11-30 Thread Thomas Bennett
Hi Kate and Steve,

Thanks for the replies. Always good to hear back from a community :)

I'm using Linux on x86_64 architecture and the block size is limited to the
page size, which is 4k. So it looks like I'm hitting hard limits in any
attempt to increase the block size.

I found this out by running the following command:

$ mkfs.xfs -f -b size=8192 /dev/sda1

$ mount -v /dev/sda1 /tmp/disk/
mount: Function not implemented #huh???

Checking out the man page:

$ man mkfs.xfs
 -b block_size_options
  ... XFS  on  Linux  currently  only  supports pagesize or smaller
blocks.

I'm hesitant to implement btrfs as it's still experimental, and ext4 seems to
have the same limitation at present.

Our current approach is to exclude the hard drive we're getting the
poor read rates from in our procurement process, but it would still be nice to
find out how much control we have over how ceph-osd daemons read from the
drives. I may attempt an strace on an osd daemon as we read, to see what
read request size is actually being asked of the kernel.
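Something along these lines should show the read sizes actually issued to the kernel (a sketch; the pid is a placeholder for one ceph-osd process):

# follow all threads of one OSD and print read/pread calls with their sizes
strace -f -e trace=read,pread64 -p <ceph-osd-pid>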

Cheers,
Tom


On Tue, Nov 29, 2016 at 11:53 PM, Steve Taylor <
steve.tay...@storagecraft.com> wrote:

> We configured XFS on our OSDs to use 1M blocks (our use case is RBDs with
> 1M blocks) due to massive fragmentation in our filestores a while back. We
> were having to defrag all the time and cluster performance was noticeably
> degraded. We also create and delete lots of RBD snapshots on a daily basis,
> so that likely contributed to the fragmentation as well. It’s been MUCH
> better since we switched XFS to use 1M allocations. Virtually no
> fragmentation and performance is consistently good.
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Kate Ward
> *Sent:* Tuesday, November 29, 2016 2:02 PM
> *To:* Thomas Bennett <tho...@ska.ac.za>
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Is there a setting on Ceph that we can use to
> fix the minimum read size?
>
>
>
> I have no experience with XFS, but wouldn't expect poor behaviour with it.
> I use ZFS myself and know that it would combine writes, but btrfs might be
> an option.
>
>
>
> Do you know what block size was used to create the XFS filesystem? It
> looks like 4k is the default (reasonable) with a max of 64k. Perhaps a
> larger block size will give better performance for your particular use
> case. (I use a 1M block size with ZFS.)
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-
> US/html/ch04s02.html
>
>
>
>
>
> On Tue, Nov 29, 2016 at 10:23 AM Thomas Bennett <tho...@ska.ac.za> wrote:
>
> Hi Kate,
>
>
>
> Thanks for your reply. We currently use xfs as created by ceph-deploy.
>
>
>
> What would you recommend we try?
>
>
>
> Kind regards,
>
> Tom
>
>
>
>
>
> On Tue, Nov 29, 2016 at 11:14 AM, Kate Ward <kate.w...@forestent.com>
> wrote:
>
> What filesystem do you use on the OSD? Have you considered a different
> filesystem that is better at combining requests before they get to the
> drive?
>
>
>
> k8
>
>
>
> On Tue, Nov 29, 2016 at 9:52 AM Thomas Bennett <tho...@ska.ac.za> wrote:
>
> Hi,
>
>
>
> We have a use case where we are reading 128MB objects off spinning disks.
>
>
>
> We've benchmarked a number of different hard drive and have noticed that
> for a particular hard drive, we're experiencing slow reads by comparison.
>
>
>
> This occurs when we have multiple readers (even just 2) reading objects
> off the OSD.
>
>
>
> We've recreated the effect using iozone and have noticed that once the
> record size drops to 4k, the hard drive miss behaves.
>
>
>
> Is there a setting on Ceph that we can change to fix the minimum read size
> when the ceph-osd daemon reads the object of the hard drives, to see if we
> can overcome the overall slow read rate.
>
>
>
> Cheers,
>
> Tom
>
> --
>
> <https://storagecraft.com> Steve Taylor | Senior Software Engineer | 
> StorageCraft
> Technology Corporation <https://storagecraft.com>
> 380 Data Drive Suite 300 | Draper | Utah | 84020
> Office: 801.871.2799 |
>
> --
>
> If you are not the intended recipient of this message or received it
> erroneously, please notify the sender and delete it, together with any
> attachments, and be advised that any dissemination or copying of this
> message is prohibited.
>
> --
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> --
>
> Thomas Bennett
>
>
>
> SKA South Africa
>
> Science Processing Team
>
>
>
> Office: +27 21 5067341 <+27%2021%20506%207341>
>
> Mobile: +27 79 5237105 <+27%2079%20523%207105>
>
>


-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a setting on Ceph that we can use to fix the minimum read size?

2016-11-29 Thread Thomas Bennett
Hi Kate,

Thanks for your reply. We currently use xfs as created by ceph-deploy.

What would you recommend we try?

Kind regards,
Tom


On Tue, Nov 29, 2016 at 11:14 AM, Kate Ward <kate.w...@forestent.com> wrote:

> What filesystem do you use on the OSD? Have you considered a different
> filesystem that is better at combining requests before they get to the
> drive?
>
> k8
>
> On Tue, Nov 29, 2016 at 9:52 AM Thomas Bennett <tho...@ska.ac.za> wrote:
>
>> Hi,
>>
>> We have a use case where we are reading 128MB objects off spinning disks.
>>
>> We've benchmarked a number of different hard drives and have noticed that
>> for a particular hard drive, we're experiencing slow reads by comparison.
>>
>> This occurs when we have multiple readers (even just 2) reading objects
>> off the OSD.
>>
>> We've recreated the effect using iozone and have noticed that once the
>> record size drops to 4k, the hard drive misbehaves.
>>
>> Is there a setting on Ceph that we can change to fix the minimum read
>> size when the ceph-osd daemon reads the objects off the hard drives, to see
>> if we can overcome the overall slow read rate?
>>
>> Cheers,
>> Tom
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>


-- 
Thomas Bennett

SKA South Africa
Science Processing Team

Office: +27 21 5067341
Mobile: +27 79 5237105
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance laggy (requests blocked > 32) on OpenStack

2016-11-25 Thread Thomas Danan
Hi Kévin,

I am currently having a similar issue. In my env I have around 16
Linux VMs (VMware), more or less equally loaded, accessing a 1PB ceph hammer
cluster (40 data nodes, 800 osds) through rbd.

Very often we have IO freezes on the VM xfs filesystems and we also continuously have
slow requests on osds (up to 10/20 minutes sometimes).
In my case the slow requests / blocked ops happen because the primary osd is waiting
for subops, i.e. waiting for replication to happen on the secondary osds. In my case
not all the VMs are blocked at the same time...

I still do not have an explanation, a root cause, nor a workaround.

Will keep you informed if I find something ...



Sent from my Samsung device


 Original message 
From: Kevin Olbrich 
Date: 11/25/16 19:19 (GMT+05:30)
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance laggy (requests blocked > 32) on 
OpenStack

Hi,

we are running 80 VMs using KVM in OpenStack via RBD in Ceph Jewel on a total 
of 53 disks (RAID parity already excluded).
Our nodes are using Intel P3700 DC-SSDs for journaling.

Most VMs are linux based and load is low to medium. There are also about 10 VMs 
running Windows 2012R2, two of them run remote services (terminal).

My question is: are 80 VMs hosted on 53 disks (mostly 7.2k SATA) too much? We
sometimes experience lags where nearly all servers suffer from "blocked IO > 32
seconds".

What are your experiences?

Mit freundlichen Grüßen / best regards,
Kevin Olbrich.



This electronic message contains information from Mycom which may be privileged 
or confidential. The information is intended to be for the use of the 
individual(s) or entity named above. If you are not the intended recipient, be 
aware that any disclosure, copying, distribution or any other use of the 
contents of this information is prohibited. If you have received this 
electronic message in error, please notify us by post or telephone (to the 
numbers or correspondence address above) or by email (at the email address 
above) immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: RadosGW not responding if ceph cluster in state health_error

2016-11-24 Thread Thomas


Sorry to bring this up again - any ideas? Or should I try the IRC channel?

Cheers,
Thomas

 Original Message 
Subject:RadosGW not responding if ceph cluster in state health_error
Date:   Mon, 21 Nov 2016 17:22:20 +1300
From:   Thomas <tho...@tgmedia.co.nz>
To: ceph-users@lists.ceph.com



Hi All,

I have a cluster setup with 16 OSDs on 4 nodes, standard RGW install 
with standard rgw pools, replication on those pools is set to 2 (size 2, 
min_size 1).


We've had the situation before where one node totally dropped out (so 4 
OSDs) and the cluster health was warning and rgw as well as other pools 
were working fine.


I now had a problem where we added a test pool with replication 1 (size
1, min_size 1), the node died again and 4 OSDs dropped out, resulting in
health_error and RGW not responding at all, and I'm not sure why that
would be the case.


I understand that with a pool that uses size 1 and one OSD dropping out
(unrecoverable), you'll lose pretty much all of that data, and the pool was
only set up to do some benchmarking; however, I didn't know that it would
affect the entire cluster. Restarting the radosgw service would
work; however, it wouldn't listen to requests, and it showed errors
like this in the logs:


2016-11-18 11:13:47.231827 7f0aaadb2a00 10  cannot find current period 
zonegroup using local zonegroup
2016-11-18 11:13:47.231860 7f0aaadb2a00 20 get_system_obj_state: 
rctx=0x7fffb14242c0 obj=.rgw.root:default.realm state=0x564c3fa99858 
s->prefetch_data=0
2016-11-18 11:13:47.232754 7f0aaadb2a00 10 could not read realm id: (2) 
No such file or directory

2016-11-18 11:13:47.232772 7f0aaadb2a00 10 Creating default zonegroup
2016-11-18 11:13:47.233376 7f0aaadb2a00 10 couldn't find old data 
placement pools config, setting up new ones for the zone


...

2016-11-18 11:13:47.251629 7f0aaadb2a00 10 ERROR: name default already 
in use for obj id 712c74f9-baf4-4d74-956b-022c67e4a5bb
2016-11-18 11:13:47.251631 7f0aaadb2a00 10 create_default() returned 
-EEXIST, we raced with another zonegroup creation


Full log here: http://pastebin.com/iYpiF9wP

Once we removed the pool with size = 1 via 'rados rmpool', the cluster 
started recovering and RGW served requests!
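For reference, spotting and removing such a pool would look roughly like this (a sketch; the pool name is a placeholder and the removal is irreversible):

# list pools still at size 1
ceph osd dump | grep 'size 1'
rados rmpool <testpool> <testpool> --yes-i-really-really-mean-it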


Any ideas?

Cheers,
Thomas


--

Thomas Gross
TGMEDIA Ltd.
p. +64 211 569080 | i...@tgmedia.co.nz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi David,

Actually this pg subfolder splitting has not been explored yet; I will have a look.
In our setup OSDs are never marked down, probably because of the following
settings:

mon_osd_adjust_heartbeat_grace = false
mon_osd_adjust_down_out_interval = false
mon_osd_min_down_reporters = 5
mon_osd_min_down_reports = 10

Thomas

From: David Turner [mailto:david.tur...@storagecraft.com]
Sent: mercredi 23 novembre 2016 21:27
To: n...@fisk.me.uk; Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

This thread has gotten quite large and I haven't read most of it, so I 
apologize if this is a duplicate idea/suggestion.  100% of the time our cluster 
has blocked requests and we aren't increasing pg_num, adding storage, or having 
a disk failing... it is pg subfolder splitting.  100% of the time, every time, 
this is our cause of blocked requests.  It is often accompanied by drives being 
marked down by the cluster even though the osd daemon is still running.

The settings that govern this are filestore merge threshold and filestore split 
multiple 
(http://docs.ceph.com/docs/giant/rados/configuration/filestore-config-ref/).
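A minimal sketch of what tuning those looks like in ceph.conf; the values below are illustrative only, not recommendations from this thread (as far as I understand, a subfolder is split once it holds more than roughly 16 * merge threshold * split multiple objects):

[osd]
filestore merge threshold = 40
filestore split multiple = 8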


David Turner | Cloud Operations Engineer | StorageCraft Technology 
Corporation<https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


If you are not the intended recipient of this message or received it 
erroneously, please notify the sender and delete it, together with any 
attachments, and be advised that any dissemination or copying of this message 
is prohibited.



From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Nick Fisk 
[n...@fisk.me.uk]
Sent: Wednesday, November 23, 2016 6:09 AM
To: 'Thomas Danan'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently
Hi Thomas,

I’m afraid I can’t offer any more advice; there isn’t anything that I can see
which could be the trigger. I know we spoke about downgrading the kernel, did
you manage to try that?

Nick

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 23 November 2016 11:29
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi all,

Still not able to find any explanation to this issue.

I recently tested the network and I am seeing some retransmit being done in the 
output of iperf. But overall the bandwidth durin a test of 10sec is around 7 to 
8 Gbps.
I was not sure to understand if it was the test itself who was overloading the 
network or if my network switches that were having an issue.
Switches have been checked and they are showing no congestion issues or other 
errors.

I really don’t know what to check or test, any idea is more than welcomed …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB Object with snapc context but in my 
environment I have no snapshot (actually they were all deleted). Also I was 
said those messages were not necessary related to object replication to 
snapshot image.
Each time I have a slow request message it is formatted as described with 4MB 
Object and snapc context

Rados df is showing me that I have 4 cloned objects, I do not understand w

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi Tomasz,

For some reason, I still have a snapshot context even though no snapshots
are available in my system any more. Probably because not all objects have been deleted
(I still see 4 object clones).

However, I was told that, from the logs I was showing, we are not trying to
replicate 4MB objects but just a few bytes of each object, so this is probably not related
to snapshots (cf. write_size 4194304, write 1449984~524288).
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771
Thomas

From: Tomasz Kuzemko [mailto:tom...@kuzemko.net]
Sent: jeudi 24 novembre 2016 01:42
To: Thomas Danan
Cc: n...@fisk.me.uk; Peter Maloney; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,
do you have any RBD created as a clone from another snapshot? If yes, then this
would mean you still have some protected snapshots, and the only way to get rid of
them is to flatten the cloned RBD, unprotect the snapshot and delete it.
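A sketch of that sequence with placeholder pool/image/snapshot names (flattening copies all parent data into the child image, so it can take a while and uses extra space):

rbd flatten mypool/child-image
rbd snap unprotect mypool/parent-image@snap1
rbd snap rm mypool/parent-image@snap1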

2016-11-18 12:42 GMT+01:00 Thomas Danan 
<thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>:
Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB Object with snapc context but in my 
environment I have no snapshot (actually they were all deleted). Also I was 
said those messages were not necessary related to object replication to 
snapshot image.
Each time I have a slow request message it is formatted as described with 4MB 
Object and snapc context

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later seems ops are unblocked after initiating reconnect message

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 
10.137.81.18:6840/2962227<http://10.137.81.18:6840/2962227> >> 
10.137.81.135:0/748393319<http://10.137.81.135:0/748393319> pipe(0x293bd000 
sd=35 :6840 s=0 pgs=0 cs=0 l=0 c=0x21405020).accept peer addr is really 
10.137.81.135:0/748393319<http://10.137.81.135:0/748393319> (socket is 
10.137.81.135:26749/0<http://10.137.81.135:26749/0>)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 
192.168.228.36:6841/2962227<http://192.168.228.36:6841/2962227> >> 
192.168.228.28:6805/3721728<http://192.168.228.28:6805/3721728> pipe(0x1271b000 
sd=34 :50711 s=2 pgs=647 cs=5 l=0 c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk<mailto:n...@fisk.me.uk>]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t communicate with each other.

From: ceph-users [mailto:ceph-users-boun

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-23 Thread Thomas Danan
Hi all,

Still not able to find any explanation for this issue.

I recently tested the network and I am seeing some retransmits in the
output of iperf, but overall the bandwidth during a 10-second test is around 7 to
8 Gbps.
I was not sure whether it was the test itself that was overloading the
network or whether my network switches were having an issue.
The switches have been checked and they are showing no congestion issues or other
errors.
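For reference, the kind of test run here was roughly the following (a sketch; the host name is a placeholder):

# on one storage node
iperf -s
# on another node, a 10-second run against it
iperf -c <storage-node> -t 10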

I really don’t know what to check or test; any idea is more than welcome …

Thomas

From: Thomas Danan
Sent: vendredi 18 novembre 2016 17:12
To: 'n...@fisk.me.uk'; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB Object with snapc context but in my 
environment I have no snapshot (actually they were all deleted). Also I was 
said those messages were not necessary related to object replication to 
snapshot image.
Each time I have a slow request message it is formatted as described with 4MB 
Object and snapc context

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later seems ops are unblocked after initiating reconnect message

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t communicate with each other.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have recheck the pattern when slow request are detected.

I have an example with following (primary: 411, secondary: 176, 594)
On primary slow requests detected: waiting for subops (176, 594)  during 16 
minutes 

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow 
request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: 
osd_op(client.2407558.1:206455044 rb

[ceph-users] RadosGW not responding if ceph cluster in state health_error

2016-11-20 Thread Thomas

Hi All,

I have a cluster setup with 16 OSDs on 4 nodes, standard RGW install 
with standard rgw pools, replication on those pools is set to 2 (size 2, 
min_size 1).


We've had the situation before where one node totally dropped out (so 4 
OSDs) and the cluster health was warning and rgw as well as other pools 
were working fine.


I now had a problem where we added a test pool with replication 1 (size
1, min_size 1), the node died again and 4 OSDs dropped out, resulting in
health_error and RGW not responding at all, and I'm not sure why that
would be the case.


I understand that with a pool that uses size 1 and one OSD dropping out
(unrecoverable), you'll lose pretty much all of that data, and the pool was
only set up to do some benchmarking; however, I didn't know that it would
affect the entire cluster. Restarting the radosgw service would
work; however, it wouldn't listen to requests, and it showed errors
like this in the logs:


2016-11-18 11:13:47.231827 7f0aaadb2a00 10  cannot find current period 
zonegroup using local zonegroup
2016-11-18 11:13:47.231860 7f0aaadb2a00 20 get_system_obj_state: 
rctx=0x7fffb14242c0 obj=.rgw.root:default.realm state=0x564c3fa99858 
s->prefetch_data=0
2016-11-18 11:13:47.232754 7f0aaadb2a00 10 could not read realm id: (2) 
No such file or directory

2016-11-18 11:13:47.232772 7f0aaadb2a00 10 Creating default zonegroup
2016-11-18 11:13:47.233376 7f0aaadb2a00 10 couldn't find old data 
placement pools config, setting up new ones for the zone


...

2016-11-18 11:13:47.251629 7f0aaadb2a00 10 ERROR: name default already 
in use for obj id 712c74f9-baf4-4d74-956b-022c67e4a5bb
2016-11-18 11:13:47.251631 7f0aaadb2a00 10 create_default() returned 
-EEXIST, we raced with another zonegroup creation


Full log here: http://pastebin.com/iYpiF9wP

Once we removed the pool with size = 1 via 'rados rmpool', the cluster 
started recovering and RGW served requests!


Any ideas?

Cheers,
Thomas


--

Thomas Gross
TGMEDIA Ltd.
p. +64 211 569080 | i...@tgmedia.co.nz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-18 Thread Thomas Danan
I often read that small IO writes and RBD work better with a bigger
filestore_max_sync_interval than the default value.
The default value is 5 sec and I saw many posts saying they are using 30 sec.
Also, the slow request symptom is often linked to this parameter.
My journals are 10GB (colocated with the OSD storage) and the overall client IO
write throughput is around 500MB/s at peak time.
I understand that the journal is flushed when it is half filled or when it
reaches the filestore_max_sync_interval value.
I guess I can safely give it a try, just to see the impact?
Not sure if I can change this online?
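If it helps, the parameter can be both persisted and changed at runtime, roughly like this (a sketch; 30 sec is only the value mentioned above, not a recommendation):

# ceph.conf, [osd] section, picked up at the next restart
filestore max sync interval = 30

# inject into the running OSDs without a restart
ceph tell osd.* injectargs '--filestore_max_sync_interval 30'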

Thanks

Thomas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: vendredi 18 novembre 2016 12:42
To: n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in IST TZ and I have filtered the logs to get 
only 2 last hours during which we can observe the issue.

In that particular case, issue is illustrated with the following OSDs

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, first slow request message is detected at 16:18

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB Object with snapc context but in my 
environment I have no snapshot (actually they were all deleted). Also I was 
said those messages were not necessary related to object replication to 
snapshot image.
Each time I have a slow request message it is formatted as described with 4MB 
Object and snapc context

Rados df is showing me that I have 4 cloned objects, I do not understand why.

15 minutes later seems ops are unblocked after initiating reconnect message

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t communicate with each other.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have recheck the pattern when slow request are detected.

I have an example with following (primary: 411, secondary: 176, 594)
On primary slow requests detected: waiting for subops (176, 594)  during 16 
minutes 

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-18 Thread Thomas Danan
Hi Nick,

Here are some logs. The system is in the IST TZ and I have filtered the logs to keep
only the last 2 hours during which we can observe the issue.

In that particular case, the issue is illustrated with the following OSDs:

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, the first slow request message is detected at 16:18:

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my
environment I have no snapshots (they were actually all deleted). Also, I was
told those messages were not necessarily related to object replication to a
snapshot image.
Each time I have a slow request message it is formatted as described, with a 4MB
object and a snapc context.

Rados df is showing me that I have 4 cloned objects; I do not understand why.

15 minutes later, it seems the ops are unblocked after an 'initiating reconnect' message:

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.00111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: jeudi 17 novembre 2016 11:02
To: Thomas Danan; n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSD’s going down in the Mon ceph.log file around 
that time?

480 seconds is probably far too long for a disk to be busy for, I’m wondering 
if the OSD is either dying and respawning or if you are running out of some 
type of system resource….eg TCP connections or something like that, which means 
the OSD’s can’t communicate with each other.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have rechecked the pattern when slow requests are detected.

I have an example with the following OSDs (primary: 411, secondaries: 176, 594).
On the primary, slow requests were detected, waiting for subops from (176, 594), 
for about 16 minutes:

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow 
request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: 
osd_op(client.2407558.1:206455044 rbd_data.66ea12ae8944a.001acbbc 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1257472~368640] 
0.61fe279f snapc 3fd=[3fd,3de] ondisk+write e210553) currently waiting for 
subops from 176,594

So the primary OSD has been waiting for subops since 13:21 (13:29 minus 480 seconds).

2016-11-17 13:36:33.039691 7efffd8ee700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.7:6819/3611836 pipe(0x13ffd000 sd=33 :17791 s=2 pgs=131 cs=7 l=0 
c=0x13251de0).fault, initiating reconnect
2016-11-17 13:36:39.570692 7efff6784700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.13:6858/2033854 pipe(0x17009000 sd=60 :52188 s=2 pgs=147 cs=7 l=0 
c=0x141159c0).fault, initiating reconnect

After this log entry, the ops seem to be unblocked and I no longer see the 
"currently waiting for subops from 176,594" messages.
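
One way to correlate this across the three OSDs involved is to grep their logs over the same window (a sketch; the paths assume the default /var/log/ceph layout on each OSD's host):

grep -E 'slow request|waiting for subops|initiating reconnect' /var/log/ceph/ceph-osd.411.log
grep -E 'fault|heartbeat_map|initiating reconnect' /var/log/ceph/ceph-osd.176.log
grep -E 'fault|heartbeat_map|initiating reconnect' /var/log/ceph/ceph-osd.594.log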

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-17 Thread Thomas Danan
Actually I forgot to say that the following issue describes very similar 
symptoms:

http://tracker.ceph.com/issues/9844

Thomas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: jeudi 17 novembre 2016 09:59
To: n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have rechecked the pattern when slow requests are detected.

I have an example with the following OSDs (primary: 411, secondaries: 176, 594).
On the primary, slow requests were detected, waiting for subops from (176, 594), 
for about 16 minutes:

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow 
request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: 
osd_op(client.2407558.1:206455044 rbd_data.66ea12ae8944a.001acbbc 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1257472~368640] 
0.61fe279f snapc 3fd=[3fd,3de] ondisk+write e210553) currently waiting for 
subops from 176,594

So the primary OSD has been waiting for subops since 13:21 (13:29 minus 480 seconds).

2016-11-17 13:36:33.039691 7efffd8ee700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.7:6819/3611836 pipe(0x13ffd000 sd=33 :17791 s=2 pgs=131 cs=7 l=0 
c=0x13251de0).fault, initiating reconnect
2016-11-17 13:36:39.570692 7efff6784700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.13:6858/2033854 pipe(0x17009000 sd=60 :52188 s=2 pgs=147 cs=7 l=0 
c=0x141159c0).fault, initiating reconnect

After this log entry, the ops seem to be unblocked and I no longer see the 
"currently waiting for subops from 176,594" messages.

So the primary OSD was blocked for about 15 minutes in total.


On the secondary OSD, I can see the following messages during the same period 
(but also before and after):

secondary:
2016-11-17 13:34:58.125076 7fbcc7517700 0 -- 192.168.228.7:6819/3611836 >> 
192.168.228.42:6832/2873873 pipe(0x12d2e000 sd=127 :6819 s=2 pgs=86 cs=5 l=0 
c=0xef18c00).fault with nothing to send, going to standby

In some other examples, with some DEBUG messages activated, I was also able 
to see many of the following messages on the secondary OSDs:
2016-11-15 03:53:04.298502 7ff9c434f700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7ff9bdb42700' had timed out after 15
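
That heartbeat_map line suggests the OSD op thread was stuck for longer than the internal thread timeout (15 seconds by default, which matches the "timed out after 15"). The value can be checked, and cautiously raised, at runtime (a sketch; run against the affected OSD on its own host):

ceph daemon osd.528 config get osd_op_thread_timeout        # should show 15 unless it was changed
ceph tell osd.528 injectargs '--osd_op_thread_timeout 30'   # raising it only hides the symptom; the stuck thread still needs explaining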

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 16 novembre 2016 22:13
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

Yes, I can't think of anything else at this stage. Could you maybe repost some 
dump_historic_ops output now that you have turned off snapshots? I wonder if 
it might reveal anything.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 16 November 2016 17:38
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

We have deleted all snapshots and observed the system for several hours.
From what I can see, this did not help to reduce the blocked ops and the IO 
freezes on the Ceph client side.

We have also tried to increase the PG count a little (by 8, then by 128), 
because this is something we should do anyway and we wanted to see how the 
cluster behaved.
During recovery, the number of blocked ops and their duration increased 
significantly, and the number of impacted OSDs was much higher.
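
If it is the recovery triggered by the PG increase that makes the blocked ops explode, it may be worth throttling backfill/recovery while it runs (a sketch; these were the usual knobs on hammer, to be reverted afterwards):

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'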

Don’t really know what to conclude from all of this …

Again, we have checked the disks and the network, and everything seems fine …

Thomas
From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 16 novembre 2016 14:01
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

The snapshot works by using Copy On Write. If you dirty even a 4kb section of a 
4MB object in the primary RBD, that entire 4MB object then needs to be read and 
then written into the snapshot RBD.

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 16 November 2016 12:58
To: Thomas Danan 
<thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>; 
n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Actually I was wondering, is the

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-16 Thread Thomas Danan
Hi Nick,

We have deleted all snapshots and observed the system for several hours.
From what I can see, this did not help to reduce the blocked ops and the IO 
freezes on the Ceph client side.

We have also tried to increase the PG count a little (by 8, then by 128), 
because this is something we should do anyway and we wanted to see how the 
cluster behaved.
During recovery, the number of blocked ops and their duration increased 
significantly, and the number of impacted OSDs was much higher.

Don’t really know what to conclude from all of this …

Again, we have checked the disks and the network, and everything seems fine …

Thomas
From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 16 novembre 2016 14:01
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

The snapshot works by using Copy On Write. If you dirty even a 4kb section of a 
4MB object in the primary RBD, that entire 4MB object then needs to be read and 
then written into the snapshot RBD.

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 16 November 2016 12:58
To: Thomas Danan 
<thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>; 
n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Actually I was wondering, is there any difference between a snapshot and a 
simple RBD image?
With a simple RBD image, when doing random IO we are asking the Ceph cluster to 
update one or several 4MB objects, no?
So snapshotting multiplies the load by 2 but not more, am I wrong?
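
A rough back-of-the-envelope, if I understand the copy-on-write mechanism correctly (assuming 4 kB random writes, 4 MB objects and size 3, so the numbers are only illustrative):

- without a snapshot: a 4 kB client write costs roughly 3 x 4 kB of backend writes (one copy per replica);
- first write to an object after a snapshot: each replica has to read the whole 4 MB object and rewrite it into the clone before applying the 4 kB write, i.e. roughly 3 x 4 MB read plus 3 x 4 MB written.

So on the first touch of each object after a snapshot the amplification is on the order of 1000x (4 MB / 4 kB), not 2x; later writes to the same object are not amplified again until the next snapshot is taken.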

Thomas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: mercredi 16 novembre 2016 13:52
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Yes, our application is doing small random IO and I did not realize that the 
snapshotting feature could degrade performance so much in that case.

We have just deactivated it and deleted all snapshots. I will let you know if 
this drastically reduces the blocked ops and, consequently, the IO freezes on 
the client side.

Thanks

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 16 novembre 2016 13:25
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 15 November 2016 21:14
To: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Very interesting ...

Any idea why the optimal tunables would help here? On our cluster we have 500TB 
of data, and I am a bit concerned about changing them without taking a lot of 
precautions ...
I am curious to know how long it took you to change the tunables, the size of 
your cluster, and the observed impact on client IO ...
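
For reference, the tunables change itself is a single command, but it is exactly what triggers the large data movement being discussed here, so throttling backfill first is advisable (only a sketch):

ceph osd crush show-tunables      # show the current (firefly) profile
ceph osd crush tunables optimal   # switch profile; expect a large fraction of the objects to move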

Yes, we do have daily rbd snapshots from 16 different Ceph RBD clients. 
Snapshotting the RBD image is quite immediate, while we are seeing the issue 
continuously during the day...

Just to point out that when you take a snapshot any writes to the original RBD 
will mean that the full 4MB object is copied into the snapshot. If you have a 
lot of small random IO going on the original RBD this can lead to massive write 
amplification across the cluster and may cause issues such as what you describe.

Also be aware that deleting large snapshots puts significant strain on the 
OSDs as they try to delete hundreds of thousands of objects.
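
On the snapshot-deletion side, the trimming work can be slowed down so it competes less with client IO, e.g. (a sketch; 0.1-0.5 s are the kind of values mentioned further down in this thread):

ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.5'    # sleep, in seconds, between snap trim operations on each OSD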


Will check all of this tomorrow ...

Thanks again

Thomas



Sent from my Samsung device


 Original message 
From: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Date: 11/15/16 21:27 (GMT+01:00)
To: Thomas Danan <thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently
On 11/15/16 14:05, Thomas Danan wrote:
> Hi Peter,
>
> Ceph cluster version is 0.94.5 and we are running with Firefly tunables and 
> also we have 10K PGs instead of the 30K / 40K we should have.
> The linux kernel version is 3.10.0-327.36.1.el7.x86_64 with RHEL 7.2
>
> On our side we have the following settings:
> mon_osd_adjust_heartbeat_grace = false

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-16 Thread Thomas Danan
Hi Nick,

Yes, our application is doing small random IO and I did not realize that the 
snapshotting feature could degrade performance so much in that case.

We have just deactivated it and deleted all snapshots. I will let you know if 
this drastically reduces the blocked ops and, consequently, the IO freezes on 
the client side.

Thanks

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: mercredi 16 novembre 2016 13:25
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 15 November 2016 21:14
To: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Very interesting ...

Any idea why the optimal tunables would help here? On our cluster we have 500TB 
of data, and I am a bit concerned about changing them without taking a lot of 
precautions ...
I am curious to know how long it took you to change the tunables, the size of 
your cluster, and the observed impact on client IO ...

Yes, we do have daily rbd snapshots from 16 different Ceph RBD clients. 
Snapshotting the RBD image is quite immediate, while we are seeing the issue 
continuously during the day...

Just to point out that when you take a snapshot any writes to the original RBD 
will mean that the full 4MB object is copied into the snapshot. If you have a 
lot of small random IO going on the original RBD this can lead to massive write 
amplification across the cluster and may cause issues such as what you describe.

Also be aware that deleting large snapshots puts significant strain on the 
OSDs as they try to delete hundreds of thousands of objects.


Will check all of this tomorrow ...

Thanks again

Thomas



Sent from my Samsung device


 Original message 
From: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Date: 11/15/16 21:27 (GMT+01:00)
To: Thomas Danan <thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently
On 11/15/16 14:05, Thomas Danan wrote:
> Hi Peter,
>
> Ceph cluster version is 0.94.5 and we are running with Firefly tunables and 
> also we have 10K PGs instead of the 30K / 40K we should have.
> The linux kernel version is 3.10.0-327.36.1.el7.x86_64 with RHEL 7.2
>
> On our side we have the following settings:
> mon_osd_adjust_heartbeat_grace = false
> mon_osd_adjust_down_out_interval = false
> mon_osd_min_down_reporters = 5
> mon_osd_min_down_reports = 10
>
> which explains why the OSDs are not flapping, but they are still behaving wrongly
> and generating the slow requests I am describing.
>
> The osd_op_complaint_time is at the default value (30 sec); I'm not sure I want
> to change it based on your experience.
I wasn't saying you should set the complaint time to 5, just saying
that's why I have complaints logged with such low block times.
> Thomas

And now I'm testing this:
osd recovery sleep = 0.5
osd snap trim sleep = 0.5

(or fiddling with it as low as 0.1 to make it rebalance faster)

I'm also changing the tunables to optimal (which will rebalance 75% of the
objects).
This has had very good results so far (a few <14s blocks right at the
start, and none since, over an hour ago).

And I'm somehow hoping that will fix my rbd export-diff issue too... but
it at least appears to fix the blocks caused by rebalancing.

Do you use rbd snapshots? I think that may be causing my issues, based
on things like:

> "description": "osd_op(client.692201.0:20455419 4.1b5a5bc1
> rbd_data.94a08238e1f29.617b [] snapc 918d=[918d]
> ack+ondisk+write+known_if_redirected e40036)",
> "initiated_at": "2016-11-15 20:57:48.313432",
> "age": 409.634862,
> "duration": 3.377347,
> ...
> {
> "time": "2016-11-15 20:57:48.313767",
> "event": "waiting for subops from 0,1,8,22"
> },
> ...
> {
> "time": "2016-11-15 20:57:51.688530",
> "event": "sub_op_applied_rec from 22"
> },


Which says "snapc" in there (CoW?), and I think shows that just one osd
is delayed a few seconds and the rest are really fast, like you said.
(and not sure why I see 4 osds here when I have 

Re: [ceph-users] ceph cluster having blocke requests very frequently

2016-11-15 Thread Thomas Danan
Hi Peter,

The Ceph cluster version is 0.94.5, we are running with Firefly tunables, and 
we also have 10K PGs instead of the 30K / 40K we should have.
The Linux kernel version is 3.10.0-327.36.1.el7.x86_64 on RHEL 7.2.
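
For what it is worth, the usual rule of thumb behind a target like 30K / 40K is: total PGs ≈ (number of OSDs x 100) / replica size, rounded up to a power of two. For example, with on the order of 1000 OSDs and size 3 that gives 1000 x 100 / 3 ≈ 33000, i.e. 32768 PGs; the OSD count here is purely an assumption for illustration, not taken from this thread.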

On our side we have the following settings:
mon_osd_adjust_heartbeat_grace = false
mon_osd_adjust_down_out_interval = false
mon_osd_min_down_reporters = 5
mon_osd_min_down_reports = 10

which explains why the OSDs are not flapping, but they are still behaving wrongly 
and generating the slow requests I am describing.

The osd_op_complaint_time is at the default value (30 sec); I'm not sure I want 
to change it based on your experience.
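
For completeness, the current value can be confirmed on a live OSD and changed at runtime if desired (a sketch; osd.0 is just an example id):

ceph daemon osd.0 config get osd_op_complaint_time       # 30 by default
ceph tell osd.* injectargs '--osd_op_complaint_time 10'  # only changes when the warning is logged, not the blocking itself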

Thomas

-Original Message-
From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de]
Sent: mardi 15 novembre 2016 13:44
To: Thomas Danan
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Which kernel version are you using?

I have a similar issue: Ubuntu 14.04, kernel 3.13.0-96-generic, and Ceph Jewel 
10.2.3.

I get logs like this:
2016-11-15 13:13:57.295067 osd.9 10.3.0.132:6817/24137 98 : cluster [WRN] 16 
slow requests, 5 included below; oldest blocked for > 7.957045 secs

I set osd_op_complaint_time=5 instead of the default of 30, but I also had lots 
of ones >30s and even >60s. Even with small blocks like 5s, qemu can hang, and 
only SIGKILL will stop it.

(also I have firefly tunables set right now, and too few pgs...working on that)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

