[ceph-users] ansible 2.8 for Nautilus

2019-05-20 Thread solarflow99
Does anyone know the necessary steps to install Ansible 2.8 on RHEL 7? I'm
assuming most people are doing it with pip?
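For reference, something along these lines usually works on RHEL 7 (a rough sketch; it assumes EPEL is enabled for python2-pip, and the version pin is only an example):

$ sudo yum install -y epel-release python2-pip
$ sudo pip install "ansible>=2.8,<2.9"
$ ansible --version    (should report 2.8.x)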


Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-05-20 Thread mr. non non
Has anyone had this issue before? From my research, many people have this issue with 
rgw.index, where it is related to too small a number of index shards (too many 
objects per shard).
I also checked this thread 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033611.html but 
didn't find any clues, because the number of data objects is below 100k per index 
and the size of the objects in rgw.log is 0.

Thanks.

From: ceph-users  on behalf of mr. non non 

Sent: Monday, May 20, 2019 7:32 PM
To: EDH - Manuel Rios Fernandez; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

Hi Manuel,

I use version 12.2.8 with BlueStore and also use manual index sharding 
(configured to 100). As far as I have checked, no bucket reaches 100k objects_per_shard.
Here are the health status and cluster log:

# ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'default.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

# cat ceph.log | tail -2
2019-05-19 17:49:36.306481 mon.MONNODE1 mon.0 10.118.191.231:6789/0 528758 : 
cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2019-05-19 17:49:34.535543 osd.38 osd.38 MONNODE1_IP:6808/3514427 12 : cluster 
[WRN] Large omap object found. Object: 4:b172cd59:usage::usage.26:head Key 
count: 8720830 Size (bytes): 1647024346
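Note that the rados stat sizes below do not include omap data, so size 0 is expected there; the omap key count of the flagged object is what triggers the warning. The object named above is the RGW usage log (usage.26 in the "usage" namespace of default.rgw.log), so it can be inspected, and optionally trimmed, roughly like this (a sketch; the date is a placeholder, and trimming discards usage accounting for that period):

$ rados -p default.rgw.log -N usage listomapkeys usage.26 | wc -l
$ radosgw-admin usage trim --end-date=2019-04-30    (only if the usage log is not needed that far back)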

All of the objects have size 0.
$  for i in `rados ls -p default.rgw.log`; do rados stat -p default.rgw.log 
${i};done  | more
default.rgw.log/obj_delete_at_hint.78 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/meta.history mtime 2019-05-20 19:19:40.00, size 50
default.rgw.log/obj_delete_at_hint.70 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.000104 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.26 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.28 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.40 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.15 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.69 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.95 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.03 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.47 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.35 mtime 2019-05-20 19:31:45.00, 
size 0


Please kindly advise how to clear this HEALTH_WARN.

Many thanks.
Arnondh


From: EDH - Manuel Rios Fernandez 
Sent: Monday, May 20, 2019 5:41 PM
To: 'mr. non non'; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Large OMAP Objects in default.rgw.log pool


Hi Arnondh,



What's your Ceph version?



Regards





From: ceph-users  On behalf of mr. non non
Sent: Monday, 20 May 2019 12:39
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Large OMAP Objects in default.rgw.log pool



Hi,



I am hitting the same issue as described above.

Does anyone know how to fix it?



Thanks.

Arnondh


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Maged Mokhtar


Not sure. In general important fixes get backported, but will have to 
wait and see.


/Maged


On 20/05/2019 22:11, Frank Schilder wrote:

Dear Maged,

thanks for elaborating on this question. Is there already information in which 
release this patch will be deployed?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14




[ceph-users] PG stuck in Unknown after removing OSD - Help?

2019-05-20 Thread Tarek Zegar

I set 3 OSDs to "out", all on the same host; this should not impact the
pool because it's 3x replication and CRUSH places one replica per host.
However, we now have one PG stuck UNKNOWN. I'm not sure why this is the case; I
did have background writes going on at the time the OSDs were marked out. Thoughts?

ceph osd tree
ID CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
-1   0.08817 root default
-5   0.02939 host hostosd1
 3   hdd 0.00980 osd.3 up  1.0 1.0
 4   hdd 0.00980 osd.4 up  1.0 1.0
 5   hdd 0.00980 osd.5 up  1.0 1.0
-7   0.02939 host hostosd2
 0   hdd 0.00980 osd.0 up  1.0 1.0
 6   hdd 0.00980 osd.6 up  1.0 1.0
 8   hdd 0.00980 osd.8 up  1.0 1.0
-3   0.02939 host hostosd3
 1   hdd 0.00980 osd.1 up  0 1.0
 2   hdd 0.00980 osd.2 up  0 1.0
 7   hdd 0.00980 osd.7 up  0 1.0


ceph health detail
PG_AVAILABILITY Reduced data availability: 1 pg inactive
pg 1.e2 is stuck inactive for 1885.728547, current state unknown, last
acting [4,0]
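A couple of quick checks that sometimes help in this state (a sketch; the pg id is taken from the output above, and "ceph pg repeer" only exists on recent releases):

ceph pg map 1.e2
ceph pg repeer 1.e2    (or "ceph osd down 4" to force the current primary to re-peer)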


ceph pg 1.e2 query
{
"state": "unknown",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 132,
"up": [
4,
0
],
"acting": [
4,
0
],
"info": {
"pgid": "1.e2",
"last_update": "34'3072",
"last_complete": "34'3072",
"log_tail": "0'0",
"last_user_version": 3072,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 29,
"epoch_pool_created": 29,
"last_epoch_started": 30,
"last_interval_started": 29,
"last_epoch_clean": 30,
"last_interval_clean": 29,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 70,
"same_interval_since": 70,
"same_primary_since": 70,
"last_scrub": "0'0",
"last_scrub_stamp": "2019-05-20 21:15:42.448125",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2019-05-20 21:15:42.448125",
"last_clean_scrub_stamp": "2019-05-20 21:15:42.448125"
},
"stats": {
"version": "34'3072",
"reported_seq": "3131",
"reported_epoch": "132",
"state": "unknown",
"last_fresh": "2019-05-20 22:52:07.898135",
"last_change": "2019-05-20 22:50:46.711730",
"last_active": "2019-05-20 22:50:26.109185",
"last_peered": "2019-05-20 22:02:01.008787",
"last_clean": "2019-05-20 22:02:01.008787",
"last_became_active": "2019-05-20 21:15:43.662550",
"last_became_peered": "2019-05-20 21:15:43.662550",
"last_unstale": "2019-05-20 22:52:07.898135",
"last_undegraded": "2019-05-20 22:52:07.898135",
"last_fullsized": "2019-05-20 22:52:07.898135",
"mapping_epoch": 70,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 29,
"last_epoch_clean": 30,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2019-05-20 21:15:42.448125",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2019-05-20 21:15:42.448125",
"last_clean_scrub_stamp": "2019-05-20 21:15:42.448125",
"log_size": 3072,
"ondisk_log_size": 3072,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 12582912,
"num_objects": 3,
"num_object_clones": 0,
"num_object_copies": 9,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 3,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 3072,
"num_write_kb": 12288,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,

[ceph-users] CephFS msg length greater than osd_max_write_size

2019-05-20 Thread Ryan Leimenstoll
Hi all, 

We recently encountered an issue where our CephFS filesystem was unexpectedly 
set to read-only. Looking at some of the logs from the daemons, I can see 
the following: 

On the MDS:
...
2019-05-18 16:34:24.341 7fb3bd610700 -1 mds.0.89098 unhandled write error (90) 
Message too long, force readonly...
2019-05-18 16:34:24.341 7fb3bd610700  1 mds.0.cache force file system read-only
2019-05-18 16:34:24.341 7fb3bd610700  0 log_channel(cluster) log [WRN] : force 
file system read-only
2019-05-18 16:34:41.289 7fb3c0616700  1 heartbeat_map is_healthy 'MDSRank' had 
timed out after 15
2019-05-18 16:34:41.289 7fb3c0616700  0 mds.beacon.objmds00 Skipping beacon 
heartbeat to monitors (last acked 4.00101s ago); MDS internal heartbeat is not 
healthy!
...

On one of the OSDs it was most likely targeting:
...
2019-05-18 16:34:24.140 7f8134e6c700 -1 osd.602 pg_epoch: 682796 pg[49.20b( v 
682796'15706523 (682693'15703449,682796'15706523] local-lis/les=673041/673042 
n=10524 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682796'15706523 lcod 
682796'15706522 mlcod 682796'15706522 active+clean] do_op msg data len 95146005 
> osd_max_write_size 94371840 on osd_op(mds.0.89098:48609421 49.20b 
49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e682796) v8
2019-05-18 17:10:33.695 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c 
scrub starts
2019-05-18 17:10:34.980 7f813466b700  0 log_channel(cluster) log [DBG] : 49.31c 
scrub ok
2019-05-18 22:17:37.320 7f8134e6c700 -1 osd.602 pg_epoch: 683434 pg[49.20b( v 
682861'15706526 (682693'15703449,682861'15706526] local-lis/les=673041/673042 
n=10525 ec=245563/245563 lis/c 673041/673041 les/c/f 673042/673042/0 
673038/673041/668565) [602,530,558] r=0 lpr=673041 crt=682861'15706526 lcod 
682859'15706525 mlcod 682859'15706525 active+clean] do_op msg data len 95903764 
> osd_max_write_size 94371840 on osd_op(mds.0.91565:357877 49.20b 
49:d0630e4c:::mds0_sessionmap:head [omap-set-header,omap-set-vals,omap-rm-keys] 
snapc 0=[] ondisk+write+known_if_redirected+full_force e683434) v8
…

During this time there were some health concerns with the cluster. 
Significantly, since the error above seems to be related to the SessionMap, we 
had a client with a few requests blocked for over 35948 secs (it's a member 
of a compute cluster, so we let the node drain/finish jobs before rebooting). We 
have also had some issues with certain OSDs running on older hardware staying 
up/responding to heartbeats in time after upgrading to Nautilus, although that 
seems to be an iowait/load issue that we are actively working to mitigate 
separately.

We are running Nautilus 14.2.1 on RHEL7.6. There is only one MDS Rank, with an 
active/standby setup between two MDS nodes. MDS clients are mounted using the 
RHEL7.6 kernel driver. 

My read here is that the MDS is sending too large a message to the OSD; 
however, my understanding was that the MDS should be using osd_max_write_size to 
determine the size of that message [0]. Is this maybe a bug in how this is 
calculated on the MDS side?
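A workaround that is sometimes suggested while this gets investigated is to raise osd_max_write_size above the ~95 MB messages seen in the OSD log, e.g. via the config store on Nautilus (a sketch only; the value is an arbitrary example, and larger OSD transactions have their own costs):

ceph config set osd osd_max_write_size 256
ceph config get osd.602 osd_max_write_size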


Thanks!
Ryan Leimenstoll
rleim...@umiacs.umd.edu
University of Maryland Institute for Advanced Computer Studies



[0] https://www.spinics.net/lists/ceph-devel/msg11951.html


Re: [ceph-users] Slow requests from bluestore osds / crashing rbd-nbd

2019-05-20 Thread Jason Dillaman
On Mon, May 20, 2019 at 2:17 PM Marc Schöchlin  wrote:
>
> Hello cephers,
>
> we have a few systems which utilize an rbd-nbd map/mount to get access to an rbd 
> volume.
> (This problem seems to be related to "[ceph-users] Slow requests from 
> bluestore osds" (the original thread))
>
> Unfortunately the rbd-nbd device of one system has crashed on three Mondays in a row 
> at ~00:00, when the systemd fstrim timer executes "fstrim -av".
> (which runs in parallel to deep scrub operations)

That's probably not a good practice if you have lots of VMs doing this
at the same time *and* you are not using object-map. The reason is
that "fstrim" can discard huge extents, resulting in around a thousand
concurrent remove/truncate/zero ops per image being thrown at your
cluster.
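If the images do not have it yet, the object map can be enabled on the fly; a rough sketch (pool/image names are placeholders, and exclusive-lock is a prerequisite, included here):

rbd feature enable mypool/myimage exclusive-lock object-map fast-diff
rbd object-map rebuild mypool/myimage

Staggering the fstrim.timer across VMs (e.g. with a RandomizedDelaySec override) also helps avoid all images trimming in the same minute.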

> After that, the device constantly reports I/O errors every time the 
> filesystem is accessed.
> Unmounting, remapping and mounting helped to get the filesystem/device back 
> into business :-)

If the cluster was being DDoSed by the fstrims, the VM OSes might
have timed out, assuming a controller failure.

> Manual 30-minute stress tests using the following fio command did not produce 
> any problems on the client side 
> (the Ceph storage reported some slow requests while testing).
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test 
> --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw 
> --rwmixread=50 --numjobs=50 --loops=10
>
> It seems that others have also experienced this problem: 
> https://ceph-users.ceph.narkive.com/2FIfyx1U/rbd-nbd-timeout-and-crash
> The change for setting device timeouts does not seem to have been merged into Luminous.
> Experiments setting the timeout manually after mapping using 
> https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c haven't 
> changed the situation.
>
> Do you have suggestions how to analyze/solve the situation?
>
> Regards
> Marc
> 
>
>
>
> The client kernel throws messages like this:
>
> May 19 23:59:01 int-nfs-001 CRON[836295]: (root) CMD (command -v debian-sa1 > 
> /dev/null && debian-sa1 60 2)
> May 20 00:00:30 int-nfs-001 systemd[1]: Starting Discard unused blocks...
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623582] block nbd0: Connection 
> timed out
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623613] block nbd0: shutting 
> down sockets
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623617] print_req_error: I/O 
> error, dev nbd0, sector 84082280
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623632] block nbd0: Connection 
> timed out
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623636] print_req_error: I/O 
> error, dev nbd0, sector 92470887
> May 20 00:01:02 int-nfs-001 kernel: [1077851.623642] block nbd0: Connection 
> timed out
>
> Ceph throws messages like this:
>
> 2019-05-20 00:00:00.000124 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173572 
> : cluster [INF] overall HEALTH_OK
> 2019-05-20 00:00:54.249998 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173586 
> : cluster [WRN] Health check failed: 644 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:00.330566 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173587 
> : cluster [WRN] Health check update: 594 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:09.768476 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173591 
> : cluster [WRN] Health check update: 505 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:14.768769 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173592 
> : cluster [WRN] Health check update: 497 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:20.610398 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173593 
> : cluster [WRN] Health check update: 509 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:28.721891 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173594 
> : cluster [WRN] Health check update: 501 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:34.909842 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173596 
> : cluster [WRN] Health check update: 494 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:44.770330 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173597 
> : cluster [WRN] Health check update: 500 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:49.770625 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173599 
> : cluster [WRN] Health check update: 608 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:01:55.073734 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173600 
> : cluster [WRN] Health check update: 593 slow requests are blocked > 32 sec. 
> Implicated osds 51 (REQUEST_SLOW)
> 2019-05-20 00:02:04.771432 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173607 
> : cluster [WRN] Health 

Re: [ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Jason Dillaman
On Mon, May 20, 2019 at 11:14 AM Rainer Krienke  wrote:
>
> Am 20.05.19 um 09:06 schrieb Jason Dillaman:
>
> >> $ rbd --namespace=testnamespace map rbd/rbdtestns --name client.rainer
> >> --keyring=/etc/ceph/ceph.keyring
> >> rbd: sysfs write failed
> >> rbd: error opening image rbdtestns: (1) Operation not permitted
> >> In some cases useful info is found in syslog - try "dmesg | tail".
> >> 2019-05-20 08:18:29.187 7f42ab7fe700 -1 librbd::image::RefreshRequest:
> >> failed to retrieve pool metadata: (1) Operation not permitted
> >> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::image::OpenRequest:
> >> failed to refresh image: (1) Operation not permitted
> >> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::ImageState:
> >> 0x561792408860 failed to open image: (1) Operation not permitted
> >> rbd: map failed: (22) Invalid argument
> >
> > Hmm, it looks like we overlooked updating the 'rbd' profile when PR
> > 27423 [1] was merged into v14.2.1. We'll get that fixed, but in the
> > meantime, you can add a "class rbd metadata_list" cap on the base pool
> > (w/o the namespace restriction) [2].
> >
>
> Thanks for your answer. Well, I still have kernel 4.15, so namespaces
> won't work for me at the moment.
>
> Could you please explain what the magic behind "class rbd metadata_list"
> is? Is it meant to "simply" allow access to the base pool (rbd in my
> case), so that I authorize access to the pool instead of a namespace? And if
> this is true, then I do not understand the difference between your class cap
> and a cap like osd 'allow rw pool=rbd'.

It allows access to invoke a single OSD object class method named
rbd.metadata_list, which is a read-only operation. Therefore, you are
giving access to read pool-level configuration overrides but not
access to read/write/execute any other things in the base pool. You
could further restrict it to the "rbd_info" object when combined w/
the "object_prefix rbd_info" matcher.
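As a concrete, untested sketch of that (client name, pool and namespace are the ones from this thread; note that "ceph auth caps" replaces all existing caps, so keep the mon cap and anything else the user already has):

ceph auth caps client.rainer \
    mon 'profile rbd' \
    osd 'profile rbd pool=rbd namespace=testnamespace, allow class rbd metadata_list pool=rbd object_prefix rbd_info'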

> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html



-- 
Jason


[ceph-users] PG stuck down after OSD failures and recovery

2019-05-20 Thread Krzysztof Klimonda
Hi,

We’ve observed the following chain of events on our luminous (12.2.8) cluster:

To summarize - we're running this pool with min_size = 1, size = 2. On Friday 
we "temporarily" lost osd.52 (the disk changed letter and the host had to be 
restarted, which we planned to do after the weekend), which was primary for PG 
`3.1795`. On Sunday we lost osd.501, which had become primary for that PG. Today, 
after we restored osd.52, it tried peering with osd.501 to get the log(?), got 
stuck, and was marked down:

——8<——8<——
"recovery_state": [
{
"name": "Started/Primary/Peering/Down",
"enter_time": "2019-05-20 09:24:23.907107",
"comment": "not enough up instances of this PG to go active"
},
{
"name": "Started/Primary/Peering",
"enter_time": "2019-05-20 09:24:23.907055",
"past_intervals": [
{
"first": "196378",
"last": "196975",
"all_participants": [
{
"osd": 52
},
{
"osd": 448
},
{
"osd": 501
},
{
"osd": 635
}
],
"intervals": [
{
"first": "196833",
"last": "196834",
"acting": "501"
},
{
"first": "196868",
"last": "196870",
"acting": "448"
},
{
"first": "196871",
"last": "196975",
"acting": "448,635"
}
]
}
],
"probing_osds": [
"52",
"448",
"635"
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
501
],
"peering_blocked_by": [
{
"osd": 501,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us 
proceed"
}
]
},
{
"name": "Started",
"enter_time": "2019-05-20 09:24:23.907017"
}
],
——8<——8<——
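For reference, the usual ways out of a state like this are either to bring osd.501 back so the PG can finish peering, or - only if the OSD is permanently gone and the data-loss implications are understood - to mark it lost so peering can proceed without it (a sketch):

systemctl start ceph-osd@501    (on the host carrying osd.501, if the disk is usable)
ceph osd lost 501 --yes-i-really-mean-it    (last resort; any writes that only osd.501 held are lost)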

For the detailed description see below:

1. We start with an available PG `3.1795`:

ceph-osd.501.log.2.gz:2019-05-17 15:18:09.518358 7f9a09125700  1 osd.501 
pg_epoch: 196830 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] 
local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 
196379/196379/0 196830/196830/196830) [501] r=0 lpr=196830 pi=[196378,196830)/1 
luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 active] 
start_peering_interval up [52,501] -> [501], acting [52,501] -> [501], 
acting_primary 52 -> 501, up_primary 52 -> 501, role 1 -> 0, features acting 
4611087853745930235 upacting 4611087853745930235

2. One of the SSDs serving an OSD disappears from the system, bringing down osd.52:

(log from its peer)
ceph-osd.501.log.2.gz:2019-05-17 15:18:09.518358 7f9a09125700  1 osd.501 
pg_epoch: 196830 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] 
local-lis/les=196378/196379 n=17 ec=100709/39 lis/c 196378/196378 les/c/f 
196379/196379/0 196830/196830/196830) [501] r=0 lpr=196830 pi=[196378,196830)/1 
luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 active] 
start_peering_interval up [52,501] -> [501], acting [52,501] -> [501], 
acting_primary 52 -> 501, up_primary 52 -> 501, role 1 -> 0, features acting 
4611087853745930235 upacting 4611087853745930235

3. 10 minutes later osd.501 has had enough, a new second replica for PG is 
chosen:

ceph-osd.501.log.2.gz:2019-05-17 15:28:10.308940 7f9a08924700  1 osd.501 
pg_epoch: 196832 pg[3.1795( v 196821'122308 (179253'120770,196821'122308] 
local-lis/les=196830/196831 n=17 ec=100709/39 lis/c 196830/196378 les/c/f 
196831/196379/0 196832/196832/196830) [501,448] r=0 lpr=196832 
pi=[196378,196832)/1 luod=0'0 crt=196821'122308 lcod 196821'122307 mlcod 0'0 
active] start_peering_interval up [501] -> [501,448], acting [501] -> 
[501,448], acting_primary 501 -> 501, up_primary 501 -> 501, role 0 -> 0, 
features acting 4611087853745930235 upacting 4611087853745930235

4. Two days later, we lose osd.501 and osd.448 becomes sole OSD for that PG:

ceph-osd.448.log.1.gz:2019-05-19 13:08:36.018264 7fae789ee700  1 osd.448 
pg_epoch: 196852 pg[3.1795( v 196821'122308 (179253'120808,196821'122308] 
local-lis/les=196835/196836 n=17 ec=100709/39 lis/c 196835/196835 

Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
Dear Maged,

thanks for elaborating on this question. Is there already information in which 
release this patch will be deployed?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
If min_size=1 and you lose the last disk, that's the end of any data that was only 
on this disk.

Apart from this, using size=2 and min_size=1 is a really bad idea. This has 
nothing to do with data replication but rather with an inherent problem with 
high availability and the number 2. You need at least 3 members of an HA group 
to ensure stable operation with proper majorities. There are numerous stories 
about OSD flapping caused by size-2 min_size-1 pools, leading to situations 
that are extremely hard to recover from. My favourite is this one: 
https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
 . You will easily find more. The deeper problem here is called "split-brain" 
and there is no real solution to it except to avoid it at all cost.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Florent B 
Sent: 20 May 2019 21:33
To: Paul Emmerich; Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

I understand better thanks to Frank & Paul messages.

Paul, when min_size=k, is it the same problem with replicated pool size=2 & 
min_size=1 ?

On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting 
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written your 
data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems 
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't mean 
it's a good idea.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Maged Mokhtar


On 20/05/2019 19:37, Frank Schilder wrote:

This is an issue that is coming up every now and then (for example: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and I would consider it a 
very serious one (I will give an example below). A statement like "min_size = k is unsafe and 
should never be set" deserves a bit more explanation, because ceph is the only storage system 
I know of for which k+m redundancy does *not* mean "you can lose up to m disks and still 
have read-write access". If this is really true then, assuming the same redundancy level, 
losing service (client access) is significantly more likely with ceph than with other storage 
systems. And this has an impact on design and storage pricing.

However, some help seems on the way and an, in my opinion, utterly important 
feature update seems almost finished: https://github.com/ceph/ceph/pull/17619 . 
It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). 
This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that makes it 
problematic to use min_size=k. I couldn't find anything better than vague explanations. 
Quote from the thread above: "Recovery on EC pools requires min_size rather than k 
shards at this time. There were reasons; they weren't great."

This is actually a situation I was in. I once lost 2 failure domains simultaneously on an 
8+2 EC pool and was really surprised that recovery stopped after some time with the worst 
degraded PGs remaining unfixed. I discovered the min_size=9 (instead of 8) and "ceph 
health detail" recommended to reduce min_size. Before doing so, I searched the web 
(I mean, why the default k+1? Come on, there must be a reason.) and found some vague 
hints about problems with min_size=k during rebuild. This is a really bad corner to be 
in. A lot of PGs are already critically degraded and the only way forward was to make a 
bad situation worse, because reducing min_size would immediately enable client I/O in 
addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does have some rare issues and 
these seem not to disappear. (I hope I'm wrong though.) Hence, if min_size=k will remain problematic, 
the recommendation should be "never to use m=1" instead of "never use min_size=k". 
In other words, instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one would like 
to have secure write access for n disk losses, then m>=n+1.

If this issue remains, in my opinion this should be taken up in the best 
practices section. In particular, the documentation should not use examples 
with m=1, this gives the wrong impression. Either min_size=k is safe or not. If 
it is not, it should never be used anywhere in the documentation.

I hope I marked my opinions and hypotheses clearly and that the links are 
helpful. If anyone could shed some light on as to why exactly min_size=k+1 is 
important, I would be grateful.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


I think we should separate the issue of recovery versus that of allowing 
client writes.


recovery: logically, if you have k chunks, recovery should be active, 
since you still have no data loss and can re-generate the remaining m 
chunks. For reasons not clear to many, current recovery activation 
requires a minimum of min_size chunks rather than k. The patch you 
mention fixes this issue and ties the recovery process to k rather than 
min_size.


client write safety: this is controlled by the min_size value; the 
default min_size is k+1. The reason is that if you have m failures, you 
still have no data loss, but you are in a critical window with no 
redundancy: any new writes will be stored in only k chunks, so if you 
have an additional failure, the data written in this time window will be 
lost. You can well argue that if you are unfortunate enough to reach an 
m+1 failure situation then you are in deep trouble anyway and should 
worry much more about the bulk of the data already stored, not just the 
new data. The argument I see here (maybe there is something else) is 
that there is still a significant probability that some of the m failed 
hosts/disks can be recovered, so even if at some point you had m+1 
failures, not all hope is lost with regards to the existing data, 
whereas loss is guaranteed for the new data.


Due to the first point on recovery being tied to min_size, currently if 
you are in a situation where you have m failures, the recommendation is 
to 

Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
Dear Paul,

thank you very much for this clarification. I believe ZFS erasure-coded data 
also has this property, which is probably the main cause for the expectation of 
min_size=k. So, basically, min_size=k means that we are at the security level 
of traditional redundant storage, and this may or may not be good enough - there 
is no additional risk beyond that. Ceph's default says it is not good enough. 
That's perfectly fine - assuming the rebuild gets fixed.

I have a follow-up: I thought that non-redundant writes would almost never 
occur, because PGs get remapped before accepting writes. To stay with my 
example of 2 (out of 16) failure domains failing simultaneously, I thought that 
all PGs will immediately be remapped to fully redundant sets, because there are 
still 14 failure domains up and only 10 are needed for the 8+2 EC profile. 
Furthermore, I assumed that writes would not be accepted before a PG is 
remapped, meaning that every new write will always be fully redundant while 
recovery I/O slowly recreates the missing objects in the background.

If this "remap first" strategy is not the current behaviour, would it make 
sense to consider this as an interesting feature? Is there any reason for not 
remapping all PGs (if possible) prior to starting recovery? It would eliminate 
the lack of redundancy for new writes (at least for new objects).

Thanks again and best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Paul Emmerich 
Sent: 20 May 2019 21:23
To: Frank Schilder
Cc: florent; ceph-users
Subject: Re: [ceph-users] Default min_size value for EC pools

Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting 
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written your 
data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems 
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't mean 
it's a good idea.


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
This is an issue that is coming up every now and then (for example: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and I 
would consider it a very serious one (I will give an example below). A 
statement like "min_size = k is unsafe and should never be set" deserves a bit 
more explanation, because ceph is the only storage system I know of, for which 
k+m redundancy does *not* mean "you can lose up to m disks and still have 
read-write access". If this is really true then, assuming the same redundancy 
level, losing service (client access) is significantly more likely with ceph 
than with other storage systems. And this has an impact on design and storage 
pricing.

However, some help seems on the way and an, in my opinion, utterly important 
feature update seems almost finished: https://github.com/ceph/ceph/pull/17619 . 
It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). 
This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that 
makes it problematic to use min_size=k. I couldn't find anything better than 
vague explanations. Quote from the thread above: "Recovery on EC pools requires 
min_size rather than k shards at this time. There were reasons; they weren't 
great."

This is actually a situation I was in. I once lost 2 failure domains 
simultaneously on an 8+2 EC pool and was really surprised that recovery stopped 
after some time with the worst degraded PGs remaining unfixed. I discovered the 
min_size=9 (instead of 8) and "ceph health detail" recommended to reduce 
min_size. Before doing so, I searched the web (I mean, why the default k+1? 
Come on, there must be a reason.) and found some vague hints about problems 
with min_size=k during rebuild. This is a really bad corner to be in. A lot of 
PGs are already critically degraded and the only way forward was to make a bad 
situation worse, because reducing min_size would immediately enable client I/O 
in addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does 
have some rare issues and these seem not to disappear. (I hope I'm wrong 
though.) Hence, if min_size=k will remain problematic, the recommendation 
should be "never 

Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Paul Emmerich
Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting
writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, i've written
your data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional raid 5 and raid 6 systems
during rebuild (with 1 or 2 disk failures for raid 5/6), but that doesn't
mean it's a good idea.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 7:37 PM Frank Schilder  wrote:

> This is an issue that is coming up every now and then (for example:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and
> I would consider it a very serious one (I will give an example below). A
> statement like "min_size = k is unsafe and should never be set" deserves a
> bit more explanation, because ceph is the only storage system I know of,
> for which k+m redundancy does *not* mean "you can lose up to m disks and
> still have read-write access". If this is really true then, assuming the
> same redundancy level, losing service (client access) is significantly
> more likely with ceph than with other storage systems. And this has an impact
> on design and storage pricing.
>
> However, some help seems on the way and an, in my opinion, utterly
> important feature update seems almost finished:
> https://github.com/ceph/ceph/pull/17619 . It will implement the following:
>
> - recovery I/O happens as long as k shards are available (this is new)
> - client I/O will happen as long as min_size shards are available
> - recommended is min_size=k+1 (this might be wrong)
>
> This is pretty good and much better than the current behaviour (see
> below). This pull request also offers useful further information.
>
> Apparently, there is some kind of rare issue with erasure coding in ceph
> that makes it problematic to use min_size=k. I couldn't find anything
> better than vague explanations. Quote from the thread above: "Recovery on
> EC pools requires min_size rather than k shards at this time. There were
> reasons; they weren't great."
>
> This is actually a situation I was in. I once lost 2 failure domains
> simultaneously on an 8+2 EC pool and was really surprised that recovery
> stopped after some time with the worst degraded PGs remaining unfixed. I
> discovered the min_size=9 (instead of 8) and "ceph health detail"
> recommended to reduce min_size. Before doing so, I searched the web (I
> mean, why the default k+1? Come on, there must be a reason.) and found some
> vague hints about problems with min_size=k during rebuild. This is a really
> bad corner to be in. A lot of PGs are already critically degraded and the
> only way forward was to make a bad situation worse, because reducing
> min_size would immediately enable client I/O in addition to recovery I/O.
>
> It looks like the default of min_size=k+1 will stay, because min_size=k
> does have some rare issues and these seem not to disappear. (I hope I'm
> wrong though.) Hence, if min_size=k will remain problematic, the
> recommendation should be "never to use m=1" instead of "never use
> min_size=k". In other words, instead of using a 2+1 EC profile, one should
> use a 4+2 EC profile. If one would like to have secure write access for n
> disk losses, then m>=n+1.
>
> If this issue remains, in my opinion this should be taken up in the best
> practices section. In particular, the documentation should not use examples
> with m=1, this gives the wrong impression. Either min_size=k is safe or
> not. If it is not, it should never be used anywhere in the documentation.
>
> I hope I marked my opinions and hypotheses clearly and that the links are
> helpful. If anyone could shed some light on as to why exactly min_size=k+1
> is important, I would be grateful.
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>


Re: [ceph-users] Default min_size value for EC pools

2019-05-20 Thread Frank Schilder
This is an issue that is coming up every now and then (for example: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg50415.html) and I 
would consider it a very serious one (I will give an example below). A 
statement like "min_size = k is unsafe and should never be set" deserves a bit 
more explanation, because ceph is the only storage system I know of, for which 
k+m redundancy does *not* mean "you can lose up to m disks and still have 
read-write access". If this is really true then, assuming the same redundancy 
level, losing service (client access) is significantly more likely with ceph 
than with other storage systems. And this has an impact on design and storage 
pricing.

However, some help seems on the way and an, in my opinion, utterly important 
feature update seems almost finished: https://github.com/ceph/ceph/pull/17619 . 
It will implement the following:

- recovery I/O happens as long as k shards are available (this is new)
- client I/O will happen as long as min_size shards are available
- recommended is min_size=k+1 (this might be wrong)

This is pretty good and much better than the current behaviour (see below). 
This pull request also offers useful further information.

Apparently, there is some kind of rare issue with erasure coding in ceph that 
makes it problematic to use min_size=k. I couldn't find anything better than 
vague explanations. Quote from the thread above: "Recovery on EC pools requires 
min_size rather than k shards at this time. There were reasons; they weren't 
great."

This is actually a situation I was in. I once lost 2 failure domains 
simultaneously on an 8+2 EC pool and was really surprised that recovery stopped 
after some time with the worst degraded PGs remaining unfixed. I discovered the 
min_size=9 (instead of 8) and "ceph health detail" recommended to reduce 
min_size. Before doing so, I searched the web (I mean, why the default k+1? 
Come on, there must be a reason.) and found some vague hints about problems 
with min_size=k during rebuild. This is a really bad corner to be in. A lot of 
PGs are already critically degraded and the only way forward was to make a bad 
situation worse, because reducing min_size would immediately enable client I/O 
in addition to recovery I/O.

It looks like the default of min_size=k+1 will stay, because min_size=k does 
have some rare issues and these seem not to disappear. (I hope I'm wrong 
though.) Hence, if min_size=k will remain problematic, the recommendation 
should be "never to use m=1" instead of "never use min_size=k". In other words, 
instead of using a 2+1 EC profile, one should use a 4+2 EC profile. If one 
would like to have secure write access for n disk losses, then m>=n+1.
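As a concrete sketch of that recommendation (profile, pool name and PG counts are placeholders only):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec42
ceph osd pool get ecpool min_size    (defaults to k+1 = 5 on recent releases)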

If this issue remains, in my opinion this should be taken up in the best 
practices section. In particular, the documentation should not use examples 
with m=1, this gives the wrong impression. Either min_size=k is safe or not. If 
it is not, it should never be used anywhere in the documentation.

I hope I marked my opinions and hypotheses clearly and that the links are 
helpful. If anyone could shed some light on as to why exactly min_size=k+1 is 
important, I would be grateful.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


Re: [ceph-users] Noob question - ceph-mgr crash on arm

2019-05-20 Thread Torben Hørup
Hi

tcmalloc on ARMv7 is problematic. You need to compile your own build with 
either jemalloc or plain libc malloc.
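Roughly like this when building from source (a sketch only; the cmake option spelling may differ between releases):

$ ./do_cmake.sh -DALLOCATOR=libc    (or -DALLOCATOR=jemalloc)
$ cd build && make ceph-mgr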

/Torben

On 20 May 2019 17:48:40 CEST, "Jesper Taxbøl"  wrote:
>I am trying to set up a Ceph cluster on 4 odroid-hc2 instances on top of
>Ubuntu 18.04.
>
>My ceph-mgr daemon keeps crashing on me.
>
>Any advice on how to proceed?
>
>Log on mgr node says something about ms_dispatch:
>
>2019-05-20 15:34:43.070424 b6714230  0 set uid:gid to 64045:64045
>(ceph:ceph)
>2019-05-20 15:34:43.070455 b6714230  0 ceph version 12.2.11
>(26dc3775efc7bb286a1d6d66faee0b
>a30ea23eee) luminous (stable), process ceph-mgr, pid 1169
>2019-05-20 15:34:43.070799 b6714230  0 pidfile_write: ignore empty
>--pid-file
>2019-05-20 15:34:43.101162 b6714230  1 mgr send_beacon standby
>2019-05-20 15:34:43.124462 b06f8c30 -1 *** Caught signal (Segmentation
>fault) **
>in thread b06f8c30 thread_name:ms_dispatch
>
>ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee)
>luminous
>(stable)
>1: (()+0x30133c) [0x77033c]
>2: (()+0x25750) [0xb688a750]
>3: (_ULarm_step()+0x55) [0xb6816ce6]
>4: (()+0x255e8) [0xb6cd85e8]
>5: (GetStackTrace(void**, int, int)+0x25) [0xb6cd8a3e]
>6: (tcmalloc::PageHeap::GrowHeap(unsigned int)+0xb9) [0xb6ccd36a]
>7: (tcmalloc::PageHeap::New(unsigned int)+0x79) [0xb6ccd5e6]
>8: (tcmalloc::CentralFreeList::Populate()+0x71) [0xb6ccc5ce]
>9: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**,
>void**)+0x1b) [0xb6ccc76
>0]
>10: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x6d)
>[0xb6ccc7de]
>11: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int,
>unsigned
>int)+0x51) [0xb6c
>cea56]
>12: (malloc()+0x22d) [0xb6cd9a8e]
>NOTE: a copy of the executable, or `objdump -rdS ` is
>needed to
>interpret this
>.
>
>--- begin dump of recent events ---
>  -90> 2019-05-20 15:34:43.053293 b6714230  5 asok(0x55b5320)
>register_command perfcounter
>s_dump hook 0x554c088
>  -89> 2019-05-20 15:34:43.053322 b6714230  5 asok(0x55b5320)
>register_command 1 hook 0x55
>4c088
>  -88> 2019-05-20 15:34:43.053330 b6714230  5 asok(0x55b5320)
>register_command perf dump h
>ook 0x554c088
>  -87> 2019-05-20 15:34:43.053341 b6714230  5 asok(0x55b5320)
>register_command perfcounter
>s_schema hook 0x554c088
>  -86> 2019-05-20 15:34:43.053360 b6714230  5 asok(0x55b5320)
>register_command perf histog
>ram dump hook 0x554c088
>  -85> 2019-05-20 15:34:43.053374 b6714230  5 asok(0x55b5320)
>register_command 2 hook 0x55
>4c088
>  -84> 2019-05-20 15:34:43.053381 b6714230  5 asok(0x55b5320)
>register_command perf schema
>hook 0x554c088
>  -83> 2019-05-20 15:34:43.053389 b6714230  5 asok(0x55b5320)
>register_command perf histog
>ram schema hook 0x554c088
>  -82> 2019-05-20 15:34:43.053410 b6714230  5 asok(0x55b5320)
>register_command perf reset
>hook 0x554c088
>  -81> 2019-05-20 15:34:43.053418 b6714230  5 asok(0x55b5320)
>register_command config show
>hook 0x554c088
>  -80> 2019-05-20 15:34:43.053425 b6714230  5 asok(0x55b5320)
>register_command config help
>hook 0x554c088
>  -79> 2019-05-20 15:34:43.053436 b6714230  5 asok(0x55b5320)
>register_command config set
>hook 0x554c088
>  -78> 2019-05-20 15:34:43.053444 b6714230  5 asok(0x55b5320)
>register_command config get
>hook 0x554c088
>  -77> 2019-05-20 15:34:43.053459 b6714230  5 asok(0x55b5320)
>register_command config diff
>hook 0x554c088
>  -76> 2019-05-20 15:34:43.053467 b6714230  5 asok(0x55b5320)
>register_command config diff
>get hook 0x554c088
>  -75> 2019-05-20 15:34:43.053475 b6714230  5 asok(0x55b5320)
>register_command log flush h
>ook 0x554c088
>  -74> 2019-05-20 15:34:43.053482 b6714230  5 asok(0x55b5320)
>register_command log dump ho
>ok 0x554c088
>  -73> 2019-05-20 15:34:43.053490 b6714230  5 asok(0x55b5320)
>register_command log reopen
>hook 0x554c088
>  -72> 2019-05-20 15:34:43.053513 b6714230  5 asok(0x55b5320)
>register_command dump_mempoo
>ls hook 0x56e3504
> -71> 2019-05-20 15:34:43.070424 b6714230  0 set uid:gid to 64045:64045
>(ceph:ceph)
>  -70> 2019-05-20 15:34:43.070455 b6714230  0 ceph version 12.2.11
>(26dc3775efc7bb286a1d6d
>66faee0ba30ea23eee) luminous (stable), process ceph-mgr, pid 1169
>-69> 2019-05-20 15:34:43.070799 b6714230  0 pidfile_write: ignore empty
>--pid-file
>  -68> 2019-05-20 15:34:43.074441 b6714230  5 asok(0x55b5320) init
>/var/run/ceph/ceph-mgr.
>odroid-c.asok
>  -67> 2019-05-20 15:34:43.074473 b6714230  5 asok(0x55b5320)
>bind_and_listen /var/run/cep
>h/ceph-mgr.odroid-c.asok
>  -66> 2019-05-20 15:34:43.074615 b6714230  5 asok(0x55b5320)
>register_command 0 hook 0x55
>4c1d0
>  -65> 2019-05-20 15:34:43.074633 b6714230  5 asok(0x55b5320)
>register_command version hoo
>k 0x554c1d0
>  -64> 2019-05-20 15:34:43.074654 b6714230  5 asok(0x55b5320)
>register_command git_version
>hook 0x554c1d0
>  -63> 2019-05-20 15:34:43.074674 b6714230  5 asok(0x55b5320)
>register_command help hook 0
>x554c1d8
>  -62> 2019-05-20 15:34:43.074694 b6714230  5 asok(0x55b5320)
>register_command get_command
>_descriptions hook 0x554c1e0
>-61> 

[ceph-users] Noob question - ceph-mgr crash on arm

2019-05-20 Thread Jesper Taxbøl
I am trying to set up a Ceph cluster on 4 odroid-hc2 instances on top of
Ubuntu 18.04.

My ceph-mgr daemon keeps crashing on me.

Any advice on how to proceed?

Log on mgr node says something about ms_dispatch:

2019-05-20 15:34:43.070424 b6714230  0 set uid:gid to 64045:64045
(ceph:ceph)
2019-05-20 15:34:43.070455 b6714230  0 ceph version 12.2.11
(26dc3775efc7bb286a1d6d66faee0b
a30ea23eee) luminous (stable), process ceph-mgr, pid 1169
2019-05-20 15:34:43.070799 b6714230  0 pidfile_write: ignore empty
--pid-file
2019-05-20 15:34:43.101162 b6714230  1 mgr send_beacon standby
2019-05-20 15:34:43.124462 b06f8c30 -1 *** Caught signal (Segmentation
fault) **
in thread b06f8c30 thread_name:ms_dispatch

ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous
(stable)
1: (()+0x30133c) [0x77033c]
2: (()+0x25750) [0xb688a750]
3: (_ULarm_step()+0x55) [0xb6816ce6]
4: (()+0x255e8) [0xb6cd85e8]
5: (GetStackTrace(void**, int, int)+0x25) [0xb6cd8a3e]
6: (tcmalloc::PageHeap::GrowHeap(unsigned int)+0xb9) [0xb6ccd36a]
7: (tcmalloc::PageHeap::New(unsigned int)+0x79) [0xb6ccd5e6]
8: (tcmalloc::CentralFreeList::Populate()+0x71) [0xb6ccc5ce]
9: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**,
void**)+0x1b) [0xb6ccc76
0]
10: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x6d)
[0xb6ccc7de]
11: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, unsigned
int)+0x51) [0xb6c
cea56]
12: (malloc()+0x22d) [0xb6cd9a8e]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this
.

--- begin dump of recent events ---
  -90> 2019-05-20 15:34:43.053293 b6714230  5 asok(0x55b5320)
register_command perfcounter
s_dump hook 0x554c088
  -89> 2019-05-20 15:34:43.053322 b6714230  5 asok(0x55b5320)
register_command 1 hook 0x55
4c088
  -88> 2019-05-20 15:34:43.053330 b6714230  5 asok(0x55b5320)
register_command perf dump h
ook 0x554c088
  -87> 2019-05-20 15:34:43.053341 b6714230  5 asok(0x55b5320)
register_command perfcounter
s_schema hook 0x554c088
  -86> 2019-05-20 15:34:43.053360 b6714230  5 asok(0x55b5320)
register_command perf histog
ram dump hook 0x554c088
  -85> 2019-05-20 15:34:43.053374 b6714230  5 asok(0x55b5320)
register_command 2 hook 0x55
4c088
  -84> 2019-05-20 15:34:43.053381 b6714230  5 asok(0x55b5320)
register_command perf schema
hook 0x554c088
  -83> 2019-05-20 15:34:43.053389 b6714230  5 asok(0x55b5320)
register_command perf histog
ram schema hook 0x554c088
  -82> 2019-05-20 15:34:43.053410 b6714230  5 asok(0x55b5320)
register_command perf reset
hook 0x554c088
  -81> 2019-05-20 15:34:43.053418 b6714230  5 asok(0x55b5320)
register_command config show
hook 0x554c088
  -80> 2019-05-20 15:34:43.053425 b6714230  5 asok(0x55b5320)
register_command config help
hook 0x554c088
  -79> 2019-05-20 15:34:43.053436 b6714230  5 asok(0x55b5320)
register_command config set
hook 0x554c088
  -78> 2019-05-20 15:34:43.053444 b6714230  5 asok(0x55b5320)
register_command config get
hook 0x554c088
  -77> 2019-05-20 15:34:43.053459 b6714230  5 asok(0x55b5320)
register_command config diff
hook 0x554c088
  -76> 2019-05-20 15:34:43.053467 b6714230  5 asok(0x55b5320)
register_command config diff
get hook 0x554c088
  -75> 2019-05-20 15:34:43.053475 b6714230  5 asok(0x55b5320)
register_command log flush h
ook 0x554c088
  -74> 2019-05-20 15:34:43.053482 b6714230  5 asok(0x55b5320)
register_command log dump ho
ok 0x554c088
  -73> 2019-05-20 15:34:43.053490 b6714230  5 asok(0x55b5320)
register_command log reopen
hook 0x554c088
  -72> 2019-05-20 15:34:43.053513 b6714230  5 asok(0x55b5320)
register_command dump_mempoo
ls hook 0x56e3504
  -71> 2019-05-20 15:34:43.070424 b6714230  0 set uid:gid to 64045:64045
(ceph:ceph)
  -70> 2019-05-20 15:34:43.070455 b6714230  0 ceph version 12.2.11
(26dc3775efc7bb286a1d6d
66faee0ba30ea23eee) luminous (stable), process ceph-mgr, pid 1169
  -69> 2019-05-20 15:34:43.070799 b6714230  0 pidfile_write: ignore empty
--pid-file
  -68> 2019-05-20 15:34:43.074441 b6714230  5 asok(0x55b5320) init
/var/run/ceph/ceph-mgr.
odroid-c.asok
  -67> 2019-05-20 15:34:43.074473 b6714230  5 asok(0x55b5320)
bind_and_listen /var/run/cep
h/ceph-mgr.odroid-c.asok
  -66> 2019-05-20 15:34:43.074615 b6714230  5 asok(0x55b5320)
register_command 0 hook 0x55
4c1d0
  -65> 2019-05-20 15:34:43.074633 b6714230  5 asok(0x55b5320)
register_command version hoo
k 0x554c1d0
  -64> 2019-05-20 15:34:43.074654 b6714230  5 asok(0x55b5320)
register_command git_version
hook 0x554c1d0
  -63> 2019-05-20 15:34:43.074674 b6714230  5 asok(0x55b5320)
register_command help hook 0
x554c1d8
  -62> 2019-05-20 15:34:43.074694 b6714230  5 asok(0x55b5320)
register_command get_command
_descriptions hook 0x554c1e0
  -61> 2019-05-20 15:34:43.074785 b3effc30  5 asok(0x55b5320) entry start
  -60> 2019-05-20 15:34:43.076464 b36fec30  2 Event(0x554e068 nevent=5000
time_id=1).set_o
wner idx=0 owner=3010456624
  -59> 2019-05-20 15:34:43.076559 b2efdc30  2 Event(0x554e488 nevent=5000
time_id=1).set_o
wner idx=1 owner=3002063920
 

Re: [ceph-users] Could someone can help me to solve this problem about ceph-STS(secure token session)

2019-05-20 Thread Pritha Srivastava
Hello Yuan,

While creating the role, can you try setting the Principal to the user you
want the role to be assumed by, and the Action to - sts:AssumeRole, like
below:

policy_document =
"{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":[\"arn:aws:iam:::user/TESTER1\"]},\"Action\":[\"sts:AssumeRole\"]}]}"

Also, can you search for 'AssumeRole' in radosgw logs, and attach the
snippet here.
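For example, creating the role with that trust policy would look roughly like this (a sketch using radosgw-admin; the role name is the one from this thread and the path is a placeholder):

radosgw-admin role create --role-name=AccessRole1 --path=/ \
    --assume-role-policy-doc='{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":["arn:aws:iam:::user/TESTER1"]},"Action":["sts:AssumeRole"]}]}'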

Thanks,
Pritha

On Mon, May 20, 2019 at 2:36 PM Yuan Minghui  wrote:

>
>
> Hello everyone:
>
>When I use the method :” assume_role”, like this:
>
> sts_client = boto3.client('sts',
> aws_access_key_id=access_key,
> aws_secret_access_key=secret_key,
> endpoint_url=host,
> )
> response = sts_client.assume_role(RoleArn='arn:aws:iam:::role/AccessRole1', 
> RoleSessionName="ymh_bucketAccess")
>
>
>
> I created a role in the terminal:
>
>
>
> [inline image: role creation in the terminal]
>
> It returned:
>
>
>
> Traceback (most recent call last):
>
>   File "/Users/yuanminghui/PycharmProjects/myproject1/10-sts-demo.py",
> line 64, in test1
>
> response =
> sts_client.assume_role(RoleArn='arn:aws:iam:::role/AccessRole1',
> RoleSessionName="ymh_bucketAccess")
>
>   File
> "/Users/yuanminghui/PycharmProjects/myproject1/venv/lib/python3.7/site-packages/botocore/client.py",
> line 357, in _api_call
>
> return self._make_api_call(operation_name, kwargs)
>
>   File
> "/Users/yuanminghui/PycharmProjects/myproject1/venv/lib/python3.7/site-packages/botocore/client.py",
> line 661, in _make_api_call
>
> raise error_class(parsed_response, operation_name)
>
> botocore.exceptions.ClientError: An error occurred (Unknown) when calling
> the AssumeRole operation: Unknown
>
>
>
>
>
> I really do not know what's wrong with this. Can someone help?
> Thanks a lot.
>
> best wishes!
>
>
>
>


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-20 Thread Frank Schilder
Dear Yan,

thank you for taking care of this. I removed all snapshots and stopped snapshot 
creation.

Please keep me posted.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 20 May 2019 13:34:07
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Sat, May 18, 2019 at 5:47 PM Frank Schilder  wrote:
>
> Dear Yan and Stefan,
>
> it happened again and there were only very few ops in the queue. I pulled the 
> ops list and the cache. Please find a zip file here: 
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l" . It's a bit 
> more than 100 MB.
>

The MDS cache dump shows there is a snapshot involved. Please avoid using
snapshots until we fix the bug.

Regards
Yan, Zheng

> The active MDS failed over to the standby after or during the dump cache 
> operation. Is this expected? As a result, the cluster is healthy and I can't 
> do further diagnostics. In case you need more information, we have to wait 
> until next time.
>
> Some further observations:
>
> There was no load on the system. I start suspecting that this is not a 
> load-induced event. It is also not cause by excessive atime updates, the FS 
> is mounted with relatime. Could it have to do with the large level-2 network 
> (ca. 550 client servers in the same broadcast domain)? I include our kernel 
> tuning profile below, just in case. The cluster networks (back and front) are 
> isolated VLANs, no gateways, no routing.
>
> We run rolling snapshots on the file system. I didn't observe any problems 
> with this, but am wondering if this might be related. We have currently 30 
> snapshots in total. Here is the output of status and pool ls:
>
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_WARN
> 1 MDSs report slow requests
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.35 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 750 active+clean
>
> [root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.33 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 749 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> [root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
> pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
> application rbd
> removed_snaps [1~5]
> pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 1759 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 
> stripe_width 24576 compression_mode aggressive application rbd
> removed_snaps [1~3]
> pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 
> object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 
> compression_mode aggressive application rbd
> removed_snaps [1~7]
> pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete 
> max_bytes 1099511627776 stripe_width 0 application cephfs
> pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 2561 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 
> stripe_width 32768 compression_mode aggressive application cephfs
> removed_snaps 
> [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]
>
> The relevant pools are con-fs-meta and con-fs-data.
>
> Best regards,
> Frank
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> [root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf
> [main]
> summary=Settings for ceph cluster. Derived from throughput-performance.
> include=throughput-performance
>
> [vm]
> 

Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-05-20 Thread mr. non non
Hi Manuel,

I use version 12.2.8 with bluestore and also use manually index sharding 
(configured to 100).  As I checked, no buckets reach 100k of objects_per_shard.
here are health status and cluster log

# ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'default.rgw.log'
Search the cluster log for 'Large omap object found' for more details.

# cat ceph.log | tail -2
2019-05-19 17:49:36.306481 mon.MONNODE1 mon.0 10.118.191.231:6789/0 528758 : 
cluster [WRN] Health check failed: 1 large omap objects (LARGE_OMAP_OBJECTS)
2019-05-19 17:49:34.535543 osd.38 osd.38 MONNODE1_IP:6808/3514427 12 : cluster 
[WRN] Large omap object found. Object: 4:b172cd59:usage::usage.26:head Key 
count: 8720830 Size (bytes): 1647024346

All objects size are 0.
$  for i in `rados ls -p default.rgw.log`; do rados stat -p default.rgw.log 
${i};done  | more
default.rgw.log/obj_delete_at_hint.78 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/meta.history mtime 2019-05-20 19:19:40.00, size 50
default.rgw.log/obj_delete_at_hint.70 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.000104 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.26 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.28 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.40 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.15 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.69 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.95 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.03 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.47 mtime 2019-05-20 19:31:45.00, 
size 0
default.rgw.log/obj_delete_at_hint.35 mtime 2019-05-20 19:31:45.00, 
size 0


Please kindly advise how to remove health_warn message.
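
The only approach I have found so far would be to trim the RGW usage log
(the usage.* objects in default.rgw.log belong to it) and then deep-scrub
the PG that holds the large object, roughly like this (untested, the date is
an example):

# check what is in the usage log before throwing anything away
radosgw-admin usage show | head
# trim usage entries older than a given date
radosgw-admin usage trim --end-date=2019-04-30
# find the PG that holds usage.26 and deep-scrub it to clear the warning
ceph osd map default.rgw.log usage.26
ceph pg deep-scrub <pgid from the previous command>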

Many thanks.
Arnondh


From: EDH - Manuel Rios Fernandez 
Sent: Monday, May 20, 2019 5:41 PM
To: 'mr. non non'; ceph-users@lists.ceph.com
Subject: RE: [ceph-users] Large OMAP Objects in default.rgw.log pool


Hi Arnondh,



Whats your ceph version?



Regards





De: ceph-users  En nombre de mr. non non
Enviado el: lunes, 20 de mayo de 2019 12:39
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Large OMAP Objects in default.rgw.log pool



Hi,



I found the same issue like above.

Does anyone know how to fix it?



Thanks.

Arnondh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests from bluestore osds / crashing rbd-nbd

2019-05-20 Thread Marc Schöchlin
Hello cephers,

we have a few systems which utilize an rbd-nbd map/mount to get access to an
rbd volume.
(This problem seems to be related to "[ceph-users] Slow requests from bluestore 
osds" (the original thread))

Unfortunately the rbd-nbd device of one system crashed three Mondays in a row
at ~00:00, when the systemd fstrim timer executed "fstrim -av" (which runs in
parallel with deep scrub operations).
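
A possible mitigation (sketch; the new schedule is just an example) would be
to move the fstrim run away from the deep-scrub window with a systemd drop-in:

mkdir -p /etc/systemd/system/fstrim.timer.d
cat > /etc/systemd/system/fstrim.timer.d/override.conf <<'EOF'
[Timer]
OnCalendar=
OnCalendar=Wed 12:00
EOF
systemctl daemon-reload
systemctl restart fstrim.timer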

After that, the device constantly reports I/O errors whenever the filesystem
is accessed.
Unmounting, remapping and mounting helped to get the filesystem/device back 
into business :-)
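
For reference, the recovery boils down to this sequence (device, image and
mountpoint names are examples):

umount /mnt/nfs-data            # or umount -l if it hangs on the dead device
rbd-nbd unmap /dev/nbd0
rbd-nbd map rbd/int-nfs-volume
mount /dev/nbd0 /mnt/nfs-data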

Manual 30-minute stress tests using the following fio command did not produce
any problems on the client side
(the Ceph storage reported some slow requests while testing).

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test 
--filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw 
--rwmixread=50 --numjobs=50 --loops=10

It seems that others also experienced this problem: 
https://ceph-users.ceph.narkive.com/2FIfyx1U/rbd-nbd-timeout-and-crash
The change for setting device timeouts does not seem to have been merged into
luminous. Experiments setting the timeout manually after mapping, using
https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c, haven't
changed the situation.

Do you have suggestions how to analyze/solve the situation?

Regards
Marc
--



The client kernel throws messages like this:

May 19 23:59:01 int-nfs-001 CRON[836295]: (root) CMD (command -v debian-sa1 > 
/dev/null && debian-sa1 60 2)
May 20 00:00:30 int-nfs-001 systemd[1]: Starting Discard unused blocks...
May 20 00:01:02 int-nfs-001 kernel: [1077851.623582] block nbd0: Connection 
timed out
May 20 00:01:02 int-nfs-001 kernel: [1077851.623613] block nbd0: shutting down 
sockets
May 20 00:01:02 int-nfs-001 kernel: [1077851.623617] print_req_error: I/O 
error, dev nbd0, sector 84082280
May 20 00:01:02 int-nfs-001 kernel: [1077851.623632] block nbd0: Connection 
timed out
May 20 00:01:02 int-nfs-001 kernel: [1077851.623636] print_req_error: I/O 
error, dev nbd0, sector 92470887
May 20 00:01:02 int-nfs-001 kernel: [1077851.623642] block nbd0: Connection 
timed out

Ceph throws messages like this:

2019-05-20 00:00:00.000124 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173572 : 
cluster [INF] overall HEALTH_OK
2019-05-20 00:00:54.249998 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173586 : 
cluster [WRN] Health check failed: 644 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:00.330566 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173587 : 
cluster [WRN] Health check update: 594 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:09.768476 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173591 : 
cluster [WRN] Health check update: 505 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:14.768769 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173592 : 
cluster [WRN] Health check update: 497 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:20.610398 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173593 : 
cluster [WRN] Health check update: 509 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:28.721891 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173594 : 
cluster [WRN] Health check update: 501 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:34.909842 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173596 : 
cluster [WRN] Health check update: 494 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:44.770330 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173597 : 
cluster [WRN] Health check update: 500 slow requests are blocked > 32 sec. 
Implicated osds 51 (REQUEST_SLOW)
2019-05-20 00:01:49.770625 mon.ceph-mon-s43 mon.0 10.23.27.153:6789/0 173599 : 
cluster [WRN] Health check update: 608 slow requests are blocked > 32 

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-20 Thread Yan, Zheng
On Sat, May 18, 2019 at 5:47 PM Frank Schilder  wrote:
>
> Dear Yan and Stefan,
>
> it happened again and there were only very few ops in the queue. I pulled the 
> ops list and the cache. Please find a zip file here: 
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l; . Its a bit 
> more than 100MB.
>

MDS cache dump shows there is a snapshot-related issue. Please avoid using
snapshots until we fix the bug.
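
In the meantime, new snapshots can be disabled cluster-wide with something
like this (the fs name is a placeholder):

# disallow creation of new cephfs snapshots until the fix is available
ceph fs set <fs_name> allow_new_snaps false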

Regards
Yan, Zheng

> The active MDS failed over to the standby after or during the dump cache 
> operation. Is this expected? As a result, the cluster is healthy and I can't 
> do further diagnostics. In case you need more information, we have to wait 
> until next time.
>
> Some further observations:
>
> There was no load on the system. I start suspecting that this is not a 
> load-induced event. It is also not caused by excessive atime updates, the FS 
> is mounted with relatime. Could it have to do with the large level-2 network 
> (ca. 550 client servers in the same broadcast domain)? I include our kernel 
> tuning profile below, just in case. The cluster networks (back and front) are 
> isolated VLANs, no gateways, no routing.
>
> We run rolling snapshots on the file system. I didn't observe any problems 
> with this, but am wondering if this might be related. We have currently 30 
> snapshots in total. Here is the output of status and pool ls:
>
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_WARN
> 1 MDSs report slow requests
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.35 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 750 active+clean
>
> [root@ceph-01 ~]# ceph status # after cache dump and the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-12=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.33 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 749 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   6.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> [root@ceph-01 ~]# ceph osd pool ls detail # after the MDS failed over
> pool 1 'sr-rbd-meta-one' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 80 pgp_num 80 last_change 486 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 536870912000 stripe_width 0 
> application rbd
> removed_snaps [1~5]
> pool 2 'sr-rbd-data-one' erasure size 8 min_size 6 crush_rule 5 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 1759 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 274877906944000 
> stripe_width 24576 compression_mode aggressive application rbd
> removed_snaps [1~3]
> pool 3 'sr-rbd-one-stretch' replicated size 4 min_size 2 crush_rule 2 
> object_hash rjenkins pg_num 20 pgp_num 20 last_change 500 flags 
> hashpspool,nodelete,selfmanaged_snaps max_bytes 5497558138880 stripe_width 0 
> compression_mode aggressive application rbd
> removed_snaps [1~7]
> pool 4 'con-fs-meta' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 50 pgp_num 50 last_change 428 flags hashpspool,nodelete 
> max_bytes 1099511627776 stripe_width 0 application cephfs
> pool 5 'con-fs-data' erasure size 10 min_size 8 crush_rule 6 object_hash 
> rjenkins pg_num 300 pgp_num 300 last_change 2561 flags 
> hashpspool,ec_overwrites,nodelete,selfmanaged_snaps max_bytes 21990232200 
> stripe_width 32768 compression_mode aggressive application cephfs
> removed_snaps 
> [2~3d,41~2a,6d~2a,99~c,a6~1e,c6~18,df~3,e3~1,e5~3,e9~1,eb~3,ef~1,f1~1,f3~1,f5~3,f9~1,fb~3,ff~1,101~1,103~1,105~1,107~1,109~1,10b~1,10d~1,10f~1,111~1]
>
> The relevant pools are con-fs-meta and con-fs-data.
>
> Best regards,
> Frank
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> [root@ceph-08 ~]# cat /etc/tuned/ceph/tuned.conf
> [main]
> summary=Settings for ceph cluster. Derived from throughput-performance.
> include=throughput-performance
>
> [vm]
> transparent_hugepages=never
>
> [sysctl]
> # See also:
> # - https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
> # - https://www.kernel.org/doc/Documentation/sysctl/net.txt
> # - https://cromwell-intl.com/open-source/performance-tuning/tcp.html
> # - https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/
> # - https://www.spinics.net/lists/ceph-devel/msg21721.html
>
> # Set available PIDs and open files to maximum possible.
> 

Re: [ceph-users] inconsistent number of pools

2019-05-20 Thread Lars Täuber
Mon, 20 May 2019 10:52:14 +
Eugen Block  ==> ceph-users@lists.ceph.com :
> Hi, have you tried 'ceph health detail'?
> 

No I hadn't. Thanks for the hint.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-20 Thread Eugen Block

Hi, have you tried 'ceph health detail'?
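
It should name the offending pool directly. If it does not, a rough way to
spot the skew by hand (the warning threshold is mon_pg_warn_max_object_skew,
default 10):

ceph health detail
ceph df                    # objects per pool
ceph osd pool ls detail    # pg_num per pool
# compare objects/pg_num per pool against the cluster-wide average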


Quoting Lars Täuber:


Hi everybody,

with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.

# ceph -s
  cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WARN
1 pools have many more objects per pg than average

  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 4h)
mgr: mon1(active, since 4h), standbys: cephsible, mon2, mon3
mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
osd: 30 osds: 30 up (since 2h), 30 in (since 7w)

  data:
pools:   5 pools, 1029 pgs
objects: 315.51k objects, 728 GiB
usage:   4.6 TiB used, 163 TiB / 167 TiB avail
pgs: 1029 active+clean


!!! but:
# ceph osd lspools | wc -l
3

The status says there are 5 pools but the listing says there are only 3.
How do I find out which pool is the reason for the health warning?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inconsistent number of pools

2019-05-20 Thread Lars Täuber
Hi everybody,

with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.

# ceph -s
  cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WARN
1 pools have many more objects per pg than average
 
  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 4h)
mgr: mon1(active, since 4h), standbys: cephsible, mon2, mon3
mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
osd: 30 osds: 30 up (since 2h), 30 in (since 7w)
 
  data:
pools:   5 pools, 1029 pgs
objects: 315.51k objects, 728 GiB
usage:   4.6 TiB used, 163 TiB / 167 TiB avail
pgs: 1029 active+clean
 

!!! but:
# ceph osd lspools | wc -l
3

The status says there are 5 pools but the listing says there are only 3.
How do I find out which pool is the reason for the health warning?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs causing high load on vm, taking down 15 min later another cephfs vm

2019-05-20 Thread Marc Roos



I got my first problem with cephfs in a production environment. Is it 
possible to deduce from these logfiles what happened?

svr1 is connected to the ceph client network via a switch.
The svr2 vm is colocated on the c01 node.
c01 has OSDs and mon.a colocated.

svr1 was the first to report errors, at 03:38:44. I have no error messages
reporting a network connection problem from any of the ceph nodes, and there
is nothing in dmesg on c01.

[@c01 ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[@c01 ~]# uname -a
Linux c01 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
[@c01 ~]# ceph versions
{
"mon": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 32
},
"mds": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 2
},
"rgw": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) 
luminous (stable)": 42
}
}




[0] svr1 messages 
May 20 03:36:01 svr1 systemd: Started Session 308978 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308979 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:36:01 svr1 systemd: Started Session 308980 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308981 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308982 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:01 svr1 systemd: Started Session 308983 of user root.
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:44 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: last message repeated 5 times
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: last message repeated 5 times
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon1 192.168.x.112:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: osd0 192.168.x.111:6814 io error
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon2 192.168.x.113:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
established
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 io error
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
lost, hunting for new mon
May 20 03:38:45 svr1 kernel: libceph: mon0 192.168.x.111:6789 session 
established


[1] svr2 messages 
May 20 03:40:01 svr2 systemd: Stopping User Slice of root.
May 20 03:40:01 svr2 systemd: Removed slice User Slice of root.
May 20 03:40:01 

Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-05-20 Thread EDH - Manuel Rios Fernandez
Hi Arnondh,

 

Whats your ceph version?

 

Regards

 

 

De: ceph-users  En nombre de mr. non non
Enviado el: lunes, 20 de mayo de 2019 12:39
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Large OMAP Objects in default.rgw.log pool

 

Hi,

 

I found the same issue like above. 

Does anyone know how to fix it?

 

Thanks.

Arnondh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Large OMAP Objects in default.rgw.log pool

2019-05-20 Thread mr. non non
Hi,

I found the same issue like above.
Does anyone know how to fix it?

Thanks.
Arnondh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-20 Thread Li Wang
Hi John,

Thanks for your reply. We have also restarted the server to get rid of it.

Hi All,

Does anybody know a better solution than restarting the server? Since we use 
radosgw in production, we cannot afford to restart the service on a daily basis.
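
As a first diagnostic (rough sketch), it would help to see how many of those
sockets are really established and which peers they go to:

# distribution of socket states
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
# peers of the established radosgw sockets, most frequent first
ss -tanp | grep radosgw | grep ESTAB | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head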

Regards,
Li Wang


> On 20 May 2019, at 2:48 PM, John Hearns  wrote:
> 
> I found similar behaviour on a Nautilus cluster on Friday. Around 300 000 
> open connections which I think were the result of a benchmarking run which 
> was terminated. I restarted the radosgw service to get rid of them.
> 
> On Mon, 20 May 2019 at 06:56, Li Wang  > wrote:
> Dear ceph community members,
> 
> We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However, we 
> observed over 70 millions active TCP connections on the radosgw host, which 
> makes the radosgw very unstable. 
> 
> After further investigation, we found most of the TCP connections on the 
> radosgw are connected to OSDs.
> 
> May I ask what might be the possible reason causing the massive amount of 
> TCP connections? And is there any configuration or tuning work that I can 
> do to solve this issue?
> 
> Any suggestion is highly appreciated.
> 
> Regards,
> Li Wang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Rainer Krienke
Am 20.05.19 um 09:06 schrieb Jason Dillaman:

>> $ rbd --namespace=testnamespace map rbd/rbdtestns --name client.rainer
>> --keyring=/etc/ceph/ceph.keyring
>> rbd: sysfs write failed
>> rbd: error opening image rbdtestns: (1) Operation not permitted
>> In some cases useful info is found in syslog - try "dmesg | tail".
>> 2019-05-20 08:18:29.187 7f42ab7fe700 -1 librbd::image::RefreshRequest:
>> failed to retrieve pool metadata: (1) Operation not permitted
>> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::image::OpenRequest:
>> failed to refresh image: (1) Operation not permitted
>> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::ImageState:
>> 0x561792408860 failed to open image: (1) Operation not permitted
>> rbd: map failed: (22) Invalid argument
> 
> Hmm, it looks like we overlooked updating the 'rbd' profile when PR
> 27423 [1] was merged into v14.2.1. We'll get that fixed, but in the
> meantime, you can add a "class rbd metadata_list" cap on the base pool
> (w/o the namespace restriction) [2].
> 

Thanks for your answer. Well I still have Kernel 4.15 so namespaces
won't work for me at the moment.

Could you please explain what the magic behind "class rbd metadata_list"
is? Is it meant to "simply" allow access to the base pool (rbd in my
case), so that I authorize access to the pool instead of a namespace? And if
this is true, then I do not understand the difference between your class cap
and a cap like osd 'allow rw pool=rbd'.

-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Could someone can help me to solve this problem about ceph-STS(secure token session)

2019-05-20 Thread Yuan Minghui
 

Hello everyone:

   When I use the method "assume_role", like this:
sts_client = boto3.client('sts',
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
endpoint_url=host,
)
response = sts_client.assume_role(RoleArn='arn:aws:iam:::role/AccessRole1', 
RoleSessionName="ymh_bucketAccess")
 

I create a role in terminal:

 

I return that :

 

Traceback (most recent call last):

  File "/Users/yuanminghui/PycharmProjects/myproject1/10-sts-demo.py", line 64, 
in test1

response = sts_client.assume_role(RoleArn='arn:aws:iam:::role/AccessRole1', 
RoleSessionName="ymh_bucketAccess")

  File 
"/Users/yuanminghui/PycharmProjects/myproject1/venv/lib/python3.7/site-packages/botocore/client.py",
 line 357, in _api_call

return self._make_api_call(operation_name, kwargs)

  File 
"/Users/yuanminghui/PycharmProjects/myproject1/venv/lib/python3.7/site-packages/botocore/client.py",
 line 661, in _make_api_call

raise error_class(parsed_response, operation_name)

botocore.exceptions.ClientError: An error occurred (Unknown) when calling the 
AssumeRole operation: Unknown

 

 

I really do not know what's wrong with this. Can someone help? Thanks a 
lot.

best wishes!

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor Crash while adding OSD (Luminous)

2019-05-20 Thread Henry Spanka
Hi,
I recently upgraded my cluster to Luminous v12.2.11. While adding a new OSD the 
active monitor crashes (attempt to free invalid pointer). The other mons are 
still running but the OSD is stuck in new state. Attempting to restart the OSD 
process will crash the monitor again. Can anybody look into this?

Crash Log: https://pastebin.com/pMpth7dV
Binary: 
http://mirror.centos.org/centos/7/storage/x86_64/ceph-luminous/ceph-12.2.11-0.el7.x86_64.rpm

OSD Tree: https://pastebin.com/RZQX2zAz

I think it crashes at this point: 
https://github.com/ceph/ceph/blob/26dc3775efc7bb286a1d6d66faee0ba30ea23eee/src/crush/CrushWrapper.cc#L463

The OSD is added on a new node (not in the crush map yet). Could that be a problem?
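
A possible workaround (untested; "newhost" is a placeholder) would be to
create the destination host bucket in the crush map before the OSD tries to
register itself:

ceph osd crush add-bucket newhost host
ceph osd crush move newhost root=default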

Best regards,
Henry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-20 Thread Kevin Flöh

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining 
shards of the two pgs but we are only left with two shards (of 
reasonable size) per pg. The rest of the shards displayed by ceph pg 
query are empty. I guess marking the OSD as complete doesn't make sense 
then.


Best,
Kevin

On 17.05.19 at 2:36 PM, Frédéric Nass wrote:



On 14/05/2019 at 10:04, Kevin Flöh wrote:


On 13.05.19 at 11:21 PM, Dan van der Ster wrote:
Presumably the 2 OSDs you marked as lost were hosting those 
incomplete PGs?

It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore_history_les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, at 
least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.


Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on 
them, they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on a 
healthy OSD) seems to be the only way to recover your lost data, as 
with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity 
when you actually need 3 to access it. Reducing min_size to 3 will not 
help.
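
The rough shape of that export/import is (sketch only; OSD ids and the
pgid/shard are examples, and each OSD must be stopped while
ceph-objectstore-tool runs on it):

systemctl stop ceph-osd@4
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
    --pgid 1.24cs2 --op export --file /root/pg1.24cs2.export
systemctl stop ceph-osd@<healthy-id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<healthy-id> \
    --op import --file /root/pg1.24cs2.export
systemctl start ceph-osd@<healthy-id>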


Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html 



This is probably the best way for you to follow from now on.

Regards,
Frédéric.



If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan


would this let us recover at least some of the data on the pgs? If 
not we would just set up a new ceph directly without fixing the old 
one and copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me
first show you the current ceph status:

    cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  1 MDSs report slow metadata IOs
  1 MDSs report slow requests
  1 MDSs behind on trimming
  1/126319678 objects unfound (0.000%)
  19 scrub errors
  Reduced data availability: 2 pgs inactive, 2 pgs 
incomplete

  Possible data damage: 7 pgs inconsistent
  Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
  118 stuck requests are blocked > 4096 sec. Implicated 
osds

24,32,91

    services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

    data:
  pools:   2 pools, 4096 pgs
  objects: 126.32M objects, 260TiB
  usage:   372TiB used, 152TiB / 524TiB avail
  pgs: 0.049% pgs not active
   1/500333881 objects degraded (0.000%)
   1/126319678 objects unfound (0.000%)
   4076 active+clean
   10   active+clean+scrubbing+deep
   7    active+clean+inconsistent
   2    incomplete
   1    active+recovery_wait+degraded

    io:
  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow 
requests;

1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are 
blocked > 30 sec

MDS_TRIM 1 MDSs behind on trimming
  mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming 
(46034/128)

max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
  pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced 

Re: [ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Jason Dillaman
On Mon, May 20, 2019 at 9:08 AM Rainer Krienke  wrote:
>
> Hello,
>
> just saw this message on the client when trying and failing to map the
> rbd image:
>
> May 20 08:59:42 client kernel: libceph: bad option at
> '_pool_ns=testnamespace'

You will need kernel v4.19 (or later) I believe to utilize RBD
namespaces via krbd [1].

> Rainer
>
> On 20.05.19 at 08:56, Rainer Krienke wrote:
> > Hello,
> >
> > on a ceph Nautilus cluster (14.2.1) running on Ubuntu 18.04 I try to set
> > up rbd images with namespaces in order to allow different clients to
> > access only their "own" rbd images in different namespaces in just one
> > pool. The rbd image data are in an erasure encoded pool named "ecpool"
> > and the metadata in the default "rbd" pool.
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] 
https://github.com/torvalds/linux/commit/b26c047b940003295d3896b7f633a66aab95bebd

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Rainer Krienke
Hello,

just saw this message on the client when trying and failing to map the
rbd image:

May 20 08:59:42 client kernel: libceph: bad option at
'_pool_ns=testnamespace'

Rainer

On 20.05.19 at 08:56, Rainer Krienke wrote:
> Hello,
> 
> on a ceph Nautilus cluster (14.2.1) running on Ubuntu 18.04 I try to set
> up rbd images with namespaces in order to allow different clients to
> access only their "own" rbd images in different namespaces in just one
> pool. The rbd image data are in an erasure encoded pool named "ecpool"
> and the metadata in the default "rbd" pool.
-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Jason Dillaman
On Mon, May 20, 2019 at 8:56 AM Rainer Krienke  wrote:
>
> Hello,
>
> on a ceph Nautilus cluster (14.2.1) running on Ubuntu 18.04 I try to set
> up rbd images with namespaces in order to allow different clients to
> access only their "own" rbd images in different namespaces in just one
> pool. The rbd image data are in an erasure encoded pool named "ecpool"
> and the metadata in the default "rbd" pool.
>
> With this setup I am experiencing trouble when I try to access a rbd
> image in a namespace from a (OpenSuSE Leap 15.0 with Ceph 14.2.1) client
> and I do not understand what I am doing wrong. Hope someone can see the
> problem and give me a hint:
>
> # On one of the the ceph servers
>
> $ rbd namespace create --namespace testnamespace
> $ rbd namespace ls
> NAME
> testnamespace
>
> $ ceph auth caps client.rainer mon 'profile rbd' osd 'profile rbd
> pool=rbd namespace=testnamespace'
>
> $ ceph auth get client.rainer
> [client.rainer]
> key = AQCcVt5cHC+WJhBBoRPKhErEYzxGuU8U/GA0xA++
> caps mon = "profile rbd"
> caps osd = "profile rbd pool=rbd namespace=testnamespace"
>
> $ rbd create rbd/rbdtestns --namespace testnamespace --size 50G
> --data-pool=rbd-ecpool
>
> $ rbd --namespace testnamespace ls -l
> NAME  SIZE   PARENT FMT PROT LOCK
> rbdtestns 50 GiB  2
>
> On the openSuSE Client:
>
> $ rbd --namespace=testnamespace map rbd/rbdtestns --name client.rainer
> --keyring=/etc/ceph/ceph.keyring
> rbd: sysfs write failed
> rbd: error opening image rbdtestns: (1) Operation not permitted
> In some cases useful info is found in syslog - try "dmesg | tail".
> 2019-05-20 08:18:29.187 7f42ab7fe700 -1 librbd::image::RefreshRequest:
> failed to retrieve pool metadata: (1) Operation not permitted
> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::image::OpenRequest:
> failed to refresh image: (1) Operation not permitted
> 2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::ImageState:
> 0x561792408860 failed to open image: (1) Operation not permitted
> rbd: map failed: (22) Invalid argument

Hmm, it looks like we overlooked updating the 'rbd' profile when PR
27423 [1] was merged into v14.2.1. We'll get that fixed, but in the
meantime, you can add a "class rbd metadata_list" cap on the base pool
(w/o the namespace restriction) [2].
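
Something along these lines should work (untested sketch; the extra clause
only grants the rbd class method "metadata_list" on the base pool, not
general access to it):

ceph auth caps client.rainer \
    mon 'profile rbd' \
    osd 'profile rbd pool=rbd namespace=testnamespace, allow class rbd metadata_list pool=rbd'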

> Thanks for your help
> Rainer
> --
> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
> 56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
> Web: http://userpages.uni-koblenz.de/~krienke
> PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[1] https://github.com/ceph/ceph/pull/27423
[2] 
http://docs.ceph.com/docs/master/rados/operations/user-management/#authorization-capabilities

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph nautilus namespaces for rbd and rbd image access problem

2019-05-20 Thread Rainer Krienke
Hello,

on a ceph Nautilus cluster (14.2.1) running on Ubuntu 18.04 I try to set
up rbd images with namespaces in order to allow different clients to
access only their "own" rbd images in different namespaces in just one
pool. The rbd image data are in an erasure encoded pool named "ecpool"
and the metadata in the default "rbd" pool.

With this setup I am experiencing trouble when I try to access a rbd
image in a namespace from a (OpenSuSE Leap 15.0 with Ceph 14.2.1) client
and I do not understand what I am doing wrong. Hope someone can see the
problem and give me a hint:

# On one of the the ceph servers

$ rbd namespace create --namespace testnamespace
$ rbd namespace ls
NAME
testnamespace

$ ceph auth caps client.rainer mon 'profile rbd' osd 'profile rbd
pool=rbd namespace=testnamespace'

$ ceph auth get client.rainer
[client.rainer]
key = AQCcVt5cHC+WJhBBoRPKhErEYzxGuU8U/GA0xA++
caps mon = "profile rbd"
caps osd = "profile rbd pool=rbd namespace=testnamespace"

$ rbd create rbd/rbdtestns --namespace testnamespace --size 50G
--data-pool=rbd-ecpool

$ rbd --namespace testnamespace ls -l
NAME  SIZE   PARENT FMT PROT LOCK
rbdtestns 50 GiB  2

On the openSuSE Client:

$ rbd --namespace=testnamespace map rbd/rbdtestns --name client.rainer
--keyring=/etc/ceph/ceph.keyring
rbd: sysfs write failed
rbd: error opening image rbdtestns: (1) Operation not permitted
In some cases useful info is found in syslog - try "dmesg | tail".
2019-05-20 08:18:29.187 7f42ab7fe700 -1 librbd::image::RefreshRequest:
failed to retrieve pool metadata: (1) Operation not permitted
2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::image::OpenRequest:
failed to refresh image: (1) Operation not permitted
2019-05-20 08:18:29.187 7f42aaffd700 -1 librbd::ImageState:
0x561792408860 failed to open image: (1) Operation not permitted
rbd: map failed: (22) Invalid argument

Thanks for your help
Rainer
-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Massive TCP connection on radosgw

2019-05-20 Thread John Hearns
I found similar behaviour on a Nautilus cluster on Friday. Around 300 000
open connections which I think were the result of a benchmarking run which
was terminated. I restarted the radosgw service to get rid of them.

On Mon, 20 May 2019 at 06:56, Li Wang  wrote:

> Dear ceph community members,
>
> We have a ceph cluster (mimic 13.2.4) with 7 nodes and 130+ OSDs. However,
> we observed over 70 millions active TCP connections on the radosgw host,
> which makes the radosgw very unstable.
>
> After further investigation, we found most of the TCP connections on the
> radosgw are connected to OSDs.
>
> May I ask what might be the possible reason causing the massive amount
> of TCP connections? And is there any configuration or tuning work that
> I can do to solve this issue?
>
> Any suggestion is highly appreciated.
>
> Regards,
> Li Wang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com