Hi Orit, We're running on jewel, version 10.2.7.
I've ran the bi-list with the debugging commands and this is the end of it: *2017-07-05 08:50:19.705673 7ff3bfefe700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.3 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> 2297 ==== osd_op_reply(2571 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7 ==== 145+0+385625 (3432176685 0 1775102993) 0x7ff3b00041a0 con 0x7ff4272f48f02017-07-05 08:50:19.724193 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> -- osd_op(client.5971646.0:2572 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f02017-07-05 08:50:19.767758 7ff3bfefe700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.3 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> 2298 ==== osd_op_reply(2572 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7 ==== 145+0+385625 (3432176685 0 2330398289) 0x7ff3b00041a0 con 0x7ff4272f48f02017-07-05 08:50:19.786309 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> -- osd_op(client.5971646.0:2573 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f02017-07-05 08:50:19.827960 7ff3bfefe700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.3 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> 2299 ==== osd_op_reply(2573 .dir.be-east.5582981.76.0 [call] v0'0 uv65572318 ondisk = 0) v7 ==== 145+0+385625 (3432176685 0 1724305540) 0x7ff3b00041a0 con 0x7ff4272f48f02017-07-05 08:50:19.846588 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> -- osd_op(client.5971646.0:2574 48.1b47291b .dir.be-east.5582981.76.0 [call rgw.bi_list] snapc 0=[] ack+read+known_if_redirected e31545) v7 -- ?+0 0x7ff427327400 con 0x7ff4272f48f02017-07-05 08:50:19.870830 7ff3bfefe700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.3 10.21.4.111:6810/3633200 <http://10.21.4.111:6810/3633200> 2300 ==== osd_op_reply(2574 .dir.be-east.5582981.76.0 [call] v0'0 uv0 ondisk = -4 ((4) Interrupted system call)) v7 ==== 145+0+0 (798610401 0 0) 0x7ff3b00041a0 con 0x7ff4272f48f0ERROR: bi_list(): (4) Interrupted system call2017-07-05 08:50:19.872489 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.112:6822/2795125 <http://10.21.4.112:6822/2795125> -- osd_op(client.5971646.0:2575 24.4322fa9f notify.0 [watch unwatch cookie 140686606221264] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427302b102017-07-05 08:50:19.878128 7ff3bf0f7700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.23 10.21.4.112:6822/2795125 <http://10.21.4.112:6822/2795125> 63 ==== osd_op_reply(2575 notify.0 [watch unwatch cookie 140686606221264] v31545'6808 uv6416 ondisk = 0) v7 ==== 128+0+0 (3462997515 0 0) 0x7ff3980014f0 con 0x7ff427302b102017-07-05 08:50:19.878221 7ff4250219c0 20 remove_watcher() i=02017-07-05 08:50:19.878229 7ff4250219c0 2 removed watcher, disabling cache2017-07-05 08:50:19.878278 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.113:6807/2176843 <http://10.21.4.113:6807/2176843> -- osd_op(client.5971646.0:2576 24.16dafda0 notify.1 [watch unwatch cookie 140686606235888] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427304ae02017-07-05 08:50:19.880843 7ff3beef5700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.27 10.21.4.113:6807/2176843 <http://10.21.4.113:6807/2176843> 63 ==== osd_op_reply(2576 notify.1 [watch unwatch cookie 140686606235888] v31545'6706 uv6304 ondisk = 0) v7 ==== 128+0+0 (4086455760 0 0) 0x7ff3900014f0 con 0x7ff427304ae02017-07-05 08:50:19.880910 7ff4250219c0 20 remove_watcher() i=12017-07-05 08:50:19.880940 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> -- osd_op(client.5971646.0:2577 24.88aa5c95 notify.2 [watch unwatch cookie 140686606250416] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff4273083d02017-07-05 08:50:19.886387 7ff3becf3700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.1 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> 94 ==== osd_op_reply(2577 notify.2 [watch unwatch cookie 140686606250416] v31545'10057 uv9497 ondisk = 0) v7 ==== 128+0+0 (2583541993 0 0) 0x7ff388001630 con 0x7ff4273083d02017-07-05 08:50:19.886476 7ff4250219c0 20 remove_watcher() i=22017-07-05 08:50:19.886513 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6814/3633408 <http://10.21.4.111:6814/3633408> -- osd_op(client.5971646.0:2578 24.f8c99aee notify.3 [watch unwatch cookie 140686606264944] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff42730bca02017-07-05 08:50:19.888815 7ff3beaf1700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.5 10.21.4.111:6814/3633408 <http://10.21.4.111:6814/3633408> 32 ==== osd_op_reply(2578 notify.3 [watch unwatch cookie 140686606264944] v31545'3419 uv3231 ondisk = 0) v7 ==== 128+0+0 (1994465853 0 0) 0x7ff380000a20 con 0x7ff42730bca02017-07-05 08:50:19.888893 7ff4250219c0 20 remove_watcher() i=32017-07-05 08:50:19.888940 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> -- osd_op(client.5971646.0:2579 24.a204812d notify.4 [watch unwatch cookie 140686606267200] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff4273083d02017-07-05 08:50:19.891441 7ff3becf3700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.1 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> 95 ==== osd_op_reply(2579 notify.4 [watch unwatch cookie 140686606267200] v31545'10058 uv9499 ondisk = 0) v7 ==== 128+0+0 (840319076 0 0) 0x7ff388001630 con 0x7ff4273083d02017-07-05 08:50:19.891511 7ff4250219c0 20 remove_watcher() i=42017-07-05 08:50:19.891545 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.112:6822/2795125 <http://10.21.4.112:6822/2795125> -- osd_op(client.5971646.0:2580 24.31099063 notify.5 [watch unwatch cookie 140686606269328] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427302b102017-07-05 08:50:19.893535 7ff3bf0f7700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.23 10.21.4.112:6822/2795125 <http://10.21.4.112:6822/2795125> 64 ==== osd_op_reply(2580 notify.5 [watch unwatch cookie 140686606269328] v31545'6809 uv6418 ondisk = 0) v7 ==== 128+0+0 (3042799304 0 0) 0x7ff3980014f0 con 0x7ff427302b102017-07-05 08:50:19.893592 7ff4250219c0 20 remove_watcher() i=52017-07-05 08:50:19.893624 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.113:6807/2176843 <http://10.21.4.113:6807/2176843> -- osd_op(client.5971646.0:2581 24.97c520d4 notify.6 [watch unwatch cookie 140686606271968] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff4272d5950 con 0x7ff427304ae02017-07-05 08:50:19.895393 7ff3beef5700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.27 10.21.4.113:6807/2176843 <http://10.21.4.113:6807/2176843> 64 ==== osd_op_reply(2581 notify.6 [watch unwatch cookie 140686606271968] v31545'6707 uv6306 ondisk = 0) v7 ==== 128+0+0 (1081188691 0 0) 0x7ff3900014f0 con 0x7ff427304ae02017-07-05 08:50:19.895454 7ff4250219c0 20 remove_watcher() i=62017-07-05 08:50:19.895485 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> --> 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> -- osd_op(client.5971646.0:2582 24.84ada7c9 notify.7 [watch unwatch cookie 140686606274544] snapc 0=[] ondisk+write+known_if_redirected e31545) v7 -- ?+0 0x7ff427300db0 con 0x7ff4273083d02017-07-05 08:50:19.897426 7ff3becf3700 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> <== osd.1 10.21.4.111:6802/3632911 <http://10.21.4.111:6802/3632911> 96 ==== osd_op_reply(2582 notify.7 [watch unwatch cookie 140686606274544] v31545'10059 uv9501 ondisk = 0) v7 ==== 128+0+0 (4260323257 0 0) 0x7ff388001630 con 0x7ff4273083d02017-07-05 08:50:19.897492 7ff4250219c0 20 remove_watcher() i=72017-07-05 08:50:19.898051 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4273083d0 -- 0x7ff427308c502017-07-05 08:50:19.898108 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272f48f0 -- 0x7ff4272f53702017-07-05 08:50:19.898150 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff42730bca0 -- 0x7ff42730c5102017-07-05 08:50:19.898203 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272eb9f0 -- 0x7ff4272ea7302017-07-05 08:50:19.898257 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff42732aee0 -- 0x7ff427329c202017-07-05 08:50:19.898296 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272e8480 -- 0x7ff4272e71c02017-07-05 08:50:19.898343 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272fd830 -- 0x7ff4272fc5702017-07-05 08:50:19.898469 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff427302b10 -- 0x7ff4273018002017-07-05 08:50:19.898518 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272efa00 -- 0x7ff4272ee6f02017-07-05 08:50:19.898642 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272f1120 -- 0x7ff4272f20f02017-07-05 08:50:19.898690 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff427304ae0 -- 0x7ff4273053902017-07-05 08:50:19.898739 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272f9cb0 -- 0x7ff4272f89f02017-07-05 08:50:19.898780 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272e0600 -- 0x7ff4272df2f02017-07-05 08:50:19.898820 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272e1d40 -- 0x7ff4272e2da02017-07-05 08:50:19.898938 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down 0x7ff4272d8160 -- 0x7ff4272d61602017-07-05 08:50:19.899192 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> mark_down_all2017-07-05 08:50:19.899568 7ff4250219c0 1 -- 10.21.4.1:0/3313807338 <http://10.21.4.1:0/3313807338> shutdown complete.* I'll have a look at the tracker on how to open an issue there. Regards, Maarten On Wed, Jul 5, 2017 at 8:26 AM, Orit Wasserman <[email protected]> wrote: > Hi Maarten, > > On Tue, Jul 4, 2017 at 9:46 PM, Maarten De Quick <[email protected]> > wrote: > >> Hi, >> >> Background: We're having issues with our index pool (slow requests / time >> outs causes crashing of an OSD and a recovery -> application issues). We >> know we have very big buckets (eg. bucket of 77 million objects with only >> 16 shards) that need a reshard so we were looking at the resharding process. >> >> First thing we would like to do is making a backup of the bucket index, >> but this failed with: >> >> # radosgw-admin -n client.radosgw.be-west-3 bi list >> --bucket=priv-prod-up-alex > /var/backup/priv-prod-up-alex.list.backup >> 2017-07-03 21:28:30.325613 7f07fb8bc9c0 0 System already converted >> ERROR: bi_list(): (4) Interrupted system call >> >> > What version of are you using? > Can you rerun the command with --debug-rgw=20 --debug-ms=1? > Also please open a tracker issue (for rgw) with all the information. > > Thanks, > Orit > > When I grep for "idx" and I count these: >> # grep idx priv-prod-up-alex.list.backup | wc -l >> 2294942 >> When I do a bucket stats for that bucket I get: >> # radosgw-admin -n client.radosgw.be-west-3 bucket stats >> --bucket=priv-prod-up-alex | grep num_objects >> 2017-07-03 21:33:05.776499 7faca49b89c0 0 System already converted >> "num_objects": 20148575 >> >> It looks like there are 18 million objects missing and the backup is not >> complete (not sure if that's a correct assumption?). We're also afraid that >> the resharding command will face the same issue. >> Has anyone seen this behaviour before or any thoughts on how to fix it? >> >> We were also wondering if we really need the backup. As the resharding >> process creates a complete new index and keeps the old bucket, is there >> maybe a possibility to relink your bucket to the old bucket in case of >> issues? Or am I missing something important here? >> >> Any help would be greatly appreciated, thanks! >> >> Regards, >> Maarten >> >> _______________________________________________ >> ceph-users mailing list >> [email protected] >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
