Re: [ceph-users] PG inconsistent with error "size_too_large"
I just changed my max object size to 256MB and scrubbed, and the errors went
away. I'm not sure what can be done to reduce the size of these objects,
though, if it really is a problem. Our cluster has dynamic bucket index
resharding turned on, but that resharding process shouldn't help if
non-index objects are what is over the limit. I don't think a pg repair
would do anything unless the config tunables are adjusted.

> On Jan 15, 2020, at 10:56 AM, Massimo Sgaravatto wrote:
>
> I never changed the default value for that attribute
>
> I am missing why I have such big objects around
>
> I am also wondering what a pg repair would do in such a case
>
> On Wed, Jan 15, 2020, at 16:18, Liam Monahan <l...@umiacs.umd.edu> wrote:
> Thanks for that link.
>
> Do you have a default osd max object size of 128M? I'm thinking about
> doubling that limit to 256MB on our cluster. Our largest object is only
> about 10% over that limit.
>
>> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto
>> <massimo.sgarava...@gmail.com> wrote:
>>
>> I guess this is coming from:
>>
>> https://github.com/ceph/ceph/pull/30783
>>
>> introduced in Nautilus 14.2.5
>>
>> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto
>> <massimo.sgarava...@gmail.com> wrote:
>> As I wrote here:
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>>
>> I saw the same after an update from Luminous to Nautilus 14.2.6
>>
>> Cheers, Massimo
>>
>> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan <l...@umiacs.umd.edu> wrote:
>> Hi,
>>
>> I am getting one inconsistent object on our cluster with an inconsistency
>> error that I haven't seen before. This started happening during a rolling
>> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that's
>> related.
>>
>> I was hoping to know what the error means before trying a repair.
>>
>> [root@objmon04 ~]# ceph health detail
>> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
>> OSDMAP_FLAGS noout flag(s) set
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>     pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>>
>> rados list-inconsistent-obj 9.20e --format=json-pretty
>> {
>>     "epoch": 759019,
>>     "inconsistents": [
>>         {
>>             "object": {
>>                 "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>                 "nspace": "",
>>                 "locator": "",
>>                 "snap": "head",
>>                 "version": 692875
>>             },
>>             "errors": [
>>                 "size_too_large"
>>             ],
>>             "union_shard_errors": [],
>>             "selected_object_info": {
>>                 "oid": {
>>                     "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>                     "key": "",
>>                     "snapid": -2,
>>                     "hash": 3321413134,
>>                     "max": 0,
>>                     "pool": 9,
>>                     "namespace": ""
>>                 },
>>                 "version": "281183'692875",
>>                 "prior_version": "281183'692874",
>>                 "last_reqid": "client.34042469.0:206759091",
>>                 "user_version": 692875,
>>                 "size": 146097278,
>>                 "mtime": "2017-07-03 12:43:35.569986",
>>                 "local_mtime": "2017-07-03 12:43:35.571196",
>>                 "lost": 0,
>>                 "flags": [
>>                     "dirty",
>>                     "data_digest",
>>                     "omap_digest"
>>                 ],
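A quick check of the numbers in the scrub report: the flagged object is 146097278 bytes, just over the 128 MiB default `osd_max_object_size` that the PR above started checking during scrub, and well under the doubled limit discussed in this thread. A minimal sketch (the ceph commands in the trailing comments are a hypothetical illustration of the fix described above; verify option names and values against your release before running them):

```shell
# Re-check the size_too_large condition from the scrub report by hand.
object_size=146097278                 # "size" from rados list-inconsistent-obj
default_limit=$((128 * 1024 * 1024))  # assumed osd_max_object_size default (128 MiB)
raised_limit=$((256 * 1024 * 1024))   # the doubled limit proposed in the thread

[ "$object_size" -gt "$default_limit" ] && echo "over the 128 MiB default"
[ "$object_size" -le "$raised_limit" ] && echo "fits under a 256 MiB limit"

# On the cluster itself, raising the limit and re-scrubbing the PG would look
# something like this (hedged example, not a tested recipe):
#   ceph config set osd osd_max_object_size $raised_limit
#   ceph pg deep-scrub 9.20e
```

Raising the limit only silences the scrub check; it does nothing about the underlying object size.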
Re: [ceph-users] PG inconsistent with error "size_too_large"
Thanks for that link.

Do you have a default osd max object size of 128M? I'm thinking about
doubling that limit to 256MB on our cluster. Our largest object is only
about 10% over that limit.

> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto
> <massimo.sgarava...@gmail.com> wrote:
>
> I guess this is coming from:
>
> https://github.com/ceph/ceph/pull/30783
>
> introduced in Nautilus 14.2.5
>
> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto
> <massimo.sgarava...@gmail.com> wrote:
> As I wrote here:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>
> I saw the same after an update from Luminous to Nautilus 14.2.6
>
> Cheers, Massimo
>
> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan <l...@umiacs.umd.edu> wrote:
> Hi,
>
> I am getting one inconsistent object on our cluster with an inconsistency
> error that I haven't seen before. This started happening during a rolling
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that's
> related.
>
> I was hoping to know what the error means before trying a repair.
>
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
>     "epoch": 759019,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 692875
>             },
>             "errors": [
>                 "size_too_large"
>             ],
>             "union_shard_errors": [],
>             "selected_object_info": {
>                 "oid": {
>                     "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>                     "key": "",
>                     "snapid": -2,
>                     "hash": 3321413134,
>                     "max": 0,
>                     "pool": 9,
>                     "namespace": ""
>                 },
>                 "version": "281183'692875",
>                 "prior_version": "281183'692874",
>                 "last_reqid": "client.34042469.0:206759091",
>                 "user_version": 692875,
>                 "size": 146097278,
>                 "mtime": "2017-07-03 12:43:35.569986",
>                 "local_mtime": "2017-07-03 12:43:35.571196",
>                 "lost": 0,
>                 "flags": [
>                     "dirty",
>                     "data_digest",
>                     "omap_digest"
>                 ],
>                 "truncate_seq": 0,
>                 "truncate_size": 0,
>                 "data_digest": "0xf19c8035",
>                 "omap_digest": "0x",
>                 "expected_object_size": 0,
>                 "expected_write_size": 0,
>                 "alloc_hint_flags": 0,
>                 "manifest": {
>                     "type": 0
>                 },
>                 "watchers": {}
>             },
>             "shards": [
>                 {
>                     "osd": 509,
>                     "primary": true,
>                     "errors": [],
>                     "size": 146097278
>                 },
>                 {
>                     "osd": 659,
>                     "primary": false,
>                     "errors": [],
>                     "size": 146097278
>                 },
>                 {
>                     "osd": 674,
>                     "primary": false,
>                     "errors": [],
>                     "size": 146097278
>                 }
>             ]
>         }
>     ]
> }
>
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] PG inconsistent with error "size_too_large"
Hi,

I am getting one inconsistent object on our cluster with an inconsistency
error that I haven't seen before. This started happening during a rolling
upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that's
related.

I was hoping to know what the error means before trying a repair.

[root@objmon04 ~]# ceph health detail
HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
OSDMAP_FLAGS noout flag(s) set
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 9.20e is active+clean+inconsistent, acting [509,674,659]

rados list-inconsistent-obj 9.20e --format=json-pretty
{
    "epoch": 759019,
    "inconsistents": [
        {
            "object": {
                "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 692875
            },
            "errors": [
                "size_too_large"
            ],
            "union_shard_errors": [],
            "selected_object_info": {
                "oid": {
                    "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                    "key": "",
                    "snapid": -2,
                    "hash": 3321413134,
                    "max": 0,
                    "pool": 9,
                    "namespace": ""
                },
                "version": "281183'692875",
                "prior_version": "281183'692874",
                "last_reqid": "client.34042469.0:206759091",
                "user_version": 692875,
                "size": 146097278,
                "mtime": "2017-07-03 12:43:35.569986",
                "local_mtime": "2017-07-03 12:43:35.571196",
                "lost": 0,
                "flags": [
                    "dirty",
                    "data_digest",
                    "omap_digest"
                ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xf19c8035",
                "omap_digest": "0x",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 509,
                    "primary": true,
                    "errors": [],
                    "size": 146097278
                },
                {
                    "osd": 659,
                    "primary": false,
                    "errors": [],
                    "size": 146097278
                },
                {
                    "osd": 674,
                    "primary": false,
                    "errors": [],
                    "size": 146097278
                }
            ]
        }
    ]
}

Thanks,
Liam
—
Senior Developer
Institute for Advanced Computer Studies
University of Maryland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Handling large omap objects in the .log pool
Hi,

Our cluster has large omap objects in the .log pool. Recent changes to the
default warn limits brought this to our attention. Automatic resharding of
rgw buckets seems to have helped with all of our other large omap warnings
elsewhere.

I guess my first question is: what sort of things are held in the .log pool?
And secondly, is there something we should or could be doing to prune .log
so that it doesn't grow omap objects large enough to trigger warnings?

2019-09-18 22:38:55.411684 osd.71 (osd.71) 175 : cluster [WRN] Large omap object found. Object: 9:79810424:::data_log.8:head Key count: 774934 Size (bytes): 100233321

Thanks,
Liam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
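The threshold side of this warning can be checked directly. A sketch, assuming the lowered Nautilus-era default of 200000 keys for `osd_deep_scrub_large_omap_object_key_threshold` (the "recent changes to the default warn limits" mentioned above); the `radosgw-admin` line in the comment is a hedged illustration, not a tested recipe, and only applies where the data log is not needed for multisite sync:

```shell
# Compare the reported data_log.8 omap key count against the (assumed)
# lowered default large-omap warning threshold.
key_count=774934        # "Key count" from the cluster log warning above
warn_threshold=200000   # assumed osd_deep_scrub_large_omap_object_key_threshold default

[ "$key_count" -gt "$warn_threshold" ] && echo "data_log.8 trips the large-omap warning"

# In a single-site deployment the RGW data log can typically be trimmed with
# something like (hypothetical; check the flags on your release first):
#   radosgw-admin datalog trim --start-date=2001-01-01 --end-date=2019-09-01
```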
[ceph-users] RGWObjectExpirer crashing after upgrade from 14.2.0 to 14.2.3
Hi,

I recently took our test cluster up to a new version and am no longer able
to start radosgw. The cluster itself (mon, osd, mgr) appears fine. Without
being much of an expert at reading this, from the errors that were being
thrown it seems like the object expirer is choking on handling resharded
buckets. There have been no recent reshard operations on this cluster, and
dynamic resharding is disabled. I thought this could have been related to
https://github.com/ceph/ceph/pull/27817 but that landed by v14.2.3...

Logs from starting up radosgw:

   -26> 2019-09-17 16:18:45.719 7f2d93da2780  0 starting handler: civetweb
   -25> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: allow_unicode_in_urls: yes
   -24> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: canonicalize_url_path: no
   -23> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: decode_url: no
   -22> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_auth_domain_check: no
   -21> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_keep_alive: yes
   -20> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: listening_ports: 7480,7481s
   -19> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: num_threads: 512
   -18> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: run_as_user: ceph
   -17> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: ssl_certificate: '/etc/ceph/rgw.pem'
   -16> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: validate_http_method: no
   -15> 2019-09-17 16:18:45.720 7f2d93da2780  0 civetweb: 0x55d622628600: ssl_use_pem_file: cannot open certificate file '/etc/ceph/rgw.pem': error:02001002:system library:fopen:No such file or directory
   -14> 2019-09-17 16:18:45.720 7f2d93da2780 -1 ERROR: failed run
   -13> 2019-09-17 16:18:45.721 7f2d5c97b700  5 lifecycle: schedule life cycle next start time: Wed Sep 18 04:00:00 2019
   -12> 2019-09-17 16:18:45.721 7f2d5f180700 20 reqs_thread_entry: start
   -11> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94360:op=0x55d625bcd800:20MetaMasterTrimPollCR: operate()
   -10> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94360 is io blocked
    -9> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94480:op=0x55d625a68c00:17DataLogTrimPollCR: operate()
    -8> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94480 is io blocked
    -7> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c945a0:op=0x55d625a69200:16BucketTrimPollCR: operate()
    -6> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c945a0 is io blocked
    -5> 2019-09-17 16:18:45.721 7f2d5c17a700 20 BucketsSyncThread: start
    -4> 2019-09-17 16:18:45.721 7f2d5b979700 20 UserSyncThread: start
    -3> 2019-09-17 16:18:45.721 7f2d5b178700 20 process_all_logshards Resharding is disabled
    -2> 2019-09-17 16:18:45.721 7f2d5d97d700 20 reqs_thread_entry: start
    -1> 2019-09-17 16:18:45.724 7f2d731a8700 20 processing shard = obj_delete_at_hint.01
     0> 2019-09-17 16:18:45.726 7f2d731a8700 -1 *** Caught signal (Aborted) **
 in thread 7f2d731a8700 thread_name:rgw_obj_expirer

 ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
 1: (()+0xf630) [0x7f2d86ff7630]
 2: (gsignal()+0x37) [0x7f2d86431377]
 3: (abort()+0x148) [0x7f2d86432a68]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2d86d417d5]
 5: (()+0x5e746) [0x7f2d86d3f746]
 6: (()+0x5e773) [0x7f2d86d3f773]
 7: (()+0x5e993) [0x7f2d86d3f993]
 8: (()+0x1772b) [0x7f2d92efb72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7f2d92f19a03]
 10: (()+0x70a8a2) [0x55d6222508a2]
 11: (()+0x70a8e8) [0x55d6222508e8]
 12: (RGWObjectExpirer::process_single_shard(std::string const&, utime_t const&, utime_t const&)+0x115) [0x55d622253155]
 13: (RGWObjectExpirer::inspect_all_shards(utime_t const&, utime_t const&)+0xab) [0x55d62225382b]
 14: (RGWObjectExpirer::OEWorker::entry()+0x273) [0x55d622253c43]
 15: (()+0x7ea5) [0x7f2d86fefea5]
 16: (clone()+0x6d) [0x7f2d864f98cd]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

As another datapoint, getting bucket stats now fails for two buckets in the
cluster:

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=wHjAk0t
failure: (2) No such file or directory:
2019-09-19 14:21:59.483 7f54ed3fd6c0 -1 ERROR: get_bucket_instance_from_oid failed: -2

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=bzUi3MT
failure: (2) No such file or directory:
2019-09-19 14:22:16.324 7fbd172666c0 -1 ERROR: get_bucket_instance_from_oid failed: -2

Has anyone seen this before? Didn't see a lot on this from googling. Let me
know if I can provide any more useful debugging information.

Thanks,
Liam
---
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list
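The -2 in those bucket-stats failures is ENOENT: the bucket metadata points at an instance object that cannot be found. One way to dig further is to compare the bucket entrypoint metadata against the instance objects that actually exist. This is a hedged sketch of commands to run against the live cluster, not a verified recipe for this version:

```shell
# Look up what the bucket entrypoint metadata says the current bucket
# instance id is for one of the failing buckets...
radosgw-admin metadata get bucket:wHjAk0t

# ...then list the bucket.instance metadata objects that actually exist, to
# see whether the referenced instance is missing (which would explain
# "get_bucket_instance_from_oid failed: -2").
radosgw-admin metadata list bucket.instance
```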