Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Liam Monahan
I just changed my max object size to 256 MB and scrubbed, and the errors went 
away.  I’m not sure what can be done to reduce the size of these objects, 
though, if it really is a problem.  Our cluster has dynamic bucket index 
resharding turned on, but that sharding process shouldn’t help if non-index 
objects are what’s over the limit.

I don’t think a pg repair would do anything unless the config tunables are 
adjusted.
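
In case it is useful to anyone else, this is roughly the sequence (a sketch 
rather than exactly what I typed; the 256 MB figure is just the value we 
settled on, and it assumes the option is set through the mon config store 
rather than in ceph.conf):

    # check the current limit (the default is 128 MiB)
    ceph config get osd osd_max_object_size

    # raise it to 256 MiB, value given in bytes
    ceph config set osd osd_max_object_size 268435456

    # re-run a deep scrub on the affected PG and re-check health
    ceph pg deep-scrub 9.20e
    ceph health detail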

> On Jan 15, 2020, at 10:56 AM, Massimo Sgaravatto wrote:
> 
> I never changed the default value for that attribute
> 
> I am missing why I have such big objects around 
> 
> I am also wondering what a pg repair would do in such case
> 
> On Wed, Jan 15, 2020 at 16:18 Liam Monahan <l...@umiacs.umd.edu> wrote:
> Thanks for that link.
> 
> Do you have a default osd max object size of 128M?  I’m thinking about 
> doubling that limit to 256MB on our cluster.  Our largest object is only 
> about 10% over that limit.
> 
>> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto <massimo.sgarava...@gmail.com> wrote:
>> 
>> I guess this is coming from:
>> 
>> https://github.com/ceph/ceph/pull/30783
>> 
>> introduced in Nautilus 14.2.5
>> 
>> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <massimo.sgarava...@gmail.com> wrote:
>> As I wrote here:
>> 
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>> 
>> I saw the same after an update from Luminous to Nautilus 14.2.6
>> 
>> Cheers, Massimo
>> 
>> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan <l...@umiacs.umd.edu> wrote:
>> Hi,
>> 
>> I am getting one inconsistent object on our cluster with an inconsistency 
>> error that I haven’t seen before.  This started happening during a rolling 
>> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s 
>> related.
>> 
>> I was hoping to know what the error means before trying a repair.
>> 
>> [root@objmon04 ~]# ceph health detail
>> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
>> OSDMAP_FLAGS noout flag(s) set
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>> 
>> rados list-inconsistent-obj 9.20e --format=json-pretty
>> {
>>     "epoch": 759019,
>>     "inconsistents": [
>>         {
>>             "object": {
>>                 "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>                 "nspace": "",
>>                 "locator": "",
>>                 "snap": "head",
>>                 "version": 692875
>>             },
>>             "errors": [
>>                 "size_too_large"
>>             ],
>>             "union_shard_errors": [],
>>             "selected_object_info": {
>>                 "oid": {
>>                     "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>                     "key": "",
>>                     "snapid": -2,
>>                     "hash": 3321413134,
>>                     "max": 0,
>>                     "pool": 9,
>>                     "namespace": ""
>>                 },
>>                 "version": "281183'692875",
>>                 "prior_version": "281183'692874",
>>                 "last_reqid": "client.34042469.0:206759091",
>>                 "user_version": 692875,
>>                 "size": 146097278,
>>                 "mtime": "2017-07-03 12:43:35.569986",
>>                 "local_mtime": "2017-07-03 12:43:35.571196",
>>                 "lost": 0,
>>                 "flags": [
>>                     "dirty",
>>                     "data_digest",
>>                     "omap_digest"
>>                 ],
>>

Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Liam Monahan
Thanks for that link.

Do you have a default osd max object size of 128M?  I’m thinking about doubling 
that limit to 256MB on our cluster.  Our largest object is only about 10% over 
that limit.

> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto wrote:
> 
> I guess this is coming from:
> 
> https://github.com/ceph/ceph/pull/30783
> 
> introduced in Nautilus 14.2.5
> 
> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <massimo.sgarava...@gmail.com> wrote:
> As I wrote here:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
> 
> I saw the same after an update from Luminous to Nautilus 14.2.6
> 
> Cheers, Massimo
> 
> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan <l...@umiacs.umd.edu> wrote:
> Hi,
> 
> I am getting one inconsistent object on our cluster with an inconsistency 
> error that I haven’t seen before.  This started happening during a rolling 
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s 
> related.
> 
> I was hoping to know what the error means before trying a repair.
> 
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
> 
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
>     "epoch": 759019,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 692875
>             },
>             "errors": [
>                 "size_too_large"
>             ],
>             "union_shard_errors": [],
>             "selected_object_info": {
>                 "oid": {
>                     "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>                     "key": "",
>                     "snapid": -2,
>                     "hash": 3321413134,
>                     "max": 0,
>                     "pool": 9,
>                     "namespace": ""
>                 },
>                 "version": "281183'692875",
>                 "prior_version": "281183'692874",
>                 "last_reqid": "client.34042469.0:206759091",
>                 "user_version": 692875,
>                 "size": 146097278,
>                 "mtime": "2017-07-03 12:43:35.569986",
>                 "local_mtime": "2017-07-03 12:43:35.571196",
>                 "lost": 0,
>                 "flags": [
>                     "dirty",
>                     "data_digest",
>                     "omap_digest"
>                 ],
>                 "truncate_seq": 0,
>                 "truncate_size": 0,
>                 "data_digest": "0xf19c8035",
>                 "omap_digest": "0x",
>                 "expected_object_size": 0,
>                 "expected_write_size": 0,
>                 "alloc_hint_flags": 0,
>                 "manifest": {
>                     "type": 0
>                 },
>                 "watchers": {}
>             },
>             "shards": [
>                 {
>                     "osd": 509,
>                     "primary": true,
>                     "errors": [],
>                     "size": 146097278
>                 },
>                 {
>                     "osd": 659,
>                     "primary": false,
>                     "errors": [],
>                     "size": 146097278
>                 },
>                 {
>                     "osd": 674,
>                     "primary": false,
>                     "errors": [],
>                     "size": 146097278
>                 }
>             ]
>         }
>     ]
> }
> 
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG inconsistent with error "size_too_large"

2020-01-14 Thread Liam Monahan
Hi,

I am getting one inconsistent object on our cluster with an inconsistency error 
that I haven’t seen before.  This started happening during a rolling upgrade of 
the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s related.

I was hoping to know what the error means before trying a repair.

[root@objmon04 ~]# ceph health detail
HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg inconsistent
OSDMAP_FLAGS noout flag(s) set
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 9.20e is active+clean+inconsistent, acting [509,674,659]

rados list-inconsistent-obj 9.20e --format=json-pretty
{
    "epoch": 759019,
    "inconsistents": [
        {
            "object": {
                "name": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 692875
            },
            "errors": [
                "size_too_large"
            ],
            "union_shard_errors": [],
            "selected_object_info": {
                "oid": {
                    "oid": "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
                    "key": "",
                    "snapid": -2,
                    "hash": 3321413134,
                    "max": 0,
                    "pool": 9,
                    "namespace": ""
                },
                "version": "281183'692875",
                "prior_version": "281183'692874",
                "last_reqid": "client.34042469.0:206759091",
                "user_version": 692875,
                "size": 146097278,
                "mtime": "2017-07-03 12:43:35.569986",
                "local_mtime": "2017-07-03 12:43:35.571196",
                "lost": 0,
                "flags": [
                    "dirty",
                    "data_digest",
                    "omap_digest"
                ],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xf19c8035",
                "omap_digest": "0x",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 509,
                    "primary": true,
                    "errors": [],
                    "size": 146097278
                },
                {
                    "osd": 659,
                    "primary": false,
                    "errors": [],
                    "size": 146097278
                },
                {
                    "osd": 674,
                    "primary": false,
                    "errors": [],
                    "size": 146097278
                }
            ]
        }
    ]
}
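
The only number in that output that jumps out at me is the size: 146097278 
bytes is roughly 139 MiB.  A quick sanity check against the OSD-side limit 
would look something like this (a sketch only; I have not confirmed yet that 
osd_max_object_size is the threshold the scrub is comparing against):

    # the object size reported by scrub, in MiB
    echo $((146097278 / 1024 / 1024))    # -> 139

    # what the OSDs are currently enforcing
    ceph config get osd osd_max_object_size

    # or, via the admin socket on the host carrying one of the acting OSDs
    ceph daemon osd.509 config get osd_max_object_size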

Thanks,
Liam
—
Senior Developer
Institute for Advanced Computer Studies
University of Maryland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Handling large omap objects in the .log pool

2019-09-27 Thread Liam Monahan
Hi,

Our cluster has large omap objects in the .log pool.  Recent changes to the 
default warn limits brought this to our awareness.  Automatic resharding of rgw 
buckets seems to have helped with all of our other large omap warnings 
elsewhere.

I guess my first question is: what sort of things are held in the .log pool?  
And secondly, is there something we should or could be doing to prune .log so 
that its omap objects don’t grow large enough to trigger warnings?

2019-09-18 22:38:55.411684 osd.71 (osd.71) 175 : cluster [WRN] Large omap object found. Object: 9:79810424:::data_log.8:head Key count: 774934 Size (bytes): 100233321
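
In case it helps, this is roughly how I have been looking at the shard sizes 
(a sketch: the .log pool name is taken from the warning above, and I have not 
verified the trim side of things, so please check the radosgw-admin docs for 
your release before trimming anything):

    # count omap keys on each data_log shard in the .log pool
    for obj in $(rados -p .log ls | grep '^data_log'); do
        printf '%s %s\n' "$obj" "$(rados -p .log listomapkeys "$obj" | wc -l)"
    done

    # per-shard data log markers, useful before considering
    # radosgw-admin datalog trim (which I have not tried yet)
    radosgw-admin datalog status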

Thanks,
Liam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGWObjectExpirer crashing after upgrade from 14.2.0 to 14.2.3

2019-09-19 Thread Liam Monahan
Hi,

I recently took our test cluster up to a new version and am no longer able to 
start radosgw.  The cluster itself (mon, osd, mgr) appears fine.

I’m not much of an expert at reading these, but from the errors being thrown it 
seems like the object expirer is choking on handling resharded buckets.  There 
have been no recent reshard operations on this cluster, and dynamic resharding 
is disabled.  I thought this could have been related to 
https://github.com/ceph/ceph/pull/27817 but that landed in v14.2.3...

Logs from starting up radosgw:

 -26> 2019-09-17 16:18:45.719 7f2d93da2780 0 starting handler: civetweb
 -25> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: allow_unicode_in_urls: yes
 -24> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: canonicalize_url_path: no
 -23> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: decode_url: no
 -22> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_auth_domain_check: no
 -21> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: enable_keep_alive: yes
 -20> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: listening_ports: 7480,7481s
 -19> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: num_threads: 512
 -18> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: run_as_user: ceph
 -17> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: ssl_certificate: '/etc/ceph/rgw.pem'
 -16> 2019-09-17 16:18:45.720 7f2d93da2780 20 civetweb config: validate_http_method: no
 -15> 2019-09-17 16:18:45.720 7f2d93da2780 0 civetweb: 0x55d622628600: ssl_use_pem_file: cannot open certificate file '/etc/ceph/rgw.pem': error:02001002:system library:fopen:No such file or directory
 -14> 2019-09-17 16:18:45.720 7f2d93da2780 -1 ERROR: failed run
 -13> 2019-09-17 16:18:45.721 7f2d5c97b700 5 lifecycle: schedule life cycle next start time: Wed Sep 18 04:00:00 2019
 -12> 2019-09-17 16:18:45.721 7f2d5f180700 20 reqs_thread_entry: start
 -11> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94360:op=0x55d625bcd800:20MetaMasterTrimPollCR: operate()
 -10> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94360 is io blocked
  -9> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c94480:op=0x55d625a68c00:17DataLogTrimPollCR: operate()
  -8> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c94480 is io blocked
  -7> 2019-09-17 16:18:45.721 7f2d5e97f700 20 cr:s=0x55d625c945a0:op=0x55d625a69200:16BucketTrimPollCR: operate()
  -6> 2019-09-17 16:18:45.721 7f2d5e97f700 20 run: stack=0x55d625c945a0 is io blocked
  -5> 2019-09-17 16:18:45.721 7f2d5c17a700 20 BucketsSyncThread: start
  -4> 2019-09-17 16:18:45.721 7f2d5b979700 20 UserSyncThread: start
  -3> 2019-09-17 16:18:45.721 7f2d5b178700 20 process_all_logshards Resharding is disabled
  -2> 2019-09-17 16:18:45.721 7f2d5d97d700 20 reqs_thread_entry: start
  -1> 2019-09-17 16:18:45.724 7f2d731a8700 20 processing shard = obj_delete_at_hint.01
   0> 2019-09-17 16:18:45.726 7f2d731a8700 -1 *** Caught signal (Aborted) **
 in thread 7f2d731a8700 thread_name:rgw_obj_expirer

 ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable)
 1: (()+0xf630) [0x7f2d86ff7630]
 2: (gsignal()+0x37) [0x7f2d86431377]
 3: (abort()+0x148) [0x7f2d86432a68]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2d86d417d5]
 5: (()+0x5e746) [0x7f2d86d3f746]
 6: (()+0x5e773) [0x7f2d86d3f773]
 7: (()+0x5e993) [0x7f2d86d3f993]
 8: (()+0x1772b) [0x7f2d92efb72b]
 9: (tcmalloc::allocate_full_cpp_throw_oom(unsigned long)+0xf3) [0x7f2d92f19a03]
 10: (()+0x70a8a2) [0x55d6222508a2]
 11: (()+0x70a8e8) [0x55d6222508e8]
 12: (RGWObjectExpirer::process_single_shard(std::string const&, utime_t const&, utime_t const&)+0x115) [0x55d622253155]
 13: (RGWObjectExpirer::inspect_all_shards(utime_t const&, utime_t const&)+0xab) [0x55d62225382b]
 14: (RGWObjectExpirer::OEWorker::entry()+0x273) [0x55d622253c43]
 15: (()+0x7ea5) [0x7f2d86fefea5]
 16: (clone()+0x6d) [0x7f2d864f98cd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
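
Since the crash happens right after "processing shard = obj_delete_at_hint...", 
my next step may be to poke at those hint shard objects directly.  A sketch of 
what I have in mind (the pool name here is a guess for this cluster: on a 
default-zone setup it would be something like default.rgw.log, on older 
clusters just .log, and the shard object name is only assumed from the log 
line above):

    # list the expirer's hint shard objects in the rgw log pool
    rados -p default.rgw.log ls | grep obj_delete_at_hint | sort | head

    # look at the pending entries on one shard
    rados -p default.rgw.log listomapkeys obj_delete_at_hint.0000000001 | head
    rados -p default.rgw.log listomapvals obj_delete_at_hint.0000000001 | head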


As another data point, getting bucket stats now fails for two buckets in the 
cluster:

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=wHjAk0t
failure: (2) No such file or directory:
2019-09-19 14:21:59.483 7f54ed3fd6c0 -1 ERROR: get_bucket_instance_from_oid failed: -2

[root@cephproxy01 ~]# radosgw-admin bucket stats --bucket=bzUi3MT
failure: (2) No such file or directory:
2019-09-19 14:22:16.324 7fbd172666c0 -1 ERROR: get_bucket_instance_from_oid failed: -2
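
For what it’s worth, this is roughly how I plan to cross-check whether the 
bucket entrypoints still point at valid bucket instances (a sketch only; I 
have not dug further yet):

    # does the entrypoint metadata still exist for the two buckets?
    radosgw-admin metadata list bucket | grep -E 'wHjAk0t|bzUi3MT'
    radosgw-admin metadata get bucket:wHjAk0t

    # and are there matching bucket.instance entries?
    radosgw-admin metadata list bucket.instance | grep -E 'wHjAk0t|bzUi3MT'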


Has anyone seen this before?  Didn’t see a lot on this from googling.  Let me 
know if I can provide any more useful debugging information.

Thanks,
Liam
---
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list