Hi David,

Sorry I forgot to answer your mail. I wanted to do some tests first, and
then I also got the flu.
The current situation on my small "cluster" is:

- After disabling any load on the affected pool, and after enough rounds
  of "ceph pg repair" plus node restarts, the inconsistent PG
  errors/warnings disappeared and the cluster came back into a clean
  state (roughly the commands in the first sketch after my answers
  below).
- However, as soon as I started some qemu virtual machines (both a Linux
  and a Windows VM) I got crashing OSDs and inconsistent PGs again
  (crash listing commands in the second sketch below). I thought I saw
  it mentioned that ec_optimizations wouldn't require any client library
  modifications, but maybe that is not the case for qemu? I haven't
  tested that yet.
- Writing to an ext4 fs on an RBD image mounted directly via mount -t
  ceph on one of the cluster nodes didn't lead to near-immediate / very
  soon OSD crashes and inconsistent PGs.
- For now I've copied most of the images back to an erasure coded pool
  without ec_optimization enabled, and there I can run the VM images
  with qemu again without OSD crashes and inconsistent PGs.

Regarding your questions:

- What EC scheme do you use? We are using 6+2
  The erasure coded pool I've enabled ec_optimization on used jerasure
  3+2 with lz4 compression (profile inspection commands in the third
  sketch below).
- Do you use cephadm?
  Yes, the cluster has been deployed with cephadm from the beginning.
- Do you run it via kubernetes?
  No, it's running on bare metal with Debian 13 as OS (Debian 12 at the
  start of the year).
- What is the operating system that you are using?
  Debian 13. The libs on the Debian base system are from 18.2.7+ds-1.
- What was the version you updated from?
  I've upgraded from 19.2.3 to 20.2.0. I ran the cluster without any
  issues (aside from some dashboard bugs) for a week before I tried to
  enable the allow_ec_optimization flag on one of the erasure coded
  pools.
- What version did you use to create the pool?
  The original pool was created under 17.2.6 (quincy) back in 2023.
- The affected pool, is it hard drives?
  Yes, the pool is hard drives.
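In case it helps to compare notes, here is roughly the loop I mean per
damaged PG. These are the standard Ceph CLI commands; <pool> and <pgid>
are placeholders:

    # Overview of the current scrub errors / damaged PGs
    ceph health detail

    # Which PGs in the affected pool are inconsistent
    rados list-inconsistent-pg <pool>

    # For one PG: which shards/OSDs disagree and why
    rados list-inconsistent-obj <pgid> --format=json-pretty

    # Ask the primary OSD to repair, then verify with a fresh deep scrub
    ceph pg repair <pgid>
    ceph pg deep-scrub <pgid>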
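For the crashing OSDs, the crash module is probably the quickest way to
pull details; <crash-id> is a placeholder:

    # Recent, not yet acknowledged daemon crashes, and details for one
    ceph crash ls-new
    ceph crash info <crash-id>

    # Acknowledge them afterwards so the health warning clears
    ceph crash archive-all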
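And for completeness, the profile and pool details on my side can be
pulled with the usual commands; <pool> and <profile> are placeholders
(here the profile is the jerasure 3+2 one mentioned above):

    # Which EC profile backs the pool, and the profile's parameters
    ceph osd pool get <pool> erasure_code_profile
    ceph osd erasure-code-profile get <profile>

    # All current pool settings, to see which flags are set on the pool
    ceph osd pool get <pool> all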
Regards,
Reto

On Thu, Dec 4, 2025 at 15:21 David Walter <[email protected]> wrote:

> Hi Reto,
>
> I would be very thankful for a reply since it seems we have to solve
> this on our own.
> Do you have any new insights into the problem?
>
> Things that could help are:
>
> - What EC scheme do you use? We are using 6+2
> - Do you use cephadm?
> - Do you run it via kubernetes?
> - What is the operating system that you are using?
> - What was the version you updated from?
> - What version did you use to create the pool?
> - The affected pool, is it hard drives?
>
> Best,
>
> David
>
> On 12/1/25 07:59, David Walter wrote:
> > Hi Reto,
> >
> > I've seen your post on the ceph-users list; unfortunately I was not
> > able to reply on the thread or post something myself.
> >
> > I have the same problem. I was upgrading from ceph squid 19.2.3 to
> > ceph tentacle 20.2.0 and all was good at first.
> > But when I enabled ec optimizations I get scrubbing errors like this:
> >
> > HEALTH_ERR 88909311 scrub errors; Possible data damage: 117 pgs
> > inconsistent
> > [ERR] OSD_SCRUB_ERRORS: 88909311 scrub errors
> > [ERR] PG_DAMAGED: Possible data damage: 117 pgs inconsistent
> >
> > and I was unable to repair them. Some errors disappear after
> > restarting all OSD daemons but they come back after another scrub.
> > I don't have any indication that data is lost so far. But the cluster
> > is under pressure and daemons crash, so it's in a critical state.
> >
> > When running "rados list-inconsistent-obj <PG>" on some PGs I get
> > errors like these:
> >
> > {"osd":62,"primary":false,"shard":4,"errors":["size_mismatch_info","obj_size_info_mismatch"],"size":700416,"object_info":{"oid":{"oid":"1000793c33a.000000dd","key":"","snapid":-2,"hash":2213519396,"max":0,"pool":4,"namespace":""},"version":"43143'3253925","prior_version":"0'0","last_reqid":"client.1822828.0:160762343","user_version":3253925,"size":4194304,"mtime":"2025-11-10T22:13:15.257031-0500","local_mtime":"2025-11-10T22:14:44.294560-0500","lost":0,"flags":["dirty","data_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x7e22aafe","omap_digest":"0xffffffff","expected_object_size":0,"expected_write_size":0,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}}},{"osd":65,"primary":false,"shard":3,"errors":[],"size":700416}
> >
> > Did you find out more about this issue or how do you plan to move
> > forward?
> >
> > Best,
> >
> > David
