Hi David

Sorry, I forgot to answer your mail. I wanted to run some tests first, and
then I also got the flu.

The current situation on my small "cluster" is:

- After removing all load from the affected pool, and after enough rounds of
ceph pg repair and node restarts, the inconsistent PG errors/warnings
disappeared and the cluster came back into a clean state (see the command
sketch after this list).
- However, as soon as I started some qemu virtual machines (both a Linux and
a Windows VM), I got crashing OSDs and inconsistent PGs again. I thought I
had seen it mentioned that ec_optimizations would not require any client
library modifications, but maybe that is not the case for qemu? I haven't
tested that yet.
- Writing to an ext4 fs on an RBD image mounted directly via mount -t ceph
on one of the cluster nodes didn't lead to near-immediate OSD crashes and
inconsistent PGs.
- For now I have copied most of the images back to an erasure coded pool
without ec_optimization support enabled, and there I can run the VM images
with qemu again without OSD crashes and inconsistent PGs.
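
For reference, this is roughly the check/repair loop I ran while the pool
was idle. It's a sketch from memory with a placeholder <pgid>; the exact PG
ids and output obviously differ per cluster:

  # list the PGs flagged as inconsistent
  ceph health detail | grep -i inconsistent

  # inspect one of them (same command as in your mail below)
  rados list-inconsistent-obj <pgid> --format=json-pretty

  # ask the primary OSD to repair it, then watch the scrub state
  ceph pg repair <pgid>
  ceph pg <pgid> query | grep -i scrub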

Regarding your questions:

- What EC scheme do you use? We are using 6+2

The erasure coded pool I enabled ec_optimization on uses jerasure 3+2
with lz4 compression.
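
In case the exact pool setup matters: the profile and pool were originally
created roughly like this (reconstructed from memory, so treat it as a
sketch; the names are placeholders and I'm no longer sure about the failure
domain or the compression mode):

  ceph osd erasure-code-profile set ec32 plugin=jerasure k=3 m=2 crush-failure-domain=host
  ceph osd pool create rbd-ec erasure ec32
  ceph osd pool set rbd-ec allow_ec_overwrites true   # needed for RBD on EC
  ceph osd pool set rbd-ec compression_algorithm lz4
  ceph osd pool set rbd-ec compression_mode aggressive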

- Do you use cephadm?

Yes, the cluster has been deployed with cephadm from the beginning.
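
If it helps you compare, this is how I double-check what is deployed and
which versions the daemons actually run (nothing special, just the usual
commands):

  ceph orch status    # confirms the cephadm orchestrator backend is active
  ceph versions       # versions actually running, per daemon type
  cephadm ls          # daemons/containers on the local host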

- Do you run it via kubernetes?

No, it's running on bare metal with Debian 13 as the OS (Debian 12 at the
start of the year).

- What is the operating system that you are using?

Debian 13. The Ceph libs on the Debian base system are from 18.2.7+ds-1.
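
(That is simply what the Debian-packaged client libs report on the hosts; I
checked with something like:)

  dpkg -l | grep -E 'librbd|librados|ceph-common'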

- What was the version you updated from?

I upgraded from 19.2.3 to 20.2.0. I ran the cluster without any issues
(aside from some dashboard bugs) for a week before I tried to enable the
allow_ec_optimization flag on one of the erasure coded pools.
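
The upgrade itself and the flag change were nothing unusual, roughly the
following (the pool flag name below is written from memory, so please
double-check it against the Tentacle docs rather than taking my word for
it):

  ceph orch upgrade start --ceph-version 20.2.0
  ceph orch upgrade status
  # about a week later, on the existing EC pool:
  ceph osd pool set <pool> allow_ec_optimizations true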

- What version did you use to create the pool?

The original pool was created under 17.2.6 (Quincy) back in 2023.

- The affected pool, is it hard drives?

Yes, the pool is on hard drives.
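
(If you want to compare device classes, I just looked at:)

  ceph osd df tree                  # the CLASS column shows hdd/ssd per OSD
  ceph osd pool get <pool> crush_rule
  ceph osd crush rule dump <rule>   # shows if the rule is pinned to a device class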



Regards

Reto

On Thu, Dec 4, 2025 at 15:21, David Walter <[email protected]> wrote:

> Hi Reto,
>
> I would be very thankful for a reply since it seems we have to solve
> this on our own.
> Do you have any new insights into the problem?
>
> Things that could help are:
>
> - What EC scheme do you use? We are using 6+2
> - Do you use cephadm?
> - Do you run on via kubernetes?
> - What is the operating system that you are using?
> - What was the version you updated from?
> - What version did you use to create the pool?
> - The effected pool, is it hard drives?
>
> Best,
>
> David
>
> On 12/1/25 07:59, David Walter wrote:
> > Hi Reto,
> >
> > I've seen your post on the ceph users list, unfortunately I was not
> > able to reply on the thread or post something myself.
> >
> > I have the same problem. I was upgrading from ceph squid 19.2.3 to
> > ceph tentacle 20.2.0 and all was good first.
> > But when I enabled ec optimizations I get scrubbing errors like this:
> >
> > HEALTH_ERR 88909311 scrub errors; Possible data damage: 117 pgs
> > inconsistent
> > [ERR] OSD_SCRUB_ERRORS: 88909311 scrub errors
> > [ERR] PG_DAMAGED: Possible data damage: 117 pgs inconsistent
> >
> > and I was unable repairing them. Some errors disappear after
> > restarting all osd daemons but they come back after another scrubbing.
> > I don't have any indication that data is lost so far. But the cluster
> > is under pressure and daemons crash, so it's in a critical state.
> >
> > When running "rados list-inconsistent-obj <PG>" one some PGs I get
> > errors like these:
> >
> >
> {"osd":62,"primary":false,"shard":4,"errors":["size_mismatch_info","obj_size_info_mismatch"],"size":700416,"object_info":{"oid":{"oid":"1000793c33a.000000dd","key":"","snapid":-2,"hash":2213519396,"max":0,"pool":4,"namespace":""},"version":"43143'3253925","prior_version":"0'0","last_reqid":"client.1822828.0:160762343","user_version":3253925,"size":4194304,"mtime":"2025-11-10T22:13:15.257031-0500","local_mtime":"2025-11-10T22:14:44.294560-0500","lost":0,"flags":["dirty","data_digest"],"truncate_seq":0,"truncate_size":0,"data_digest":"0x7e22aafe","omap_digest":"0xffffffff","expected_object_size":0,"expected_write_size":0,"alloc_hint_flags":0,"manifest":{"type":0},"watchers":{}}},{"osd":65,"primary":false,"shard":3,"errors":[],"size":700416}
>
> >
> >
> > Did you find out more about this issue or how do you plan to move
> > forward?
> >
> > Best,
> >
> > David
> >
>
