I've seen OSDs crash when running an fstrim on an RBD image with an EC data pool that has the EC optimizations enabled. It's always two OSDs. It's an 8+2 pool, but I don't know whether the "2"s are a coincidence or not. Other than that, I haven't seen any OSD crashes with EC optimizations. We had the same size mismatches, which went away after migrating the data.
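
In case it helps anyone reproduce, the trigger is the equivalent of the following; the image, device and mount point names here are just placeholders, not my actual setup:

  rbd map rbd/some_image        # placeholder image; its data pool is the 8+2 EC pool
  mount /dev/rbd0 /mnt/some_fs  # device path as printed by rbd map; the filesystem already exists on the image
  fstrim -v /mnt/some_fs        # this is the step during which the two OSDs go down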

On 12/5/2025 10:34 AM, Reto Gysi via ceph-users wrote:
Hi Bill

On Fri, 5 Dec 2025 at 10:02, Bill Scales <[email protected]> wrote:

Hi Reto,

Sorry to hear about your problems with turning on ec optimizations. I've
led the team that developed this feature, so we are keen to help understand
what happened to your system.

Your configuration looks fine as far as support for ec optimizations goes. The daemons (mons, osds and mgr) need to be running tentacle code to use this feature; there is no requirement to update any of your clients.
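
If you want to double-check that, the output of:

  ceph versions

shows which release each daemon type (mon, mgr, osd, mds) is currently running.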

Do you have any more examples of the inconsistent objects logs that you
can share with me?
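
If it is not too much trouble, the scrub's detailed view of an inconsistent PG can be dumped with something like the following (the pg id is a placeholder to fill in from ceph health detail):

  rados list-inconsistent-obj <pgid> --format=json-pretty

That output would show exactly which attributes scrub thinks disagree.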

There was a bug that we fixed late in the development cycle where scrubbing incorrectly reported a size mismatch for objects written prior to turning on optimizations. This is because, prior to turning on optimizations, objects are padded to a multiple of the stripe_width; afterwards, objects are no longer padded. The scrub code was occasionally getting confused and incorrectly reporting an inconsistency. In this case the scrub was reporting false-positive errors - there was nothing wrong with the data and no problems accessing it. I notice the log you shared in the email was for a size mismatch; I'm interested in whether all the logs were for mismatched sizes. We will do some further work to confirm that the fix is in the 20.2.0 tentacle release and that there are no problems with the backport.
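
As a rough illustration (the chunk size here is just an example, not necessarily your profile's setting): with an 8+2 profile and a 4 KiB chunk, the stripe_width is 8 x 4 KiB = 32 KiB, so a 100 KiB object written before turning on optimizations is padded up to 128 KiB, while the same object written afterwards stays at 100 KiB; the confused scrub ends up comparing the two conventions and flags a mismatch even though the data is fine. Your pools' stripe_width values are visible in the output of:

  ceph osd pool ls detail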

As far as I could see, the errors were all about mismatched sizes.


You also mention that you saw some OSD crashes. Do you have any further
information about these?
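
If the crashed OSDs left crash reports, the output of the crash module would help, e.g.:

  ceph crash ls
  ceph crash info <crash-id>    # <crash-id> taken from the crash ls output

along with any assert/backtrace lines from the OSD log.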

I was just able to cause another crash by starting up a Windows VM and doing a Windows File History backup to a drive that is on an RBD image on pool rbd_ecpool where I had the allow_ec_optimization flag enabled. The libvirt disk definition is:

<disk type="network" device="disk">
   <driver name="qemu" type="raw" cache="writethrough" io="threads" discard="unmap" detect_zeroes="unmap"/>
   <auth username="admin">
     <secret type="ceph" uuid="878b0bc5-c471-4ec6-a92a-f65282ffbdf6"/>
   </auth>
   <source protocol="rbd" name="rbd/game_windows_backup_drive" index="3">
     <host name="zephir" port="6789"/>
   </source>
   <target dev="vdd" bus="virtio"/>
   <alias name="virtio-disk3"/>
   <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
</disk>
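
In case it is useful, the image's data pool can be confirmed with:

  rbd info rbd/game_windows_backup_drive

which shows a data_pool line when a separate EC data pool is in use.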

  root@zephir:~# ceph -s
  cluster:
    id:     27923302-87a5-11ec-ac5b-976d21a49941
    health: HEALTH_WARN
            2 osds down
            Reduced data availability: 49 pgs inactive
            Degraded data redundancy: 15722192/79848340 objects degraded (19.690%), 246 pgs degraded, 247 pgs undersized

  services:
    mon:           3 daemons, quorum zephir,debian,raspi (age 20h) [leader: zephir]
    mgr:           zephir.enywvy(active, since 21h), standbys: debian.nusuye
    mds:           3/3 daemons up, 3 standby
    osd:           18 osds: 16 up (since 5m), 18 in (since 3d); 8 remapped pgs
    cephfs-mirror: 2 daemons active (2 hosts)
    rbd-mirror:    2 daemons active (2 hosts)
    rgw:           1 daemon active (1 hosts, 1 zones)
    tcmu-runner:   5 portals active (2 hosts)

  data:
    volumes: 3/3 healthy
    pools:   25 pools, 450 pgs
    objects: 17.52M objects, 62 TiB
    usage:   116 TiB used, 63 TiB / 179 TiB avail
    pgs:     10.889% pgs not active
             15722192/79848340 objects degraded (19.690%)
             197 active+undersized+degraded
             183 active+clean
             49  undersized+degraded+peered
             15  active+clean+scrubbing
             5   active+clean+scrubbing+deep
             1   active+undersized

  io:
    client:   1.7 KiB/s rd, 1023 B/s wr, 1 op/s rd, 0 op/s wr

root@zephir:~# ceph osd status
ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
0  debian  11.2T  5289G      0        0       0        0   exists,up
1  zephir  11.6T  4846G      0        0       0        0   exists
2  zephir  10.7T  5827G      0        0       2        4   exists,up
3  zephir  11.0T  5497G      0        0       0        0   exists
4  zephir   159G  1376G      0        0       2        0   exists,up
5  zephir   120G  1415G      0        0       0        0   exists,up
6  debian  11.3T  5200G      0        0       1        0   exists,up
7  zephir   305G  1230G      0        0       0        0   exists,up
8  zephir  10.9T  5594G      0        0       0        0   exists,up
9  zephir  11.9T  4616G      0        0       0        0   exists,up
10  zephir   235G  1300G      0        0       0        0   exists,up
11  zephir   223G  1312G      0        0       0        0   exists,up
12  zephir  11.0T  5479G      0        0       0        0   exists,up
13  debian   118G   631G      0        0       3        8   exists,up
14  debian   345G  1190G      0        0       0        0   exists,up
15  debian  11.4T  5079G      0        0       0        0   exists,up
16  debian   554G  3029G      0        0       3        8   exists,up
17  debian  12.5T  5874G      0        0       1        0   exists,up
root@zephir:~#

I attached the log of osd.1 to
https://filebin.net/jwvs6kuqrc7hx8id

Best Regards,

Reto



_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]