Hi David,

I simply copied the data to a new EC pool with the EC optimizations enabled.  At the time, I wanted to do whatever seemed to present the least risk.  I now believe that simply marking the OSDs out, letting them empty, and then marking them back in to be backfilled would also have worked.  The only time I've had OSDs crash was when I ran an fstrim.  I haven't tried to dig into it much.  An fstrim still causes OSD crashes even on a newly created pool with the optimizations enabled.  The new pools are on the same OSDs; none of them have been destroyed/recreated.  If fstrim really is what causes my OSD crashes, could you perhaps have some filesystems mounted with the discard option that trigger the same problem?
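
For what it's worth, the out/backfill/in approach I mean is roughly the following (just a sketch; the OSD id is a placeholder, and you'd want to do one OSD, or one failure domain, at a time):

  # drain one OSD so its data gets rewritten elsewhere
  ceph osd out 3
  # watch recovery until the OSD is empty and the cluster is healthy again
  ceph -s
  ceph osd df
  # bring it back in so the rewritten data is backfilled onto it
  ceph osd in 3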

---
Jeff


On 12/14/2025 8:40 AM, David Walter wrote:
Hi Jeff,

What migration did you do, after which you say the mismatches went away? Did the crashing of OSDs also stop?

Migration to a new pool on the same or different OSDs? With or without EC optimization enabled? Or re-writing the data in the same pool by purging one OSD at a time?

Best,

David

On 12/5/25 11:02, Jeff Bailey via ceph-users wrote:
I've seen OSDs crash when running an fstrim on an RBD with an EC data pool with the EC optimizations enabled.  It's always two OSDs.  It's an 8+2 pool, but I don't know whether the "2"s are a coincidence or not.  Other than that, I haven't seen any OSD crashes with EC optimizations.  We had the same size mismatches, and they went away after migrating the data.
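
In case someone wants to try reproducing it, for us the trigger is essentially just this (a rough sketch; the image, data pool and mount point names are made up, and the image is a throwaway):

  # scratch RBD image whose data lives in the EC pool
  rbd create --size 10G --data-pool ec_data_pool rbd/trimtest
  rbd map rbd/trimtest
  mkfs.ext4 /dev/rbd/rbd/trimtest
  mkdir -p /mnt/trimtest && mount /dev/rbd/rbd/trimtest /mnt/trimtest
  # write and delete something so there is space to discard
  dd if=/dev/zero of=/mnt/trimtest/junk bs=4M count=512 && rm /mnt/trimtest/junk
  # this is the step that has taken OSDs down for us
  fstrim -v /mnt/trimtest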


On 12/5/2025 10:34 AM, Reto Gysi via ceph-users wrote:
Hi Bill

On Fri, Dec 5, 2025 at 10:02 AM, Bill Scales <[email protected]> wrote:

Hi Reto,

Sorry to hear about your problems with turning on ec optimizations. I've led the team that developed this feature, so we are keen to help understand
what happened to your system.

Your configuration looks fine as far as being supported with ec optimizations. The daemons (mons, osds and mgr) need to be running tentacle code to use this feature; there is no requirement to update any of your clients.
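
A quick way to confirm that is to check that every daemon reports a 20.x (tentacle) build, e.g.:

  # every mon, mgr and osd entry should show a 20.x version
  ceph versions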

Do you have any more examples of the inconsistent objects logs that you
can share with me?
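
If it is easier than pulling them from the cluster log, the inconsistency details (including the per-shard sizes) can be dumped directly; the pool name and PG id below are placeholders:

  rados list-inconsistent-pg <pool-name>
  rados list-inconsistent-obj <pg-id> --format=json-pretty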

There was a bug that we fixed late in the development cycle where scrubbing incorrectly reported a size mismatch for objects written prior to turning on optimizations. This is because, prior to turning on optimizations, objects are padded to a multiple of the stripe_width; afterwards, objects are no longer padded. The scrub code was occasionally getting confused and incorrectly reporting an inconsistency. In this case the scrub was reporting false positive errors - there was nothing wrong with the data and no problems accessing it. I notice the log you shared in the email was for a size mismatch; I'm interested in whether all the logs were for mismatched sizes. We will do some further work to confirm that the fix is in the 20.2.0 tentacle release and that there are no problems with the backport.
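
As a concrete illustration (the 4 KiB stripe_unit here is just an example, not necessarily your profile): with 8+2 and a 4 KiB stripe_unit, stripe_width is 8 x 4 KiB = 32 KiB, so a 10 KiB object written before optimizations were enabled sits on disk padded to 32 KiB, while the same object written afterwards is stored at 10 KiB; the buggy scrub compared the two conventions and flagged a size mismatch even though the data itself was fine. You can check the stripe_width of your pool and the profile parameters with:

  ceph osd pool ls detail                              # each EC pool line includes its stripe_width
  ceph osd erasure-code-profile get <profile-name>     # shows k, m (and stripe_unit if set in the profile)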

As far as I could see, the errors were all about mismatched sizes.


You also mention that you saw some OSD crashes. Do you have any further
information about these?

I was just able to cause another crash by starting up a Windows VM and doing a Windows File History backup to a drive that is on an RBD image on pool rbd_ecpool, where I had the allow_ec_optimization flag enabled. This is the libvirt disk definition for that drive:

<disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="writethrough" io="threads" discard="unmap" detect_zeroes="unmap"/>
  <auth username="admin">
    <secret type="ceph" uuid="878b0bc5-c471-4ec6-a92a-f65282ffbdf6"/>
  </auth>
  <source protocol="rbd" name="rbd/game_windows_backup_drive" index="3">
    <host name="zephir" port="6789"/>
  </source>
  <target dev="vdd" bus="virtio"/>
  <alias name="virtio-disk3"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
</disk>

  root@zephir:~# ceph -s
  cluster:
    id:     27923302-87a5-11ec-ac5b-976d21a49941
    health: HEALTH_WARN
            2 osds down
            Reduced data availability: 49 pgs inactive
            Degraded data redundancy: 15722192/79848340 objects degraded (19.690%), 246 pgs degraded, 247 pgs undersized

  services:
    mon:           3 daemons, quorum zephir,debian,raspi (age 20h) [leader: zephir]
    mgr:           zephir.enywvy(active, since 21h), standbys: debian.nusuye
    mds:           3/3 daemons up, 3 standby
    osd:           18 osds: 16 up (since 5m), 18 in (since 3d); 8 remapped pgs
    cephfs-mirror: 2 daemons active (2 hosts)
    rbd-mirror:    2 daemons active (2 hosts)
    rgw:           1 daemon active (1 hosts, 1 zones)
    tcmu-runner:   5 portals active (2 hosts)

  data:
    volumes: 3/3 healthy
    pools:   25 pools, 450 pgs
    objects: 17.52M objects, 62 TiB
    usage:   116 TiB used, 63 TiB / 179 TiB avail
    pgs:     10.889% pgs not active
             15722192/79848340 objects degraded (19.690%)
             197 active+undersized+degraded
             183 active+clean
             49  undersized+degraded+peered
             15  active+clean+scrubbing
             5   active+clean+scrubbing+deep
             1   active+undersized

  io:
    client:   1.7 KiB/s rd, 1023 B/s wr, 1 op/s rd, 0 op/s wr

root@zephir:~# ceph osd status
ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA STATE
 0  debian  11.2T  5289G      0        0       0        0 exists,up
 1  zephir  11.6T  4846G      0        0       0        0 exists
 2  zephir  10.7T  5827G      0        0       2        4 exists,up
 3  zephir  11.0T  5497G      0        0       0        0 exists
 4  zephir   159G  1376G      0        0       2        0 exists,up
 5  zephir   120G  1415G      0        0       0        0 exists,up
 6  debian  11.3T  5200G      0        0       1        0 exists,up
 7  zephir   305G  1230G      0        0       0        0 exists,up
 8  zephir  10.9T  5594G      0        0       0        0 exists,up
 9  zephir  11.9T  4616G      0        0       0        0 exists,up
10  zephir   235G  1300G      0        0       0        0 exists,up
11  zephir   223G  1312G      0        0       0        0 exists,up
12  zephir  11.0T  5479G      0        0       0        0 exists,up
13  debian   118G   631G      0        0       3        8 exists,up
14  debian   345G  1190G      0        0       0        0 exists,up
15  debian  11.4T  5079G      0        0       0        0 exists,up
16  debian   554G  3029G      0        0       3        8 exists,up
17  debian  12.5T  5874G      0        0       1        0 exists,up
root@zephir:~#

I uploaded the log of osd.1 to
https://filebin.net/jwvs6kuqrc7hx8id

Best Regards,

Reto



_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]