Hi David,

I simply copied the data to a new EC pool with the EC optimizations enabled.  At the time, I wanted to do whatever seemed to present the least risk.  I now believe that simply marking the OSDs out, letting them empty, and then marking them back in to be backfilled would also have worked.  The only time I've had OSDs crash was when I ran an fstrim.  I haven't tried to dig into it much.  An fstrim still causes OSD crashes even on a newly created pool with the optimizations enabled.  The new pools are on the same OSDs; none of them have been destroyed/recreated.  If fstrim really is what causes my OSD crashes, could you perhaps have some filesystems mounted with the discard option that trigger the same problem?
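
For what it's worth, the out/backfill/in approach I mean is roughly the following (just a sketch; the OSD id is a placeholder, and you'd want to do one OSD, or one failure domain, at a time):

  # drain one OSD so its data gets rewritten elsewhere
  ceph osd out 3
  # watch recovery until the OSD is empty and the cluster is healthy again
  ceph -s
  ceph osd df
  # bring it back in so the rewritten data is backfilled onto it
  ceph osd in 3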

---
Jeff


On 12/14/2025 8:40 AM, David Walter wrote:
Hi Jeff,

What migration did you do, after which you say the mismatches went away? Did the crashing of OSDs also stop?

Migration to a new pool on the same or different OSDs? With or without EC optimization enabled? Or re-writing the data in the same pool by purging one OSD at a time?

Best,

David

On 12/5/25 11:02, Jeff Bailey via ceph-users wrote:
I've seen OSDs crash when running an fstrim on an RBD with an EC data pool with the EC optimizations enabled.  It's always two OSDs.  It's an 8+2 pool, but I don't know whether the "2"s are a coincidence or not.  Other than that, I haven't seen any OSD crashes with EC optimizations.  We had the same size mismatches, and they went away after migrating the data.
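
In case someone wants to try reproducing it, for us the trigger is essentially just this (a rough sketch; the image, data pool and mount point names are made up, and the image is a throwaway):

  # scratch RBD image whose data lives in the EC pool
  rbd create --size 10G --data-pool ec_data_pool rbd/trimtest
  rbd map rbd/trimtest
  mkfs.ext4 /dev/rbd/rbd/trimtest
  mkdir -p /mnt/trimtest && mount /dev/rbd/rbd/trimtest /mnt/trimtest
  # write and delete something so there is space to discard
  dd if=/dev/zero of=/mnt/trimtest/junk bs=4M count=512 && rm /mnt/trimtest/junk
  # this is the step that has taken OSDs down for us
  fstrim -v /mnt/trimtest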


On 12/5/2025 10:34 AM, Reto Gysi via ceph-users wrote:
Hi Bill

On Fri, Dec 5, 2025 at 10:02 AM, Bill Scales <[email protected]> wrote:

Hi Reto,

Sorry to hear about your problems with turning on ec optimizations. I've led the team that developed this feature, so we are keen to help understand
what happened to your system.

Your configuration looks fine as far as being supported with ec optimizations. The daemons (mons, osds and mgr) need to be running tentacle code to use this feature; there is no requirement to update any of your clients.
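
A quick way to confirm that is to check that every daemon reports a 20.x (tentacle) build, e.g.:

  # every mon, mgr and osd entry should show a 20.x version
  ceph versions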

Do you have any more examples of the inconsistent objects logs that you
can share with me?
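
If it is easier than pulling them from the cluster log, the inconsistency details (including the per-shard sizes) can be dumped directly; the pool name and PG id below are placeholders:

  rados list-inconsistent-pg <pool-name>
  rados list-inconsistent-obj <pg-id> --format=json-pretty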

There was a bug that we fixed late in the development cycle where scrubbing incorrectly reported a size mismatch for objects written prior to turning on optimizations. This is because, prior to turning on optimizations, objects are padded to a multiple of the stripe_width; afterwards, objects are no longer padded. The scrub code was occasionally getting confused and incorrectly reporting an inconsistency. In this case the scrub was reporting false positive errors - there was nothing wrong with the data and no problems accessing it. I notice the log you shared in the email was for a size mismatch; I'm interested in whether all the logs were for mismatched sizes. We will do some further work to confirm that the fix is in the 20.2.0 tentacle release and that there are no problems with the backport.
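
As a concrete illustration (the 4 KiB stripe_unit here is just an example, not necessarily your profile): with 8+2 and a 4 KiB stripe_unit, stripe_width is 8 x 4 KiB = 32 KiB, so a 10 KiB object written before optimizations were enabled sits on disk padded to 32 KiB, while the same object written afterwards is stored at 10 KiB; the buggy scrub compared the two conventions and flagged a size mismatch even though the data itself was fine. You can check the stripe_width of your pool and the profile parameters with:

  ceph osd pool ls detail                              # each EC pool line includes its stripe_width
  ceph osd erasure-code-profile get <profile-name>     # shows k, m (and stripe_unit if set in the profile)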

As far as I could see, the errors were all about mismatched sizes.


You also mention that you saw some OSD crashes. Do you have any further
information about these?

I was just able to cause another crash by starting up a Windows VM and doing a Windows File History backup to a drive that is on an RBD image on pool rbd_ecpool, where I had the allow_ec_optimization flag enabled. This is the libvirt disk definition for that drive:

<disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="writethrough" io="threads" discard="unmap" detect_zeroes="unmap"/>
  <auth username="admin">
    <secret type="ceph" uuid="878b0bc5-c471-4ec6-a92a-f65282ffbdf6"/>
  </auth>
  <source protocol="rbd" name="rbd/game_windows_backup_drive" index="3">
    <host name="zephir" port="6789"/>
  </source>
  <target dev="vdd" bus="virtio"/>
  <alias name="virtio-disk3"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
</disk>

  root@zephir:~# ceph -s
  cluster:
    id:     27923302-87a5-11ec-ac5b-976d21a49941
    health: HEALTH_WARN
            2 osds down
            Reduced data availability: 49 pgs inactive
            Degraded data redundancy: 15722192/79848340 objects degraded (19.690%), 246 pgs degraded, 247 pgs undersized

  services:
    mon:           3 daemons, quorum zephir,debian,raspi (age 20h) [leader: zephir]
    mgr:           zephir.enywvy(active, since 21h), standbys: debian.nusuye
    mds:           3/3 daemons up, 3 standby
    osd:           18 osds: 16 up (since 5m), 18 in (since 3d); 8 remapped pgs
    cephfs-mirror: 2 daemons active (2 hosts)
    rbd-mirror:    2 daemons active (2 hosts)
    rgw:           1 daemon active (1 hosts, 1 zones)
    tcmu-runner:   5 portals active (2 hosts)

  data:
    volumes: 3/3 healthy
    pools:   25 pools, 450 pgs
    objects: 17.52M objects, 62 TiB
    usage:   116 TiB used, 63 TiB / 179 TiB avail
    pgs:     10.889% pgs not active
             15722192/79848340 objects degraded (19.690%)
             197 active+undersized+degraded
             183 active+clean
             49  undersized+degraded+peered
             15  active+clean+scrubbing
             5   active+clean+scrubbing+deep
             1   active+undersized

  io:
    client:   1.7 KiB/s rd, 1023 B/s wr, 1 op/s rd, 0 op/s wr

root@zephir:~# ceph osd status
ID  HOST     USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA STATE
 0  debian  11.2T  5289G      0        0       0        0 exists,up
 1  zephir  11.6T  4846G      0        0       0        0 exists
 2  zephir  10.7T  5827G      0        0       2        4 exists,up
 3  zephir  11.0T  5497G      0        0       0        0 exists
 4  zephir   159G  1376G      0        0       2        0 exists,up
 5  zephir   120G  1415G      0        0       0        0 exists,up
 6  debian  11.3T  5200G      0        0       1        0 exists,up
 7  zephir   305G  1230G      0        0       0        0 exists,up
 8  zephir  10.9T  5594G      0        0       0        0 exists,up
 9  zephir  11.9T  4616G      0        0       0        0 exists,up
10  zephir   235G  1300G      0        0       0        0 exists,up
11  zephir   223G  1312G      0        0       0        0 exists,up
12  zephir  11.0T  5479G      0        0       0        0 exists,up
13  debian   118G   631G      0        0       3        8 exists,up
14  debian   345G  1190G      0        0       0        0 exists,up
15  debian  11.4T  5079G      0        0       0        0 exists,up
16  debian   554G  3029G      0        0       3        8 exists,up
17  debian  12.5T  5874G      0        0       1        0 exists,up
root@zephir:~#

I uploaded the log of osd.1 to
https://filebin.net/jwvs6kuqrc7hx8id

Best Regards,

Reto



_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]