I've seen OSDs crash when running an fstrim on an RBD with an EC data
pool with the EC optimizations enabled. Always two OSDs. It's an 8+2
pool, but I don't know whether the "2"s are a coincidence or not. Other
than that, I haven't seen any OSD crashes with EC optimizations. We had
the same size mismatches, and they went away after migrating the data.
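The fstrim workload that triggers it for us looks roughly like the
following sketch (pool, image and mount point names here are just
placeholders, not our real ones):

    rbd create --size 100G --data-pool ec_data_pool rbd/trim-test
    rbd map rbd/trim-test            # shows up as e.g. /dev/rbd0
    mkfs.ext4 /dev/rbd0
    mount /dev/rbd0 /mnt/trim-test
    fstrim -v /mnt/trim-test         # the discards issued here are when the two OSDs go down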
On 12/5/2025 10:34 AM, Reto Gysi via ceph-users wrote:
Hi Bill
On Fri, 5 Dec 2025 at 10:02, Bill Scales <[email protected]> wrote:
Hi Reto,
Sorry to hear about your problems with turning on ec optimizations. I've
led the team that developed this feature, so we are keen to help understand
what happened to your system.
Your configuration looks fine as far as support for ec optimizations goes.
The daemons (mons, osds and mgr) need to be running tentacle code to use
this feature; there is no requirement to update any of your clients.
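A quick way to double-check that everything is on tentacle code is
something like the following (just the stock command, nothing specific to
this feature):

    ceph versions    # summarises, per daemon type, which release each daemon is running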
Do you have any more examples of the inconsistent-object logs that you
can share with me?
There was a bug that we fixed late in the development cycle where scrubbing
incorrectly reported a size mismatch for objects written prior to turning on
optimizations. This is because, prior to turning on optimizations, objects
are padded to a multiple of the stripe_width; afterwards objects are no
longer padded. The scrub code was occasionally getting confused and
incorrectly reporting an inconsistency. In this case the scrub was reporting
false positive errors - there was nothing wrong with the data and no
problems accessing the data. I notice the log you shared in the email was
for a size mismatch; I'm interested in whether all the logs were for
mismatched sizes. We will do some further work to confirm that the fix is
in the 20.2.0 tentacle release and that there are no problems with the
backport.
As far as I could see, the errors were all about mismatched sizes.
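If I understand the padding explanation correctly, that would fit: taking,
purely as an illustration, an 8+2 profile with a 4 KiB chunk size, the
stripe_width is 32 KiB, so a 40 KiB object written before enabling the
optimizations would be padded to 64 KiB on disk, while the same object
written afterwards stays at 40 KiB - exactly the kind of size difference a
confused scrub could flag.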
You also mention that you saw some OSD crashes. Do you have any further
information about these?
I was just able to cause another crash by starting up a Windows VM and
running a Windows File History backup to a drive that is on an RBD image
on pool rbd_ecpool, where I had the allow_ec_optimization flag enabled.
<disk type="network" device="disk">
  <driver name="qemu" type="raw" cache="writethrough" io="threads" discard="unmap" detect_zeroes="unmap"/>
  <auth username="admin">
    <secret type="ceph" uuid="878b0bc5-c471-4ec6-a92a-f65282ffbdf6"/>
  </auth>
  <source protocol="rbd" name="rbd/game_windows_backup_drive" index="3">
    <host name="zephir" port="6789"/>
  </source>
  <target dev="vdd" bus="virtio"/>
  <alias name="virtio-disk3"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
</disk>
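In case it helps to narrow things down, I can confirm from this end which
data pool the image actually uses and what is set on the pool, along these
lines:

    rbd info rbd/game_windows_backup_drive     # should show the image's data_pool
    ceph osd pool ls detail | grep rbd_ecpool  # shows the pool's settings and flags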
root@zephir:~# ceph -s
  cluster:
    id:     27923302-87a5-11ec-ac5b-976d21a49941
    health: HEALTH_WARN
            2 osds down
            Reduced data availability: 49 pgs inactive
            Degraded data redundancy: 15722192/79848340 objects degraded (19.690%), 246 pgs degraded, 247 pgs undersized

  services:
    mon:           3 daemons, quorum zephir,debian,raspi (age 20h) [leader: zephir]
    mgr:           zephir.enywvy(active, since 21h), standbys: debian.nusuye
    mds:           3/3 daemons up, 3 standby
    osd:           18 osds: 16 up (since 5m), 18 in (since 3d); 8 remapped pgs
    cephfs-mirror: 2 daemons active (2 hosts)
    rbd-mirror:    2 daemons active (2 hosts)
    rgw:           1 daemon active (1 hosts, 1 zones)
    tcmu-runner:   5 portals active (2 hosts)

  data:
    volumes: 3/3 healthy
    pools:   25 pools, 450 pgs
    objects: 17.52M objects, 62 TiB
    usage:   116 TiB used, 63 TiB / 179 TiB avail
    pgs:     10.889% pgs not active
             15722192/79848340 objects degraded (19.690%)
             197 active+undersized+degraded
             183 active+clean
             49  undersized+degraded+peered
             15  active+clean+scrubbing
             5   active+clean+scrubbing+deep
             1   active+undersized

  io:
    client: 1.7 KiB/s rd, 1023 B/s wr, 1 op/s rd, 0 op/s wr
root@zephir:~# ceph osd status
ID  HOST    USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  debian  11.2T  5289G       0        0       0        0  exists,up
 1  zephir  11.6T  4846G       0        0       0        0  exists
 2  zephir  10.7T  5827G       0        0       2        4  exists,up
 3  zephir  11.0T  5497G       0        0       0        0  exists
 4  zephir   159G  1376G       0        0       2        0  exists,up
 5  zephir   120G  1415G       0        0       0        0  exists,up
 6  debian  11.3T  5200G       0        0       1        0  exists,up
 7  zephir   305G  1230G       0        0       0        0  exists,up
 8  zephir  10.9T  5594G       0        0       0        0  exists,up
 9  zephir  11.9T  4616G       0        0       0        0  exists,up
10  zephir   235G  1300G       0        0       0        0  exists,up
11  zephir   223G  1312G       0        0       0        0  exists,up
12  zephir  11.0T  5479G       0        0       0        0  exists,up
13  debian   118G   631G       0        0       3        8  exists,up
14  debian   345G  1190G       0        0       0        0  exists,up
15  debian  11.4T  5079G       0        0       0        0  exists,up
16  debian   554G  3029G       0        0       3        8  exists,up
17  debian  12.5T  5874G       0        0       1        0  exists,up
root@zephir:~#
I uploaded the log of osd.1 to
https://filebin.net/jwvs6kuqrc7hx8id
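If it is useful, I can also pull the crash metadata for the two OSDs that
went down, something like:

    ceph crash ls                # lists recent daemon crashes with their IDs
    ceph crash info <crash-id>   # prints backtrace and metadata for one crash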
Best Regards,
Reto
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]