Hi,

thanks for your reply.

Mounting was done without the 'root_squash' option; here is the corresponding fstab entry:

192.168.251.2,192.168.251.3,192.168.251.4,192.168.251.5,192.168.251.6,192.168.251.7:/ /cephfs ceph name=gpu01,secretfile=/etc/ceph/gpu01.key,noatime,_netdev,recover_session=clean,read_from_replica=balance 0 0
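
As far as I understand, root_squash could also be enforced through the
client's MDS caps rather than the mount options, so checking the effective
mount options and the caps for the gpu01 client should rule that out as
well, e.g.:

grep /cephfs /proc/mounts
ceph auth get client.gpu01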

Dominik


On 2026-02-12 09:51, goetze wrote:
Hi !

Have you mounted cephfs with the 'root_squash' option set? If so,
remove that option. I may be wrong here, but as far as I know, it is
still considered unsafe and can lead to data corruption, since the
necessary code changes have not yet made it into the mainline Linux
kernel.
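
You can see which options are actually in effect on the client with
something like:

grep ceph /proc/mounts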

Carsten
 ------------------------------------------------------------------
Carsten Goetze
Computer Graphics          tel:   +49 531 391-2109
TU Braunschweig            fax:   +49 531 391-2103
Muehlenpfordtstr. 23       eMail: [email protected]
D-38106 Braunschweig       http://www.cg.cs.tu-bs.de/people/goetze

On 12.02.2026 at 07:15, Dominik Baack via ceph-users
<[email protected]> wrote:

Hi,

I have now noticed that files are still being actively corrupted /
replaced by empty files when they are opened and saved.

Access was done via Ceph 19.2.1 from Ubuntu 24 with a kernel mount.
The Ceph servers are running 20.2 Tentacle (deployed via cephadm)
on Ubuntu 24.04 hosts.

The flags currently set are noout, norebalance, noscrub and
nodeep-scrub.
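
Those were set via 'ceph osd set <flag>'; once a backup exists and things
look sane again, the plan is to clear them with the corresponding unset
commands:

ceph osd unset noout
ceph osd unset norebalance
ceph osd unset noscrub
ceph osd unset nodeep-scrub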

I am currently setting up a read-only mount to copy all existing
data for a backup.
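
The read-only mount will just mirror the fstab line above with 'ro' added,
roughly (monitor address and mount point here are placeholders):

mount -t ceph 192.168.251.2:/ /mnt/cephfs-ro -o ro,name=gpu01,secretfile=/etc/ceph/gpu01.key,noatime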

I currently have no clue what is going on; so far I have only been able
to observe this behavior on the client nodes. Since those were upgraded
as well (MLNX driver, nvidia-fs, ...), could it be a network issue?

Any idea how to recover from this?

Cheers
Dominik

On 11.02.2026 at 18:44, dominik.baack via ceph-users wrote:

Hi,

after a controlled shutdown of the whole cluster due to external
circumstances, we decided to update from 19.2 to 20.2 after the
restart. The system was healthy before and after the update.
The nodes mounting the filesystem were not equally lucky and were
partially shut down hard. Storage was kept running for an additional
~30 min after the node shutdown, so all in-flight operations should
have finished.

Now we have discovered that some of the user files seem to have been
replaced with zeros. For example:

stat .gitignore
File: .gitignore
Size: 4429            Blocks: 9          IO Block: 4194304   regular file
Device: 0,48    Inode: 1100241384598  Links: 1

hexdump -C .gitignore
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001140  00 00 00 00 00 00 00 00  00 00 00 00 00  |.............|
0000114d

Scanning for files containing only zeros turns up several more such
files, which were likely accessed before or during the shutdown of
the nodes.
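
For reference, a rough bash sketch of the kind of scan I mean (the /cephfs
path is a placeholder); it flags files whose content is entirely NUL bytes:

find /cephfs -type f -size +0c -print0 |
while IFS= read -r -d '' f; do
    # compare the file against the same number of bytes from /dev/zero
    if cmp -s "$f" <(head -c "$(stat -c %s "$f")" /dev/zero); then
        echo "all zeros: $f"
    fi
done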

How should I proceed from here?
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]