Hi,
Thank you very much.
I will change the mounting options as sketched below and report back if we
still have any problems afterwards. So far it seems we were very lucky and
most files are reproducible.
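Concretely, the idea is to keep the fstab entry quoted below and only set
read_from_replica=no instead of =balance (a sketch only; whether that is
enough to avoid the linked issue is not confirmed):
192.168.251.2,192.168.251.3,192.168.251.4,192.168.251.5,192.168.251.6,192.168.251.7:/
/cephfs ceph
name=gpu01,secretfile=/etc/ceph/gpu01.key,noatime,_netdev,recover_session=clean,read_from_replica=no
0 0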
Cheers
Dominik
On 12.02.2026 at 11:53, Eugen Block wrote:
Hi,
you seem to be hitting this issue [0] with the read_from_replica option;
see the announcement in [1]. I haven't looked at it in detail, so I'm not
sure whether there's a way to fix it or whether read_from_replica=balance
has the same effect.
Regards,
Eugen
[0] https://tracker.ceph.com/issues/73997
[1]
https://lists.ceph.io/hyperkitty/list/[email protected]/thread/JI2ZRF7A3PW55BTH5TFMHNFCZUITYAJJ/
On Thu, 12 Feb 2026 at 10:46, dominik.baack via
ceph-users <[email protected]> wrote:
Hi,
thanks for your reply.
Mounting was done without the 'root_squash' option; here is the
corresponding fstab entry:
192.168.251.2,192.168.251.3,192.168.251.4,192.168.251.5,192.168.251.6,192.168.251.7:/
/cephfs ceph
name=gpu01,secretfile=/etc/ceph/gpu01.key,noatime,_netdev,recover_session=clean,read_from_replica=balance
0 0
Dominik
On 2026-02-12 09:51, goetze wrote:
> Hi !
>
> Have you mounted cephfs with the 'root_squash' option set? If so,
> remove that option. I may be wrong here, but as far as I know, this is
> still considered unsafe and can lead to data corruption, since the
> necessary code changes have not yet made it into the mainline Linux
> kernel.
>
> Carsten
> ------------------------------------------------------------------
> Carsten Goetze
> Computer Graphics tel: +49 531 391-2109
> TU Braunschweig fax: +49 531 391-2103
> Muehlenpfordtstr. 23 eMail: [email protected]
> D-38106 Braunschweig http://www.cg.cs.tu-bs.de/people/goetze
>
>> On 12.02.2026 at 07:15, Dominik Baack via ceph-users
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> I have now noticed that files are still being actively corrupted /
>> replaced by empty files when they are opened and saved.
>>
>> Access was done via Ceph 19.2.1 from Ubuntu 24 using the kernel mount.
>> The Ceph servers are running 20.2 Tentacle (deployed via cephadm)
>> on an Ubuntu 24.04 host.
>>
>> The flags currently set are noout, norebalance, noscrub and
>> nodeep-scrub.
>>
>> I am currently setting up a read-only mount to copy all existing
>> data for a backup.
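>>
>> For reference, something along these lines should give the read-only
>> mount (a sketch; /mnt/cephfs-ro is a placeholder mount point, monitor,
>> name and secretfile taken from the fstab entry above):
>>
>> mount -t ceph 192.168.251.2:/ /mnt/cephfs-ro -o name=gpu01,secretfile=/etc/ceph/gpu01.key,ro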
>>
>> I currently have no clue what is going on; so far I have only been
>> able to observe this behavior on the nodes. Since those were upgraded
>> as well (MLNX driver, nvidia-fs, ...), could it be a network issue?
>>
>> Any idea how to recover from this?
>>
>> Cheers
>> Dominik
>>
>> On 11.02.2026 at 18:44, dominik.baack via ceph-users wrote:
>>
>>> Hi,
>>>
>>> after a controlled shutdown of the whole cluster due to external
>>> circumstances, we decided to update from 19.2 to 20.2 after the
>>> restart. The system was healthy before and after the update.
>>> The nodes mounting the filesystem were not equally lucky and were
>>> partially shut down hard. The storage was kept running for an
>>> additional ~30 min after the node shutdown, so all in-flight
>>> operations should have finished.
>>>
>>> Now we are discovering that some of the user files seem to have
>>> been replaced with zeros. For example:
>>>
>>> stat .gitignore
>>> File: .gitignore
>>> Size: 4429 Blocks: 9 IO Block: 4194304 regular file
>>> Device: 0,48 Inode: 1100241384598 Links: 1
>>>
>>> hexdump -C .gitignore
>>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>> |................|
>>> *
>>> 00001140 00 00 00 00 00 00 00 00 00 00 00 00 00 |.............|
>>> 0000114d
>>>
>>> Scanning for files containing only zeros shows several affected
>>> files that were likely accessed before or during the shutdown of
>>> the nodes.
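>>>
>>> For reference, such a scan can be done along these lines (a sketch;
>>> /cephfs is the mount point from the fstab entry, GNU stat and cmp
>>> assumed):
>>>
>>> find /cephfs -type f -size +0c -print0 |
>>> while IFS= read -r -d '' f; do
>>>     # cmp -n limits the comparison to the file's size;
>>>     # /dev/zero supplies the all-zero reference
>>>     if cmp -s -n "$(stat -c %s "$f")" "$f" /dev/zero; then
>>>         printf 'all zeros: %s\n' "$f"
>>>     fi
>>> done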
>>>
>>> How should I proceed from here?
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]