Hi Gregory,

Thanks for the reply. I have the dump of the metadata pool, but I'm not sure what I should be checking in it. Is that what you meant?
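
In the meantime, this is roughly how I'm trying to look at the journal objects directly, assuming the journal headers for ranks 0 and 1 live in objects 200.00000000 and 201.00000000 of the metadata pool (please correct me if I should be looking elsewhere):

# list the journal header/segment objects for both ranks
rados -p cephfs_metadata ls | egrep '^20[01]\.' | sort

# check whether the header objects are readable at all
rados -p cephfs_metadata stat 200.00000000
rados -p cephfs_metadata stat 201.00000000
rados -p cephfs_metadata get 200.00000000 journal.0.header.bin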

The cluster was operational until today at noon, when a full restart of the daemons was issued, as has been done many times in the past. I issued the repaired command hoping to surface a real error in the logs, but apparently that did not happen.

Thanks,


    Alessandro


On 11/07/18 18:22, Gregory Farnum wrote:
Have you checked the actual journal objects as the "journal export" suggested? Did you identify any actual source of the damage before issuing the "repaired" command?
What is the history of the filesystems on this cluster?

On Wed, Jul 11, 2018 at 8:10 AM Alessandro De Salvo <alessandro.desa...@roma1.infn.it> wrote:

    Hi,

    after the upgrade to Luminous 12.2.6 today, all our MDSes have been
    marked as damaged. Trying to restart the instances only results in
    standby MDSes. We currently have 2 active filesystems, with 2 MDSes
    each.

    I found the following error messages in the mon:


    mds.0 <node1_IP>:6800/2412911269 down:damaged
    mds.1 <node2_IP>:6800/830539001 down:damaged
    mds.0 <node3_IP>:6800/4080298733 down:damaged
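
    For reference, this is roughly how I checked which ranks are marked as
    damaged (I'm not sure it's the most direct way, so just a sketch):

    ceph health detail
    ceph fs dump | grep -i damaged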


    Whenever I try to force the repaired state with ceph mds repaired
    <fs_name>:<rank> I get something like this in the MDS logs:


    2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro) error getting journal off disk
    2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x201: (5) Input/output error
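
    For completeness, this is the form of the command I'm using for the
    filesystem named cephfs, one rank at a time:

    ceph mds repaired cephfs:0
    ceph mds repaired cephfs:1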


    Any attempt to run the journal export results in errors, like this
    one:


    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
    Error ((5) Input/output error)
    2018-07-11 17:01:30.631571 7f94354fff00 -1 Header 200.00000000 is unreadable
    2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not readable, attempt object-by-object dump with `rados`


    The same happens for recover_dentries:

    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
    Events by type:
    2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 200.00000000 is unreadable
    Errors: 0
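
    Since the tool itself suggests an object-by-object dump with rados,
    this is the kind of thing I was planning to try next (just a sketch,
    assuming the rank 0 journal objects are the ones named 200.*):

    # dump every rank 0 journal object individually, noting failures
    for obj in $(rados -p cephfs_metadata ls | grep '^200\.' | sort); do
        rados -p cephfs_metadata get "$obj" "dump.$obj" || echo "failed: $obj"
    done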

    Is there anything I can try to get the cluster back?

    I was able to dump the contents of the metadata pool with rados export
    -p cephfs_metadata <filename>, and I'm currently trying the procedure
    described in
    http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
    but I'm not sure it will work, as it's apparently doing nothing at the
    moment (maybe it's just very slow).
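
    To see whether that procedure is actually making progress rather than
    being stuck, I'm simply watching the object counts of the pools
    involved grow, e.g. (the recovery pool name below is only a
    placeholder for whatever pool the procedure writes into):

    # per-pool object counts, refreshed every few seconds
    watch -n 5 'ceph df detail'

    # or count the objects in the recovery metadata pool directly
    rados -p cephfs_recovery ls | wc -l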

    Any help is appreciated, thanks!


         Alessandro

