Re: [ceph-users] Urgent: Reduced data availability / All pgs inactive

Ranjan Ghosh Thu, 21 Feb 2019 08:56:51 -0800

Wow. Thank you so much Irek! Your help saved me from a lot of trouble...

It did indeed turn out to be a firewall issue: port 6800 wasn't open in one direction.
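
For anyone who hits the same symptom, here is a minimal sketch of how such a one-directional block could be spotted. It only attempts a plain TCP connect, and it has to be run from each node towards the others, because a rule that blocks only one direction is invisible from the other side. The helper names `port_reachable` and `unreachable_ports` are made up for this example. Keep in mind that Ceph daemons by default bind to ports in the 6800-7300 range, so a refused connection can also simply mean that no daemon is listening on that particular port.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


def unreachable_ports(host: str, ports, timeout: float = 2.0) -> list:
    """Return the subset of ports on host that could not be connected to."""
    return [p for p in ports if not port_reachable(host, p, timeout)]
```

On yak2, for example, something like `unreachable_ports("yak0.planwerk6.de", [6800])` would return `[6800]` if the connection is dropped or refused, and an empty list otherwise.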



On 21.02.19 at 07:05, Irek Fasikhov wrote:

Hi,

You have a problem with the MGR.
http://docs.ceph.com/docs/master/rados/operations/pg-states/

/The ceph-mgr hasn’t yet received any information about the PG’s state from an OSD since mgr started up./
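
To make the mismatch visible, a rough sketch like the following could compare the two views: the states listed by `ceph health detail` (which come from the mgr) against the state the primary OSD returns for `ceph pg <pgid> query`. The function names are invented for this example, the regular expression is written against the output format quoted further down in this thread, and the JSON is assumed to be the raw, unwrapped command output.

```python
import json
import re

# Matches entries such as:
#   pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
STUCK_RE = re.compile(
    r"pg (?P<pgid>\d+\.[0-9a-f]+) is stuck (?P<kind>\w+) for (?P<secs>[\d.]+), "
    r"current state (?P<state>\S+), last acting \[(?P<acting>[^\]]*)\]"
)


def mgr_view(health_detail: str) -> dict:
    """Map pgid -> state as listed in `ceph health detail` output."""
    # Mail clients wrap long lines, so collapse all whitespace before matching
    text = " ".join(health_detail.split())
    return {m["pgid"]: m["state"] for m in STUCK_RE.finditer(text)}


def osd_view(pg_query_json: str) -> tuple:
    """Return (state, acting set) from `ceph pg <pgid> query` JSON output."""
    doc = json.loads(pg_query_json)
    return doc["state"], doc["acting"]


def disagreement(pgid: str, health_detail: str, pg_query_json: str):
    """Describe the mismatch for one PG, or return None if the views agree."""
    reported = mgr_view(health_detail).get(pgid)
    state, acting = osd_view(pg_query_json)
    if reported is not None and reported != state:
        return (f"pg {pgid}: health detail says {reported!r}, "
                f"primary OSD says {state!r} (acting {acting})")
    return None
```

If the health output says "unknown" while the OSD itself answers "active+clean", as in this thread, the data is most likely fine and it is the reporting path from the OSDs to the active mgr that is broken.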

On Thu, 21 Feb 2019 at 09:04, Irek Fasikhov <[email protected]> wrote:


    Hi,

    You have a problem with the MGR.
    http://docs.ceph.com/docs/master/rados/operations/pg-states/
    /The ceph-mgr hasn’t yet received any information about the PG’s
    state from an OSD since mgr started up./


    On Wed, 20 Feb 2019 at 23:10, Ranjan Ghosh <[email protected]> wrote:

        Hi all,

        I hope someone can help me. After restarting one node of my
        2-node cluster, I suddenly get this:

        root@yak2 /var/www/projects # ceph -s
          cluster:
            id:     749b2473-9300-4535-97a6-ee6d55008a1b
            health: HEALTH_WARN
                    Reduced data availability: 200 pgs inactive

          services:
            mon: 3 daemons, quorum yak1,yak2,yak0
            mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, yak2.planwerk6.de
            mds: cephfs-1/1/1 up  {0=yak1.planwerk6.de=up:active}, 1 up:standby
            osd: 2 osds: 2 up, 2 in

          data:
            pools:   2 pools, 200 pgs
            objects: 0  objects, 0 B
            usage:   0 B used, 0 B / 0 B avail
            pgs:     100.000% pgs unknown
                     200 unknown

        And this:


        root@yak2 /var/www/projects # ceph health detail
        HEALTH_WARN Reduced data availability: 200 pgs inactive
        PG_AVAILABILITY Reduced data availability: 200 pgs inactive
            pg 1.34 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.35 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.36 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.37 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.38 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.39 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3a is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3b is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3c is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3d is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3e is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.3f is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.40 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.41 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.42 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.43 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.44 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.45 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.46 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.47 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.48 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.49 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.4a is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.4b is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.4c is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 1.4d is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.34 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.35 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.36 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.38 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.39 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3a is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3b is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3c is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3d is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3e is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.3f is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.40 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.41 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.42 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.43 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.44 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.45 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.46 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.47 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.48 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.49 is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.4a is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.4b is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.4e is stuck inactive for 3506.815664, current state
        unknown, last acting []
            pg 2.4f is stuck inactive for 3506.815664, current state
        unknown, last acting []

        But if I query an individual PG I get this:

        root@yak1 /var/www/projects # ceph pg 1.49 query
        {
            "state": "active+clean",
            "snap_trimq": "[]",
            "snap_trimq_len": 0,
            "epoch": 162,
            "up": [
                0,
                1
            ],
            "acting": [
                0,
                1
            ],
            "acting_recovery_backfill": [
                "0",
                "1"
            ],
            "info": {
                "pgid": "1.49",
                "last_update": "127'38077",
                "last_complete": "127'38077",
                "log_tail": "127'35000",
                "last_user_version": 38077,
                "last_backfill": "MAX",
                "last_backfill_bitwise": 0,
                "purged_snaps": [],
                "history": {
                    "epoch_created": 10,
                    "epoch_pool_created": 10,
                    "last_epoch_started": 159,
                    "last_interval_started": 158,
                    "last_epoch_clean": 159,
                    "last_interval_clean": 158,
                    "last_epoch_split": 0,
                    "last_epoch_marked_full": 0,
                    "same_up_since": 158,
                    "same_interval_since": 158,
                    "same_primary_since": 135,
                    "last_scrub": "127'36909",
                    "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                    "last_deep_scrub": "127'36714",
                    "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                    "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
                },
                "stats": {
                    "version": "127'38077",
                    "reported_seq": "58934",
                    "reported_epoch": "162",
                    "state": "active+clean",
                    "last_fresh": "2019-02-20 19:56:56.740536",
                    "last_change": "2019-02-20 19:52:27.063812",
                    "last_active": "2019-02-20 19:56:56.740536",
                    "last_peered": "2019-02-20 19:56:56.740536",
                    "last_clean": "2019-02-20 19:56:56.740536",
                    "last_became_active": "2019-02-20 19:52:27.062689",
                    "last_became_peered": "2019-02-20 19:52:27.062689",
                    "last_unstale": "2019-02-20 19:56:56.740536",
                    "last_undegraded": "2019-02-20 19:56:56.740536",
                    "last_fullsized": "2019-02-20 19:56:56.740536",
                    "mapping_epoch": 158,
                    "log_start": "127'35000",
                    "ondisk_log_start": "127'35000",
                    "created": 10,
                    "last_epoch_clean": 159,
                    "parent": "0.0",
                    "parent_split_bits": 0,
                    "last_scrub": "127'36909",
                    "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                    "last_deep_scrub": "127'36714",
                    "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                    "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
                    "log_size": 3077,
                    "ondisk_log_size": 3077,
                    "stats_invalid": false,
                    "dirty_stats_invalid": false,
                    "omap_stats_invalid": false,
                    "hitset_stats_invalid": false,
                    "hitset_bytes_stats_invalid": false,
                    "pin_stats_invalid": false,
                    "manifest_stats_invalid": true,
                    "snaptrimq_len": 0,
                    "stat_sum": {
                        "num_bytes": 478347970,
                        "num_objects": 12052,
                        "num_object_clones": 0,
                        "num_object_copies": 24104,
                        "num_objects_missing_on_primary": 0,
                        "num_objects_missing": 0,
                        "num_objects_degraded": 0,
                        "num_objects_misplaced": 0,
                        "num_objects_unfound": 0,
                        "num_objects_dirty": 12052,
                        "num_whiteouts": 0,
                        "num_read": 20186,
                        "num_read_kb": 1952018,
                        "num_write": 38927,
                        "num_write_kb": 484756,
                        "num_scrub_errors": 0,
                        "num_shallow_scrub_errors": 0,
                        "num_deep_scrub_errors": 0,
                        "num_objects_recovered": 6,
                        "num_bytes_recovered": 4101,
                        "num_keys_recovered": 0,
                        "num_objects_omap": 0,
                        "num_objects_hit_set_archive": 0,
                        "num_bytes_hit_set_archive": 0,
                        "num_flush": 0,
                        "num_flush_kb": 0,
                        "num_evict": 0,
                        "num_evict_kb": 0,
                        "num_promote": 0,
                        "num_flush_mode_high": 0,
                        "num_flush_mode_low": 0,
                        "num_evict_mode_some": 0,
                        "num_evict_mode_full": 0,
                        "num_objects_pinned": 0,
                        "num_legacy_snapsets": 0,
                        "num_large_omap_objects": 0,
                        "num_objects_manifest": 0
                    },
                    "up": [
                        0,
                        1
                    ],
                    "acting": [
                        0,
                        1
                    ],
                    "blocked_by": [],
                    "up_primary": 0,
                    "acting_primary": 0,
                    "purged_snaps": []
                },
                "empty": 0,
                "dne": 0,
                "incomplete": 0,
                "last_epoch_started": 159,
                "hit_set_history": {
                    "current_last_update": "0'0",
                    "history": []
                }
            },
            "peer_info": [
                {
                    "peer": "1",
                    "pgid": "1.49",
                    "last_update": "127'38077",
                    "last_complete": "127'38077",
                    "log_tail": "127'35000",
                    "last_user_version": 38077,
                    "last_backfill": "MAX",
                    "last_backfill_bitwise": 0,
                    "purged_snaps": [],
                    "history": {
                        "epoch_created": 10,
                        "epoch_pool_created": 10,
                        "last_epoch_started": 159,
                        "last_interval_started": 158,
                        "last_epoch_clean": 159,
                        "last_interval_clean": 158,
                        "last_epoch_split": 0,
                        "last_epoch_marked_full": 0,
                        "same_up_since": 158,
                        "same_interval_since": 158,
                        "same_primary_since": 135,
                        "last_scrub": "127'36909",
                        "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                        "last_deep_scrub": "127'36714",
                        "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                        "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
                    },
                    "stats": {
                        "version": "127'38077",
                        "reported_seq": "58745",
                        "reported_epoch": "134",
                        "state": "active+undersized+degraded",
                        "last_fresh": "2019-02-20 19:06:19.180016",
                        "last_change": "2019-02-20 19:04:39.483332",
                        "last_active": "2019-02-20 19:06:19.180016",
                        "last_peered": "2019-02-20 19:06:19.180016",
                        "last_clean": "2019-02-20 18:23:33.675145",
                        "last_became_active": "2019-02-20 19:04:39.483332",
                        "last_became_peered": "2019-02-20 19:04:39.483332",
                        "last_unstale": "2019-02-20 19:06:19.180016",
                        "last_undegraded": "2019-02-20 19:04:39.477829",
                        "last_fullsized": "2019-02-20 19:04:39.477717",
                        "mapping_epoch": 158,
                        "log_start": "127'35000",
                        "ondisk_log_start": "127'35000",
                        "created": 10,
                        "last_epoch_clean": 124,
                        "parent": "0.0",
                        "parent_split_bits": 0,
                        "last_scrub": "127'36909",
                        "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                        "last_deep_scrub": "127'36714",
                        "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                        "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
                        "log_size": 3077,
                        "ondisk_log_size": 3077,
                        "stats_invalid": false,
                        "dirty_stats_invalid": false,
                        "omap_stats_invalid": false,
                        "hitset_stats_invalid": false,
                        "hitset_bytes_stats_invalid": false,
                        "pin_stats_invalid": false,
                        "manifest_stats_invalid": true,
                        "snaptrimq_len": 0,
                        "stat_sum": {
                            "num_bytes": 478347970,
                            "num_objects": 12052,
                            "num_object_clones": 0,
                            "num_object_copies": 24104,
                            "num_objects_missing_on_primary": 0,
                            "num_objects_missing": 0,
                            "num_objects_degraded": 12052,
                            "num_objects_misplaced": 0,
                            "num_objects_unfound": 0,
                            "num_objects_dirty": 12052,
                            "num_whiteouts": 0,
                            "num_read": 20186,
                            "num_read_kb": 1952018,
                            "num_write": 38927,
                            "num_write_kb": 484756,
                            "num_scrub_errors": 0,
                            "num_shallow_scrub_errors": 0,
                            "num_deep_scrub_errors": 0,
                            "num_objects_recovered": 6,
                            "num_bytes_recovered": 4101,
                            "num_keys_recovered": 0,
                            "num_objects_omap": 0,
                            "num_objects_hit_set_archive": 0,
                            "num_bytes_hit_set_archive": 0,
                            "num_flush": 0,
                            "num_flush_kb": 0,
                            "num_evict": 0,
                            "num_evict_kb": 0,
                            "num_promote": 0,
                            "num_flush_mode_high": 0,
                            "num_flush_mode_low": 0,
                            "num_evict_mode_some": 0,
                            "num_evict_mode_full": 0,
                            "num_objects_pinned": 0,
                            "num_legacy_snapsets": 0,
                            "num_large_omap_objects": 0,
                            "num_objects_manifest": 0
                        },
                        "up": [
                            0,
                            1
                        ],
                        "acting": [
                            0,
                            1
                        ],
                        "blocked_by": [],
                        "up_primary": 0,
                        "acting_primary": 0,
                        "purged_snaps": []
                    },
                    "empty": 0,
                    "dne": 0,
                    "incomplete": 0,
                    "last_epoch_started": 159,
                    "hit_set_history": {
                        "current_last_update": "0'0",
                        "history": []
                    }
                }
            ],
            "recovery_state": [
                {
                    "name": "Started/Primary/Active",
                    "enter_time": "2019-02-20 19:52:27.027151",
                    "might_have_unfound": [],
                    "recovery_progress": {
                        "backfill_targets": [],
                        "waiting_on_backfill": [],
                        "last_backfill_started": "MIN",
                        "backfill_info": {
                            "begin": "MIN",
                            "end": "MIN",
                            "objects": []
                        },
                        "peer_backfill_info": [],
                        "backfills_in_flight": [],
                        "recovering": [],
                        "pg_backend": {
                            "pull_from_peer": [],
                            "pushing": []
                        }
                    },
                    "scrub": {
                        "scrubber.epoch_start": "0",
                        "scrubber.active": false,
                        "scrubber.state": "INACTIVE",
                        "scrubber.start": "MIN",
                        "scrubber.end": "MIN",
                        "scrubber.max_end": "MIN",
                        "scrubber.subset_last_update": "0'0",
                        "scrubber.deep": false,
                        "scrubber.waiting_on_whom": []
                    }
                },
                {
                    "name": "Started",
                    "enter_time": "2019-02-20 19:52:25.976144"
                }
            ],
            "agent_state": {}
        }

        I wonder what it all means and how to get out of this
        situation. The cluster seems to work normally. But it's quite
        disconcerting as you can probably imagine. Could it be a
        firewall issue? I'm not aware of any changes and I don't see
        any peering problems...

        Thank you

        Ranjan







        _______________________________________________
        ceph-users mailing list
        [email protected]
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

