Hi,

You have a problem with the MGR, not with the PGs themselves. See
http://docs.ceph.com/docs/master/rados/operations/pg-states/ on the
"unknown" state: *The ceph-mgr hasn’t yet received any information about
the PG’s state from an OSD since mgr started up.*
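If the active mgr is the problem, failing it over to a standby (or restarting its daemon) and then re-checking `ceph -s` usually clears the bogus "unknown" state. A minimal sketch, assuming systemd-managed daemons; the mgr name and unit name below are taken from your `ceph -s` output and may need adjusting to your deployment:

```shell
# Fail over the active mgr so a standby takes over (non-destructive;
# the failed mgr respawns and becomes a standby itself):
ceph mgr fail yak0.planwerk6.de

# Alternatively, restart the mgr daemon on its host (unit name is an
# assumption -- check with: systemctl list-units 'ceph-mgr*'):
systemctl restart ceph-mgr@yak0.planwerk6.de

# Give it a minute, then re-check; PGs should report real states again:
ceph -s
ceph health detail
```

Since the PGs themselves answer `ceph pg ... query` with active+clean, only the mgr's view is stale, so this is safe to try first.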
On Thu, 21 Feb 2019 at 09:04, Irek Fasikhov <[email protected]> wrote:

> Hi,
>
> You have problems with MGR.
> http://docs.ceph.com/docs/master/rados/operations/pg-states/
> *The ceph-mgr hasn’t yet received any information about the PG’s state
> from an OSD since mgr started up.*
>
>
> On Wed, 20 Feb 2019 at 23:10, Ranjan Ghosh <[email protected]> wrote:
>
>> Hi all,
>>
>> hope someone can help me. After restarting a node of my 2-node cluster
>> I suddenly get this:
>>
>> root@yak2 /var/www/projects # ceph -s
>>   cluster:
>>     id:     749b2473-9300-4535-97a6-ee6d55008a1b
>>     health: HEALTH_WARN
>>             Reduced data availability: 200 pgs inactive
>>
>>   services:
>>     mon: 3 daemons, quorum yak1,yak2,yak0
>>     mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de,
>> yak2.planwerk6.de
>>     mds: cephfs-1/1/1 up {0=yak1.planwerk6.de=up:active}, 1 up:standby
>>     osd: 2 osds: 2 up, 2 in
>>
>>   data:
>>     pools:   2 pools, 200 pgs
>>     objects: 0 objects, 0 B
>>     usage:   0 B used, 0 B / 0 B avail
>>     pgs:     100.000% pgs unknown
>>              200 unknown
>>
>> And this:
>>
>> root@yak2 /var/www/projects # ceph health detail
>> HEALTH_WARN Reduced data availability: 200 pgs inactive
>> PG_AVAILABILITY Reduced data availability: 200 pgs inactive
>>     pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.35 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.36 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.37 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.38 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.39 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3c is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3d is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3e is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.3f is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.40 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.41 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.42 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.43 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.44 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.45 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.46 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.47 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.48 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4c is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 1.4d is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.34 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.35 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.36 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.38 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.39 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3c is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3d is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3e is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.3f is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.40 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.41 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.42 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.43 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.44 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.45 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.46 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.47 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.48 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.49 is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.4a is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.4b is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.4e is stuck inactive for 3506.815664, current state unknown, last acting []
>>     pg 2.4f is stuck inactive for 3506.815664, current state unknown, last acting []
>>
>> But if I query an individual PG I get this:
>>
>> root@yak1 /var/www/projects # ceph pg 1.49 query
>> {
>>     "state": "active+clean",
>>     "snap_trimq": "[]",
>>     "snap_trimq_len": 0,
>>     "epoch": 162,
>>     "up": [
>>         0,
>>         1
>>     ],
>>     "acting": [
>>         0,
>>         1
>>     ],
>>     "acting_recovery_backfill": [
>>         "0",
>>         "1"
>>     ],
>>     "info": {
>>         "pgid": "1.49",
>>         "last_update": "127'38077",
>>         "last_complete": "127'38077",
>>         "log_tail": "127'35000",
>>         "last_user_version": 38077,
>>         "last_backfill": "MAX",
>>         "last_backfill_bitwise": 0,
>>         "purged_snaps": [],
>>         "history": {
>>             "epoch_created": 10,
>>             "epoch_pool_created": 10,
>>             "last_epoch_started": 159,
>>             "last_interval_started": 158,
>>             "last_epoch_clean": 159,
>>             "last_interval_clean": 158,
>>             "last_epoch_split": 0,
>>             "last_epoch_marked_full": 0,
>>             "same_up_since": 158,
>>             "same_interval_since": 158,
>>             "same_primary_since": 135,
>>             "last_scrub": "127'36909",
>>             "last_scrub_stamp": "2019-02-20 15:02:45.204342",
>>             "last_deep_scrub": "127'36714",
>>             "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
>>             "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
>>         },
>>         "stats": {
>>             "version": "127'38077",
>>             "reported_seq": "58934",
>>             "reported_epoch": "162",
>>             "state": "active+clean",
>>             "last_fresh": "2019-02-20 19:56:56.740536",
>>             "last_change": "2019-02-20 19:52:27.063812",
>>             "last_active": "2019-02-20 19:56:56.740536",
>>             "last_peered": "2019-02-20 19:56:56.740536",
>>             "last_clean": "2019-02-20 19:56:56.740536",
>>             "last_became_active": "2019-02-20 19:52:27.062689",
>>             "last_became_peered": "2019-02-20 19:52:27.062689",
>>             "last_unstale": "2019-02-20 19:56:56.740536",
>>             "last_undegraded": "2019-02-20 19:56:56.740536",
>>             "last_fullsized": "2019-02-20 19:56:56.740536",
>>             "mapping_epoch": 158,
>>             "log_start": "127'35000",
>>             "ondisk_log_start": "127'35000",
>>             "created": 10,
>>             "last_epoch_clean": 159,
>>             "parent": "0.0",
>>             "parent_split_bits": 0,
>>             "last_scrub": "127'36909",
>>             "last_scrub_stamp": "2019-02-20 15:02:45.204342",
>>             "last_deep_scrub": "127'36714",
>>             "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
>>             "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
>>             "log_size": 3077,
>>             "ondisk_log_size": 3077,
>>             "stats_invalid": false,
>>             "dirty_stats_invalid": false,
>>             "omap_stats_invalid": false,
>>             "hitset_stats_invalid": false,
>>             "hitset_bytes_stats_invalid": false,
>>             "pin_stats_invalid": false,
>>             "manifest_stats_invalid": true,
>>             "snaptrimq_len": 0,
>>             "stat_sum": {
>>                 "num_bytes": 478347970,
>>                 "num_objects": 12052,
>>                 "num_object_clones": 0,
>>                 "num_object_copies": 24104,
>>                 "num_objects_missing_on_primary": 0,
>>                 "num_objects_missing": 0,
>>                 "num_objects_degraded": 0,
>>                 "num_objects_misplaced": 0,
>>                 "num_objects_unfound": 0,
>>                 "num_objects_dirty": 12052,
>>                 "num_whiteouts": 0,
>>                 "num_read": 20186,
>>                 "num_read_kb": 1952018,
>>                 "num_write": 38927,
>>                 "num_write_kb": 484756,
>>                 "num_scrub_errors": 0,
>>                 "num_shallow_scrub_errors": 0,
>>                 "num_deep_scrub_errors": 0,
>>                 "num_objects_recovered": 6,
>>                 "num_bytes_recovered": 4101,
>>                 "num_keys_recovered": 0,
>>                 "num_objects_omap": 0,
>>                 "num_objects_hit_set_archive": 0,
>>                 "num_bytes_hit_set_archive": 0,
>>                 "num_flush": 0,
>>                 "num_flush_kb": 0,
>>                 "num_evict": 0,
>>                 "num_evict_kb": 0,
>>                 "num_promote": 0,
>>                 "num_flush_mode_high": 0,
>>                 "num_flush_mode_low": 0,
>>                 "num_evict_mode_some": 0,
>>                 "num_evict_mode_full": 0,
>>                 "num_objects_pinned": 0,
>>                 "num_legacy_snapsets": 0,
>>                 "num_large_omap_objects": 0,
>>                 "num_objects_manifest": 0
>>             },
>>             "up": [
>>                 0,
>>                 1
>>             ],
>>             "acting": [
>>                 0,
>>                 1
>>             ],
>>             "blocked_by": [],
>>             "up_primary": 0,
>>             "acting_primary": 0,
>>             "purged_snaps": []
>>         },
>>         "empty": 0,
>>         "dne": 0,
>>         "incomplete": 0,
>>         "last_epoch_started": 159,
>>         "hit_set_history": {
>>             "current_last_update": "0'0",
>>             "history": []
>>         }
>>     },
>>     "peer_info": [
>>         {
>>             "peer": "1",
>>             "pgid": "1.49",
>>             "last_update": "127'38077",
>>             "last_complete": "127'38077",
>>             "log_tail": "127'35000",
>>             "last_user_version": 38077,
>>             "last_backfill": "MAX",
>>             "last_backfill_bitwise": 0,
>>             "purged_snaps": [],
>>             "history": {
>>                 "epoch_created": 10,
>>                 "epoch_pool_created": 10,
>>                 "last_epoch_started": 159,
>>                 "last_interval_started": 158,
>>                 "last_epoch_clean": 159,
>>                 "last_interval_clean": 158,
>>                 "last_epoch_split": 0,
>>                 "last_epoch_marked_full": 0,
>>                 "same_up_since": 158,
>>                 "same_interval_since": 158,
>>                 "same_primary_since": 135,
>>                 "last_scrub": "127'36909",
>>                 "last_scrub_stamp": "2019-02-20 15:02:45.204342",
>>                 "last_deep_scrub": "127'36714",
>>                 "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
>>                 "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
>>             },
>>             "stats": {
>>                 "version": "127'38077",
>>                 "reported_seq": "58745",
>>                 "reported_epoch": "134",
>>                 "state": "active+undersized+degraded",
>>                 "last_fresh": "2019-02-20 19:06:19.180016",
>>                 "last_change": "2019-02-20 19:04:39.483332",
>>                 "last_active": "2019-02-20 19:06:19.180016",
>>                 "last_peered": "2019-02-20 19:06:19.180016",
>>                 "last_clean": "2019-02-20 18:23:33.675145",
>>                 "last_became_active": "2019-02-20 19:04:39.483332",
>>                 "last_became_peered": "2019-02-20 19:04:39.483332",
>>                 "last_unstale": "2019-02-20 19:06:19.180016",
>>                 "last_undegraded": "2019-02-20 19:04:39.477829",
>>                 "last_fullsized": "2019-02-20 19:04:39.477717",
>>                 "mapping_epoch": 158,
>>                 "log_start": "127'35000",
>>                 "ondisk_log_start": "127'35000",
>>                 "created": 10,
>>                 "last_epoch_clean": 124,
>>                 "parent": "0.0",
>>                 "parent_split_bits": 0,
>>                 "last_scrub": "127'36909",
>>                 "last_scrub_stamp": "2019-02-20 15:02:45.204342",
>>                 "last_deep_scrub": "127'36714",
>>                 "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
>>                 "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
>>                 "log_size": 3077,
>>                 "ondisk_log_size": 3077,
>>                 "stats_invalid": false,
>>                 "dirty_stats_invalid": false,
>>                 "omap_stats_invalid": false,
>>                 "hitset_stats_invalid": false,
>>                 "hitset_bytes_stats_invalid": false,
>>                 "pin_stats_invalid": false,
>>                 "manifest_stats_invalid": true,
>>                 "snaptrimq_len": 0,
>>                 "stat_sum": {
>>                     "num_bytes": 478347970,
>>                     "num_objects": 12052,
>>                     "num_object_clones": 0,
>>                     "num_object_copies": 24104,
>>                     "num_objects_missing_on_primary": 0,
>>                     "num_objects_missing": 0,
>>                     "num_objects_degraded": 12052,
>>                     "num_objects_misplaced": 0,
>>                     "num_objects_unfound": 0,
>>                     "num_objects_dirty": 12052,
>>                     "num_whiteouts": 0,
>>                     "num_read": 20186,
>>                     "num_read_kb": 1952018,
>>                     "num_write": 38927,
>>                     "num_write_kb": 484756,
>>                     "num_scrub_errors": 0,
>>                     "num_shallow_scrub_errors": 0,
>>                     "num_deep_scrub_errors": 0,
>>                     "num_objects_recovered": 6,
>>                     "num_bytes_recovered": 4101,
>>                     "num_keys_recovered": 0,
>>                     "num_objects_omap": 0,
>>                     "num_objects_hit_set_archive": 0,
>>                     "num_bytes_hit_set_archive": 0,
>>                     "num_flush": 0,
>>                     "num_flush_kb": 0,
>>                     "num_evict": 0,
>>                     "num_evict_kb": 0,
>>                     "num_promote": 0,
>>                     "num_flush_mode_high": 0,
>>                     "num_flush_mode_low": 0,
>>                     "num_evict_mode_some": 0,
>>                     "num_evict_mode_full": 0,
>>                     "num_objects_pinned": 0,
>>                     "num_legacy_snapsets": 0,
>>                     "num_large_omap_objects": 0,
>>                     "num_objects_manifest": 0
>>                 },
>>                 "up": [
>>                     0,
>>                     1
>>                 ],
>>                 "acting": [
>>                     0,
>>                     1
>>                 ],
>>                 "blocked_by": [],
>>                 "up_primary": 0,
>>                 "acting_primary": 0,
>>                 "purged_snaps": []
>>             },
>>             "empty": 0,
>>             "dne": 0,
>>             "incomplete": 0,
>>             "last_epoch_started": 159,
>>             "hit_set_history": {
>>                 "current_last_update": "0'0",
>>                 "history": []
>>             }
>>         }
>>     ],
>>     "recovery_state": [
>>         {
>>             "name": "Started/Primary/Active",
>>             "enter_time": "2019-02-20 19:52:27.027151",
>>             "might_have_unfound": [],
>>             "recovery_progress": {
>>                 "backfill_targets": [],
>>                 "waiting_on_backfill": [],
>>                 "last_backfill_started": "MIN",
>>                 "backfill_info": {
>>                     "begin": "MIN",
>>                     "end": "MIN",
>>                     "objects": []
>>                 },
>>                 "peer_backfill_info": [],
>>                 "backfills_in_flight": [],
>>                 "recovering": [],
>>                 "pg_backend": {
>>                     "pull_from_peer": [],
>>                     "pushing": []
>>                 }
>>             },
>>             "scrub": {
>>                 "scrubber.epoch_start": "0",
>>                 "scrubber.active": false,
>>                 "scrubber.state": "INACTIVE",
>>                 "scrubber.start": "MIN",
>>                 "scrubber.end": "MIN",
>>                 "scrubber.max_end": "MIN",
>>                 "scrubber.subset_last_update": "0'0",
>>                 "scrubber.deep": false,
>>                 "scrubber.waiting_on_whom": []
>>             }
>>         },
>>         {
>>             "name": "Started",
>>             "enter_time": "2019-02-20 19:52:25.976144"
>>         }
>>     ],
>>     "agent_state": {}
>> }
>>
>> I wonder what it all means and how to get out of this situation. The
>> cluster seems to work normally. But it's quite disconcerting, as you can
>> probably imagine. Could it be a firewall issue? I'm not aware of any
>> changes and I don't see any peering problems...
>>
>> Thank you
>>
>> Ranjan
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
