Fellow Cephers,
I'm scratching my head on this one. Somehow a bunch of objects were lost in
my cluster, which is currently running ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e).
The symptoms are that "ceph -s" reports a bunch of inconsistent PGs:
cluster 8a2c9e43-9f17-42e0-92fd-88a40152303d
health HEALTH_ERR 13 pgs inconsistent; 123 scrub errors; mds0: Client
sabnzbd:storage failing to respond to cache pressure; noout flag(s) set
monmap e9: 3 mons at {guinan=
10.42.6.48:6789/0,tuvok=10.42.6.33:6789/0,yar=10.42.6.43:6789/0}, election
epoch 1252, quorum 0,1,2 tuvok,yar,guinan
mdsmap e698: 1/1/1 up {0=pulaski=up:active}
osdmap e41375: 29 osds: 29 up, 29 in
flags noout
pgmap v22573849: 1088 pgs, 3 pools, 32175 GB data, 9529 kobjects
96663 GB used, 41779 GB / 135 TB avail
1072 active+clean
3 active+clean+scrubbing+deep
13 active+clean+inconsistent
client io 1004 kB/s rd, 2 op/s
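In case it helps with suggestions: I pulled the list of the 13 inconsistent
PGs with "ceph health detail", which prints one line per inconsistent PG
along with its acting OSDs:

# list the inconsistent PGs and the OSDs acting for them
ceph health detail | grep inconsistent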
I say the objects were "lost" because, when grepping the logs of the OSDs
holding the affected PGs, I see lines like:
2015-05-10 06:27:34.720648 7f2df27fc700 0
filestore(/var/lib/ceph/osd/ceph-11) write couldn't open
0.176_head/adb9ff76/10006ecde46.00000000/head//0: (61) No data available
2015-05-10 15:44:34.723479 7f2df2ffd700 -1
filestore(/var/lib/ceph/osd/ceph-11) error creating
9be4ff76/10006ee7848.00000000/head//0
(/var/lib/ceph/osd/ceph-11/current/0.176_head/DIR_6/DIR_7/DIR_F/DIR_F/10006ee7848.00000000__head_9BE4FF76__0)
in index: (61) No data available
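If I'm reading those right, (61) is ENODATA, which for filestore I believe
points at missing xattrs on the backing file (filestore keeps object
metadata in user.ceph.* xattrs). One way to check, using the on-disk path
from the log line above:

# dump all xattrs on the backing file; a healthy object should
# show user.ceph._ among them
getfattr -d -m '-' /var/lib/ceph/osd/ceph-11/current/0.176_head/DIR_6/DIR_7/DIR_F/DIR_F/10006ee7848.00000000__head_9BE4FF76__0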
All the affected PGs are in pool 0, which is the data pool for CephFS. Pool
0 is replicated with size 3, min_size 2, per "ceph osd dump | head -n 9":
epoch 41375
fsid 8a2c9e43-9f17-42e0-92fd-88a40152303d
created 2014-04-06 21:16:19.449590
modified 2015-05-10 13:57:21.376468
flags noout
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 17399 flags hashpspool
crash_replay_interval 45 min_read_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 4 min_size 3 crush_ruleset 0 object_hash
rjenkins pg_num 512 pgp_num 512 last_change 18915 flags hashpspool
min_read_recency_for_promote 1 stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool
min_read_recency_for_promote 1 stripe_width 0
max_osd 29
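For reference, I've been mapping each affected PG to the OSDs holding it
with "ceph pg map", e.g. for the PG from the log excerpt above:

# show the up and acting OSD sets for PG 0.176
ceph pg map 0.176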
I'm a bit fuzzy on the timeline of when the missing objects started
appearing. It's a tad alarming, and I'd appreciate any pointers toward
getting a better understanding of the situation.
To make matters worse, I'm running CephFS, and a lot of the missing objects
are stripe 0 of a file, which leaves me with no idea how to find out which
file was affected so I can delete it and restore it from backups. Pointers
here would be useful as well. (My current method for mapping an object to a
CephFS file is to read the xattrs on the 0th stripe object and pick out the
strings, which obviously doesn't work when the 0th object is the one that's
missing.)
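The one workaround I can think of (assuming I understand the object naming
correctly): CephFS data objects appear to be named <inode in hex>.<stripe
index>, so the inode number survives in the object name from the log output
even when the object itself is gone, and a find-by-inode on the mounted
filesystem should turn up the path. A rough sketch, with my CephFS mounted
at /mnt/cephfs (adjust for your mountpoint):

# object 10006ecde46.00000000 -> inode 0x10006ecde46
ino_hex=10006ecde46
# find -inum wants the inode number in decimal
find /mnt/cephfs -inum $(( 16#$ino_hex ))

It's slow on a tree this size, but it beats having no mapping at all.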
Thanks in advance for any suggestions/pointers!
--
Aaron Ten Clay
http://www.aarontc.com/