Hi Greg + list,

Sorry to reply to this old-ish thread, but today one of these PGs bit
us in the ass.

Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171,
and 69 all crash when they try to delete PG 36.10d, each failing with

   ENOTEMPTY suggests garbage data in osd data dir

(full log below). There is indeed some "garbage" in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9
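
Would it help to dump what the OSD itself thinks is in that PG with
ceph-objectstore-tool? I haven't run it yet, but I assume (please
correct me if the syntax is off) the invocation on a stopped OSD would
be roughly:

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 36.10d --op list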


Do you have any suggestions for how to get these OSDs running again? We
already tried manually moving 36.10d_head to 36.10d_head.bak (rough
commands below), but then the OSD crashes for a different reason:

    -1> 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
36.10d coll 36.10d_head
     0> 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
15:07:42.442902
osd/PG.cc: 2839: FAILED assert(r > 0)
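
For the record, the manual move was roughly the following, with the OSD
stopped first (the exact stop command depends on the init setup, so
treat this as a sketch):

# service ceph stop osd.69
# cd /var/lib/ceph/osd/ceph-69/current
# mv 36.10d_head 36.10d_head.bak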


Any clues?

Cheers, Dan

2015-07-17 14:40:54.493935 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
{
   "ops": [
       {
           "op_num": 0,
           "op_name": "remove",
           "collection": "36.10d_head",
           "oid": "10d\/\/head\/\/36"
       },
       {
           "op_num": 1,
           "op_name": "rmcoll",
           "collection": "36.10d_head"
       }
   ]
}

2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
std::allocator<ObjectStore::Transaction*> >&, unsigned long,
ThreadPool::TPHandle*)+0x64) [0x97d794]
3: (FileStore::_do_op(FileStore::OpSequencer*,
ThreadPool::TPHandle&)+0x2a0) [0x97da50]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
6: /lib64/libpthread.so.0() [0x3fbec079d1]
7: (clone()+0x6d) [0x3fbe8e88fd]

On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster <d...@vanderster.com> wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum <g...@gregs42.com> wrote:
>> On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster <d...@vanderster.com> 
>> wrote:
>>> Hi,
>>>
>>> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
>>> PGs go inconsistent.
>>>
>>> First, immediately after we updated the OSDs, PG 34.10d went inconsistent:
>>>
>>> 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
>>> cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
>>> 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
>>> bytes,0/0 hit_set_archive bytes.
>>>
>>> Second, an hour later 55.10d went inconsistent:
>>>
>>> 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
>>> cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
>>> clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
>>> bytes,0/0 hit_set_archive bytes.
>>>
>>> Then last night 36.10d suffered the same fate:
>>>
>>> 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
>>> cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
>>> 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
>>> whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.
>>>
>>>
>>> In all cases, one object is missing. In all cases, the PG id is 10d.
>>> Is this an epic coincidence, or could something else be going on here?
>>
>> I'm betting on something else. What OSDs is each PG mapped to?
>> It looks like each of them is missing one object on some of the OSDs,
>> what are the objects?
>
> 34.10d: [52,202,218]
> 55.10d: [303,231,65]
> 36.10d: [30,171,69]
>
> So no common OSDs. I've already repaired all of these PGs, and logs
> have nothing interesting, so I can't say more about the objects.
>
> Cheers, Dan
