Re: [ceph-users] 10d
I just filed a ticket after trying ceph-objectstore-tool: http://tracker.ceph.com/issues/12428

On Fri, Jul 17, 2015 at 3:36 PM, Dan van der Ster d...@vanderster.com wrote:
> A bit of progress: rm'ing everything from inside current/36.10d_head/ actually let the OSD start and continue deleting other PGs.
>
> Cheers, Dan
Re: [ceph-users] 10d
A bit of progress: rm'ing everything from inside current/36.10d_head/ actually let the OSD start and continue deleting other PGs.

Cheers, Dan

On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster d...@vanderster.com wrote:
> Thanks for the quick reply. We /could/ just wipe these OSDs and start from scratch (the only other pools were 4+2 ec and recovery already brought us to 100% active+clean). But it'd be good to understand and prevent this kind of crash...
>
> Cheers, Dan
Re: [ceph-users] 10d
Hi Greg + list,

Sorry to reply to this old'ish thread, but today one of these PGs bit us in the ass. Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with "ENOTEMPTY suggests garbage data in osd data dir" (full log below). There is indeed some garbage in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9

Do you have any suggestion how to get these OSDs back running? We already tried manually moving 36.10d_head to 36.10d_head.bak but then the OSD crashes for a different reason:

    -1> 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid 36.10d coll 36.10d_head
     0> 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17 15:07:42.442902
osd/PG.cc: 2839: FAILED assert(r > 0)

Any clues?

Cheers, Dan

2015-07-17 14:40:54.493935 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) error (39) Directory not empty not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700 0 filestore(/var/lib/ceph/osd/ceph-30) transaction dump:
{ "ops": [
    { "op_num": 0, "op_name": "remove", "collection": "36.10d_head", "oid": "10d\/\/head\/\/36" },
    { "op_num": 1, "op_name": "rmcoll", "collection": "36.10d_head" } ] }
2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17 14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x97d794]
3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2a0) [0x97da50]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
6: /lib64/libpthread.so.0() [0x3fbec079d1]
7: (clone()+0x6d) [0x3fbe8e88fd]

On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com wrote:
> 34.10d: [52,202,218]
> 55.10d: [303,231,65]
> 36.10d: [30,171,69]
>
> So no common OSDs. I've already repaired all of these PGs, and logs have nothing interesting, so I can't say more about the objects.
>
> Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
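For readers following along, the leftover path in the find output above can be decoded with a few lines of Python. This is only a sketch under assumptions that are not stated in the thread: that FileStore file names follow the pattern name, "__", snapshot, "_", hash, "__", pool id in hex, and that each DIR_x level corresponds to one hex digit of the object hash, lowest digit first. The helper names parse_leftover and hash_dirs are made up for illustration.

```python
def parse_leftover(filename):
    """Split an (assumed) FileStore file name into its components."""
    name, snap_and_hash, pool_hex = filename.split("__")
    snap, hash_hex = snap_and_hash.split("_")
    return {
        "name": name,               # object name; empty for this file
        "snap": snap,               # "head" for the live object
        "hash": hash_hex,           # 32-bit object hash in hex
        "pool": int(pool_hex, 16),  # pool id, printed in hex in the file name
    }


def hash_dirs(hash_hex, depth):
    """Return DIR_<digit> components for the last `depth` hex digits, lowest first."""
    return ["DIR_" + digit for digit in reversed(hash_hex.upper()[-depth:])]


info = parse_leftover("__head_BD49D10D__24")
print(info)
print("/".join(hash_dirs(info["hash"], 3)))
print(hex(int(info["hash"], 16) & 0xFFF))
```

Under those assumptions the file decodes to pool 36 (0x24), the directory levels come out as DIR_D/DIR_0/DIR_1, matching the find output, and the low 12 bits of the hash are 0x10d, the same value as the PG id. The name component parses as empty, which may or may not be significant.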
Re: [ceph-users] 10d
I think you'll need to use the ceph-objectstore-tool to remove the PG/data consistently, but I've not done this; David or Sam will need to chime in.
-Greg

On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com wrote:
> Hi Greg + list,
>
> Sorry to reply to this old'ish thread, but today one of these PGs bit us in the ass. Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They all crash with "ENOTEMPTY suggests garbage data in osd data dir" (full log below).
> [...]
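As a rough illustration of the suggestion above, the following Python sketch only builds and prints the ceph-objectstore-tool command line for each affected OSD; it does not run anything. The data and journal paths are assumptions based on the default layout seen in the logs (/var/lib/ceph/osd/ceph-30), the helper name remove_pg_command is hypothetical, and the OSD would need to be stopped before running the tool. Note that the thread reports a ticket being filed after trying the tool, so this may well not succeed on a PG in this state.

```python
import shlex


def remove_pg_command(osd_id, pgid, osd_root="/var/lib/ceph/osd"):
    """Build (but do not run) a ceph-objectstore-tool command to remove one PG."""
    data_path = f"{osd_root}/ceph-{osd_id}"  # assumed default OSD data dir
    return [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--journal-path", f"{data_path}/journal",  # assumed journal location
        "--pgid", pgid,
        "--op", "remove",
    ]


# Dry run: print the command for each OSD named in the thread.
for osd_id in (30, 171, 69):
    print(shlex.join(remove_pg_command(osd_id, "36.10d")))
```

Printing rather than executing keeps the sketch safe to try; whether removal through the tool behaves better than a manual rm is exactly the open question in this thread.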
Re: [ceph-users] 10d
Thanks for the quick reply. We /could/ just wipe these OSDs and start from scratch (the only other pools were 4+2 ec and recovery already brought us to 100% active+clean). But it'd be good to understand and prevent this kind of crash...

Cheers, Dan

On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum g...@gregs42.com wrote:
> I think you'll need to use the ceph-objectstore-tool to remove the PG/data consistently, but I've not done this; David or Sam will need to chime in.
> -Greg
Re: [ceph-users] 10d
On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com wrote:
> Hi,
>
> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3 PGs go inconsistent.
>
> First, immediately after we updated the OSDs, PG 34.10d went inconsistent:
>
> 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 : cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes,0/0 hit_set_archive bytes.
>
> Second, an hour later 55.10d went inconsistent:
>
> 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 : cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
>
> Then last night 36.10d suffered the same fate:
>
> 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 : cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.
>
> In all cases, one object is missing. In all cases, the PG id is 10d. Is this an epic coincidence or could something else be going on here?

I'm betting on something else. What OSDs is each PG mapped to? It looks like each of them is missing one object on some of the OSDs; what are the objects?
-Greg
Re: [ceph-users] 10d
On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
> I'm betting on something else. What OSDs is each PG mapped to?
> It looks like each of them is missing one object on some of the OSDs, what are the objects?

34.10d: [52,202,218]
55.10d: [303,231,65]
36.10d: [30,171,69]

So no common OSDs. I've already repaired all of these PGs, and logs have nothing interesting, so I can't say more about the objects.

Cheers, Dan
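Since the logs themselves say little about the missing objects, one thing that can still be read off the scrub errors quoted in this thread is the size of each discrepancy. The Python sketch below extracts the per-counter differences from a "stat mismatch" line. It assumes that each pair is printed as counted-by-scrub/recorded-in-PG-stats, which is an interpretation rather than something the thread states; the function name stat_deltas is made up for illustration.

```python
import re

# Matches pairs such as "5833/5834 objects" or "0/0 hit_set_archive bytes".
STAT = re.compile(r"(\d+)/(\d+) ([a-z_]+(?: bytes)?)")


def stat_deltas(line):
    """Return {counter: second - first} for every counter where the pair differs."""
    _, _, stats = line.partition("got ")
    deltas = {}
    for first, second, label in STAT.findall(stats):
        diff = int(second) - int(first)
        if diff:
            deltas[label] = diff
    return deltas


lines = [
    "34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, "
    "0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes,0/0 hit_set_archive bytes.",
    "55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, "
    "0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.",
    "36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, "
    "0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes,"
    "0/0 hit_set_archive bytes.",
]
for line in lines:
    print(line.split()[0], stat_deltas(line))
```

Each PG is off by exactly one object. For 34.10d and 55.10d the byte counts agree, while for 36.10d the byte difference is 4194304, that is, exactly 4 MiB.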