Re: [ceph-users] 10d

2015-07-22 Thread Dan van der Ster
I just filed a ticket after trying ceph-objectstore-tool:
http://tracker.ceph.com/issues/12428


Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
A bit of progress: rm'ing everything from inside current/36.10d_head/
actually let the OSD start and continue deleting other PGs.
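
Roughly, that cleanup looks like the following (a sketch only, not the
exact commands used here; stop the OSD first, keep a copy of whatever
you remove, and adjust the paths to the OSD in question):

# with the OSD stopped: stash the leftovers, then empty the PG dir in place
/etc/init.d/ceph stop osd.30
cd /var/lib/ceph/osd/ceph-30/current
tar czf /root/36.10d_head.leftovers.tgz 36.10d_head/
find 36.10d_head/ -mindepth 1 -delete   # remove the contents, keep the dir
/etc/init.d/ceph start osd.30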

Cheers, Dan


Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Hi Greg + list,

Sorry to reply to this old'ish thread, but today one of these PGs bit
us in the ass.

Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171,
and 69 all crash when trying to delete pg 36.10d. They all crash with

   ENOTEMPTY suggests garbage data in osd data dir

(full log below). There is indeed some garbage in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9
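
The stray file's name itself points back at this PG: the trailing __24
is the pool id in hex (0x24 = 36), and BD49D10D is the object's hash,
whose low bits select the PG. A quick check, assuming pg_num here is
4096:

printf '%d\n' 0x24                        # -> 36, i.e. pool 36
printf '%x\n' $(( 0xBD49D10D & 0xFFF ))   # -> 10d, i.e. pg 36.10d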


Do you have any suggestions on how to get these OSDs back up and
running? We already tried manually moving 36.10d_head to
36.10d_head.bak, but then the OSD crashes for a different reason:

-1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
36.10d coll 36.10d_head
 0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
15:07:42.442902
osd/PG.cc: 2839: FAILED assert(r > 0)


Any clues?

Cheers, Dan

2015-07-17 14:40:54.493935 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "36.10d_head",
            "oid": "10d\/\/head\/\/36"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "36.10d_head"
        }
    ]
}

2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
std::allocator<ObjectStore::Transaction*> >&, unsigned long,
ThreadPool::TPHandle*)+0x64) [0x97d794]
3: (FileStore::_do_op(FileStore::OpSequencer*,
ThreadPool::TPHandle&)+0x2a0) [0x97da50]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
6: /lib64/libpthread.so.0() [0x3fbec079d1]
7: (clone()+0x6d) [0x3fbe8e88fd]



Re: [ceph-users] 10d

2015-07-17 Thread Gregory Farnum
I think you'll need to use the ceph-objectstore-tool to remove the
PG/data consistently, but I've not done this — David or Sam will need
to chime in.
-Greg
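
For reference, on a hammer OSD that invocation is roughly as follows (a
sketch with example paths; the OSD has to be stopped, and exporting the
PG first gives you a way back if the removal goes wrong):

# with the OSD stopped: export the PG as a safety net, then remove it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --op export --pgid 36.10d --file /root/36.10d.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --op remove --pgid 36.10d

(Per the ticket at the top of this thread, the tool apparently hit its
own problem on this particular PG.)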



Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Thanks for the quick reply.

We /could/ just wipe these OSDs and start from scratch (the only other
pools were 4+2 EC, and recovery already brought us back to 100%
active+clean).

But it'd be good to understand and prevent this kind of crash...

Cheers, Dan





Re: [ceph-users] 10d

2015-06-17 Thread Gregory Farnum
On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs, PG 34.10d went inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one object is missing. In all cases, the PG id is 10d.
 Is this an epic coincidence, or could something else be going on here?

I'm betting on something else. What OSDs is each PG mapped to?
It looks like each of them is missing one object on some of the OSDs,
what are the objects?
-Greg
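
The mapping itself is quick to pull with ceph pg map, which prints the
up and acting sets for a PG, e.g.:

ceph pg map 34.10d
# -> osdmap eNNNN pg 34.10d (34.10d) -> up [52,202,218] acting [52,202,218]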


Re: [ceph-users] 10d

2015-06-17 Thread Dan van der Ster
On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 I'm betting on something else. What OSDs is each PG mapped to?
 It looks like each of them is missing one object on some of the OSDs,
 what are the objects?

34.10d: [52,202,218]
55.10d: [303,231,65]
36.10d: [30,171,69]

So no common OSDs. I've already repaired all of these PGs, and logs
have nothing interesting, so I can't say more about the objects.
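
For the record, "repaired" here is just the usual per-PG cycle, roughly:

ceph pg repair 34.10d
ceph pg deep-scrub 34.10d     # re-verify once the repair completes
ceph health detail            # the inconsistent flag should clear

and the same for 55.10d and 36.10d.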

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com