[ceph-users] ceph osd crash help needed

2019-08-12 Thread response
Hi all,

I have a production cluster on which I recently purged all snapshots.

Now, on a set of OSDs, backfilling triggers an assert like the one below:

-4> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=
[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit 
Started
/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.64
-3> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=
[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter 
Started/Primary/Active/Backfilling
-2> 2019-08-13 00:25:14.653 7ff4637b1700  5 osd.99 pg_epoch: 206049 
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] 
local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 
pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 
206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] 
backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.b4b8:head
-1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t 
SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 
00:25:14.759270
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED 
ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x152) [0x55989e4a6450]
 2: (()+0x517628) [0x55989e4a6628]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 4: 
(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, 
pg_stat_t*)+0x297) [0x55989e7b2197]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, 
bool*)+0xfdc) [0x55989e7e059c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, 
unsigned long*)+0x110b) [0x55989e7e468b]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) 
[0x55989e6544d7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
[0x55989ec2ba74]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 12: (()+0x7fa3) [0x7ff47f718fa3]
 13: (clone()+0x3f) [0x7ff47f2c84cf]

 0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
 in thread 7ff4637b1700 thread_name:tp_osd_tp

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus 
(stable)
 1: (()+0x12730) [0x7ff47f723730]
 2: (gsignal()+0x10b) [0x7ff47f2067bb]
 3: (abort()+0x121) [0x7ff47f1f1535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a3) [0x55989e4a64a1]
 5: (()+0x517628) [0x55989e4a6628]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 7: 
(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr, 
pg_stat_t*)+0x297) [0x55989e7b2197]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, 
bool*)+0xfdc) [0x55989e7e059c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, 
unsigned long*)+0x110b) [0x55989e7e468b]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, 
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 12: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
[0x55989ec2ba74]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 15: (()+0x7fa3) [0x7ff47f718fa3]
 16: (clone()+0x3f) [0x7ff47f2c84cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.



 FAILED ceph_assert(clone_overlap.count(clone))

If possible I'd like to 'nuke' this from the OSD, since there are no snaps 
active, but I would love some advice on the best way to go about this. 
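
For what it's worth, the rough approach I had in mind is to stop the affected 
OSD and poke at the object's clone metadata offline with ceph-objectstore-tool. 
Everything below is untested, and the OSD id, object spec and clone id are just 
placeholders taken from the log above:

# stop the OSD so its store can be opened offline
systemctl stop ceph-osd@99

# locate the object reported at backfill_pos in the crash log
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
    --pgid 0.12ed --op list | grep rbd_data.dae7bc6b8b4567

# dump the object's info (including its SnapSet / clone metadata)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
    --pgid 0.12ed '<json object spec from the list output>' dump

# if a leftover clone shows up, drop the metadata for that clone id
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
    --pgid 0.12ed '<json object spec>' remove-clone-metadata <cloneid>

systemctl start ceph-osd@99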

best regards
Kevin Myers

___
ceph-users mailing list
ceph-users@lists.ceph.com

[ceph-users] backfilling causing a crash in osd.

2019-08-02 Thread response
Hi All,

I've got a strange situation that hopefully someone can help with.

We have a backfill occurring that never completes; the destination OSD of the 
recovery predictably crashes. Outing the destination OSD so that another OSD 
takes over the backfill just causes a different OSD in the cluster to crash 
instead: boot, crash, rinse and repeat. 

The logs show:

--- begin dump of recent events ---
-2> 2019-08-02 06:26:16.17 7ff9fadf6700  5 -- 10.1.100.22:6808/365 
>> 10.1.100.6:0/3789781062 conn(0x55d272342000 :6808 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4238 cs=1 l=1). rx 
client.352388821 seq 74 0x55d2723ad740 osd_op(client.352388821.0:46698064 
0.142e 0.b4eab42e (undecoded) ondisk+write+known_if_redirected e174744) v8
-1> 2019-08-02 06:26:16.133367 7ff9fadf6700  1 -- 10.1.100.22:6808/365 
<== client.352388821 10.1.100.6:0/3789781062 74  
osd_op(client.352388821.0:46698064 0.142e 0.b4eab42e (undecoded) 
ondisk+write+known_if_redirected e174744) v8  248+0+16384 (881189615 0 
2173568771) 0x55d2723ad740 con 0x55d272342000
 0> 2019-08-02 06:26:16.185021 7ff9df594700 -1 *** Caught signal (Aborted) 
**
 in thread 7ff9df594700 thread_name:tp_osd_tp

 ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous 
(stable)
 1: (()+0xa59c94) [0x55d25900ec94]
 2: (()+0x110e0) [0x7ff9fe9a10e0]
 3: (gsignal()+0xcf) [0x7ff9fd968fff]
 4: (abort()+0x16a) [0x7ff9fd96a42a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x28e) [0x55d2590573ee]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo 
const&, std::shared_ptr, bool, 
ObjectStore::Transaction*)+0x1287) [0x55d258bad597]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, 
ObjectStore::Transaction*)+0x305) [0x55d258d3d6e5]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x12e) 
[0x55d258d3d8fe]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2e3) 
[0x55d258d4d723]
 10: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) 
[0x55d258c50ce0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x4f1) [0x55d258bb44a1]
 12: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3ab) [0x55d258a21dcb]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x5a) [0x55d258cda97a]
 14: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x102d) [0x55d258a4fdbd]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) 
[0x55d25905c0cf]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d25905f3d0]
 17: (()+0x74a4) [0x7ff9fe9974a4]
 18: (clone()+0x3f) [0x7ff9fda1ed0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.35.log
--- end dump of recent events ---
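
In case it is useful, the next thing I was planning to try is to raise 
debugging on whichever OSD picks up the backfill next and re-capture the 
crash; roughly the following, where osd.35 is just the OSD from the log above 
and the pgid is a placeholder:

# bump OSD/messenger debugging at runtime on the suspect OSD
ceph tell osd.35 injectargs '--debug_osd 20 --debug_ms 1'

# see which PGs are backfilling and query the one involved
ceph pg dump pgs_brief | grep -i backfill
ceph pg <pgid> query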


Any help would be very much appreciated. 

All the Best 
Kevin 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RADOSGW err=Input/output error

2018-07-04 Thread response
Hi Drew,

Try increasing debugging with:

debug ms = 1 
debug rgw = 20
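
Those go under your RGW client section in ceph.conf (e.g. [client.rgw.<name>], 
then restart the gateway), or you can try flipping them at runtime through the 
admin socket; the socket path below is just a guess at how your instance is 
named:

ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_rgw 20
ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_ms 1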

Regards
Kev


- Original Message -
From: "Drew Weaver" 
To: "ceph-users" 
Sent: Tuesday, July 3, 2018 1:39:55 PM
Subject: [ceph-users] RADOSGW err=Input/output error

An application is having general failures writing to a test cluster we have 
set up. 



2018-07-02 23:13:26.128282 7fe00b560700 0 WARNING: set_req_state_err err_no=5 
resorting to 500 

2018-07-02 23:13:26.128460 7fe00b560700 1 == req done req=0x7fe00b55a110 op 
status=-5 http_status=500 == 

2018-07-02 23:13:27.530236 7fe00b560700 1 civetweb: 0x55639acc: x - - 
[02/Jul/2018:23:12:55 -0400] "PUT /my-new-bucket/BEOST_0292/2752 HTTP/1.1" 
500 0 - APN/1.0 Veritas/1.0 BackupExec/20.0 

2018-07-02 23:13:27.532849 7fdfe2d0f700 1 == starting new request 
req=0x7fdfe2d09110 = 

2018-07-02 23:13:27.538476 7fdfe2d0f700 0 WARNING: set_req_state_err err_no=5 
resorting to 500 

2018-07-02 23:13:27.538554 7fdfe2d0f700 0 ERROR: 
RESTFUL_IO(s)->complete_header() returned err=Input/output error 

2018-07-02 23:13:27.538623 7fdfe2d0f700 1 == req done req=0x7fdfe2d09110 op 
status=-5 http_status=500 == 

2018-07-02 23:13:27.538683 7fdfe2d0f700 1 civetweb: 0x55639ae7d000: x - - 
[02/Jul/2018:23:13:27 -0400] "PUT /my-new-bucket/BEOST_0292/2753 HTTP/1.1" 
500 0 - APN/1.0 Veritas/1.0 BackupExec/20.0 

2018-07-02 23:13:28.252088 7fe002d4f700 1 == starting new request 
req=0x7fe002d49110 = 



I've done a bit of Googling and tried to match the time up with logs on other 
hosts, but there really doesn't seem to be much else happening at the time 
this error occurs. This is version 12.2.5. 



Does anyone have any hints on what I could look at to try to find the cause of 
this? 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com