Hi Greg,
many thanks. This is a new cluster, created initially with Luminous
12.2.0, so I'm not sure the instructions for Jewel really apply to my
case, but all the machines have NTP enabled and I'll have a look; many
thanks for the link. All machines are set to CET, although I'm running
inside Docker containers which use UTC internally, but they are all
consistent.
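For completeness, the time-sync angle is cheap to verify; a minimal sketch (nothing here is specific to this cluster) that checks both the host clocks and the skew the monitors measure between cluster members:

    # On each host: clock, timezone, and NTP sync state
    timedatectl status

    # Ask the monitors for the measured clock skew between cluster members
    ceph time-sync-status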
At the moment, after marking 5 of the OSDs out, the cluster has resumed,
and now I'm recreating those OSDs to be on the safe side.
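For reference, the mark-out-and-recreate cycle mentioned above, as a minimal sketch for a single OSD; the OSD id and device name are purely illustrative, and this assumes a ceph-volume BlueStore deployment:

    # Mark the OSD out so PGs rebalance away from it
    ceph osd out 2

    # Stop the daemon and remove the OSD from the cluster entirely
    systemctl stop ceph-osd@2
    ceph osd purge 2 --yes-i-really-mean-it

    # Wipe the device and create a fresh BlueStore OSD on it
    ceph-volume lvm zap /dev/sdb
    ceph-volume lvm create --bluestore --data /dev/sdb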
Thanks,
Alessandro
On 31/01/18 19:26, Gregory Farnum wrote:
On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo
<[email protected]> wrote:
Hi,
several times a day, different OSDs running Luminous 12.2.2 with
BlueStore crash with errors like this:
starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors {default=true}
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 12819: FAILED assert(obc)
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x556c6df51550]
2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6) [0x556c6db5e106]
3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389) [0x556c6db78d39]
5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x556c6d9c0899]
7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x556c6dc38897]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x556c6df57069]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
11: (()+0x7e25) [0x7f1e16c17e25]
12: (clone()+0x6d) [0x7f1e15d0b34d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Is this a known issue? How can we fix it?
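For context on what the assert means: hit_set_persist() is cache-tiering code that archives HitSet objects and then calls hit_set_trim() to delete the oldest archives, and FAILED assert(obc) fires when the OSD cannot load the object context for an archive object it expects to exist. A hedged way to see which hit_set archive objects a cache pool actually holds (the pool name is hypothetical):

    # List hit_set archive objects stored in the cache-tier pool
    rados -p my-cache-pool ls | grep hit_set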
Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but
that wasn't supposed to be a problem in Luminous. When was this
cluster created?
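In case it helps answer that question: the osdmap carries a creation timestamp, so one quick way to check is:

    # The full osdmap dump starts with epoch, fsid, created and modified stamps
    ceph osd dump | grep -m1 created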
There was a thread in October titled "[ceph-users] [Jewel] Crash Osd
with void Hit_set_trim" that had instructions for diagnosing and
dealing with it in Jewel; you might investigate that.
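That thread's exact instructions aren't reproduced here, but since the crash is about a missing hit_set archive object, a hedged, generic check is whether those objects still exist on a given OSD; an offline variant (OSD id and path illustrative, and the daemon must be stopped first):

    # With the OSD stopped, list the objects it holds on disk and
    # filter for hit_set archive entries
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 --op list | grep hit_set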
-Greg
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com