Hi Greg,

Many thanks. This is a new cluster, created initially with Luminous 12.2.0, so I'm not sure the Jewel instructions really apply to my case; all the machines have NTP enabled, but I'll have a look, thanks for the link. All machines are set to CET, although I'm running in Docker containers which use UTC internally, but they are all consistent.
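For completeness, this is roughly the check I run on each node to confirm the clocks agree (a sketch; the timezone may legitimately differ inside the containers, only the underlying clock needs to match):

```shell
# Clock-consistency check, to be run on every node.
# timedatectl may be absent on minimal images, hence the guard.
if command -v timedatectl >/dev/null 2>&1; then
  timedatectl | grep -Ei 'time zone|synchronized'
fi
# Epoch seconds are timezone-independent, so this value should be
# (nearly) identical across all nodes regardless of CET vs. UTC.
date -u +%s
```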

At the moment, after marking 5 of the OSDs out, the cluster resumed operation, and I'm now recreating those OSDs to be on the safe side.
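The out-and-recreate cycle looks roughly like this (a sketch only: the OSD ids and the device path are examples, not the real ones from my cluster, and the commands are echoed as a dry run):

```shell
# Dry-run sketch of marking OSDs out, then recreating them.
# Remove the leading "echo" to actually execute; OSD ids are examples.
for id in 2 5 7; do
  echo ceph osd out osd.$id     # triggers rebalancing of that OSD's PGs
done

# Once the cluster is HEALTH_OK again, remove and recreate each OSD.
# "ceph osd purge" (Luminous and later) removes the OSD from the CRUSH
# map, the auth database, and the OSD map in one step.
echo ceph osd purge osd.2 --yes-i-really-mean-it
echo ceph-volume lvm create --bluestore --data /dev/sdX   # example device
```

Waiting for the rebalance to finish before purging is what keeps the data safe during the recreate.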

Thanks,


    Alessandro


Il 31/01/18 19:26, Gregory Farnum ha scritto:
On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo <[email protected]> wrote:

    Hi,

    several times a day we have different OSDs running Luminous 12.2.2 with
    BlueStore crashing with errors like this:


    starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2
    /var/lib/ceph/osd/ceph-2/journal
    2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082
    log_to_monitors
    {default=true}
    
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
    In function 'void
    PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
    int)'
    thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
    
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
    12819: FAILED assert(obc)
      ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba)
    luminous (stable)
      1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
    const*)+0x110) [0x556c6df51550]
      2:
    (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
    std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x3b6)
    [0x556c6db5e106]
      3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
      4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2389)
    [0x556c6db78d39]
      5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
    ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
      6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
    boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9)
    [0x556c6d9c0899]
      7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
    const&)+0x57) [0x556c6dc38897]
      8: (OSD::ShardedOpWQ::_process(unsigned int,
    ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
      9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
    [0x556c6df57069]
      10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
    [0x556c6df59000]
      11: (()+0x7e25) [0x7f1e16c17e25]
      12: (clone()+0x6d) [0x7f1e15d0b34d]
      NOTE: a copy of the executable, or `objdump -rdS <executable>` is
    needed to interpret this.
    2018-01-30 13:45:29.505317 7f1dfd734700 -1 [the same assert and stack
    trace are then repeated verbatim in the crash dump]


    Is it a known issue? How can we fix that?



Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but that wasn't supposed to be a problem in Luminous. When was this cluster created?

There was a thread in October titled "[ceph-users] [Jewel] Crash Osd with void Hit_set_trim" that had instructions for diagnosing and dealing with it in Jewel; you might investigate that.
-Greg

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
