Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-22 Thread Gregory Farnum
Adjusting CRUSH weight shouldn't have caused this. Unfortunately the logs
don't have a lot of hints — the thread that crashed doesn't have any output
except for the Crashed state. If you can reproduce this with more debugging
on, we ought to be able to track it down; if not, it seems we missed a
strange upgrade issue that others haven't run into. :/
-Greg
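
For reference, "more debugging on" here usually means raising the OSD debug
levels before retrying. The subsystems and levels below are the usual
suggestion for peering problems, not something taken from this thread:

  # in ceph.conf on the OSD hosts, then restart the OSDs:
  [osd]
      debug osd = 20
      debug ms = 1

  # or injected into running OSDs without a restart:
  ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'

Level 20 is very verbose, so it is normally turned back down once the problem
has been captured.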

On Tue, Aug 21, 2018 at 7:42 AM Kenneth Van Alstyne <
kvanalst...@knightpoint.com> wrote:

> After looking into this further, is it possible that adjusting CRUSH
> weight of the OSDs while running mis-matched versions of the ceph-osd
> daemon across the cluster can cause this issue?  Under certain
> circumstances in our cluster, this may happen automatically on the
> backend.  I can’t duplicate the issue in a lab, but highly suspect this is
> what happened.
>
> Thanks,

Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-21 Thread Kenneth Van Alstyne
After looking into this further, is it possible that adjusting CRUSH weight of 
the OSDs while running mis-matched versions of the ceph-osd daemon across the 
cluster can cause this issue?  Under certain circumstances in our cluster, this 
may happen automatically on the backend.  I can’t duplicate the issue in a lab, 
but highly suspect this is what happened.
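
For context, the weight changes being described are ordinary CRUSH reweights;
on the command line they look something like the following, where osd.12 and
the weight are only placeholders:

  ceph osd tree                           # show current CRUSH weights
  ceph osd crush reweight osd.12 1.81940  # change the CRUSH weight of one OSD

Nothing in the thread pins down the exact commands the backend automation
runs; this is just to illustrate the kind of operation in question.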

Thanks,


On Aug 17, 2018, at 4:01 PM, Gregory Farnum <gfar...@redhat.com> wrote:

Do you have more logs that indicate what state machine event the crashing OSDs 
received? This obviously shouldn't have happened, but it's a plausible failure 
mode, especially if it's a relatively rare combination of events.
-Greg


Re: [ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-17 Thread Gregory Farnum
Do you have more logs that indicate what state machine event the crashing
OSDs received? This obviously shouldn't have happened, but it's a plausible
failure mode, especially if it's a relatively rare combination of events.
-Greg
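
For anyone searching their own logs for the same thing: the assert string is a
convenient anchor, and if OSD debug logging was turned up, the peering event
that triggered it should appear in the dump of recent events just above the
assert. The path and OSD id below are only examples:

  grep -B100 'we got a bad state machine event' /var/log/ceph/ceph-osd.12.log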

On Fri, Aug 17, 2018 at 4:49 PM Kenneth Van Alstyne <
kvanalst...@knightpoint.com> wrote:

> Hello all:
> I ran into an issue recently with one of my clusters when
> upgrading from 10.2.10 to 12.2.7.  I have previously tested the upgrade in
> a lab and upgraded one of our five production clusters with no issues.  On
> the second cluster, however, every OSD that was NOT yet running Luminous
> (about 40% of the cluster at the time) crashed with the same backtrace,
> which I have pasted below:

[ceph-users] OSD Crash When Upgrading from Jewel to Luminous?

2018-08-17 Thread Kenneth Van Alstyne
Hello all:
I ran into an issue recently with one of my clusters when upgrading
from 10.2.10 to 12.2.7.  I have previously tested the upgrade in a lab and
upgraded one of our five production clusters with no issues.  On the second
cluster, however, every OSD that was NOT yet running Luminous (about 40% of the
cluster at the time) crashed with the same backtrace, which I have pasted below:

===
 0> 2018-08-13 17:35:13.160849 7f145c9ec700 -1 osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f145c9ec700 time 2018-08-13 17:35:13.157319
osd/PG.cc: 5860: FAILED assert(0 == "we got a bad state machine event")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x55b9bf08614f]
 2: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state, (boost::statechart::history_mode)0>::my_context)+0xc4) [0x55b9bea62db4]
 3: (()+0x447366) [0x55b9bea9a366]
 4: (boost::statechart::simple_state, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x2f7) [0x55b9beac8b77]
 5: (boost::statechart::state_machine, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55b9beaab5bb]
 6: (PG::handle_peering_event(std::shared_ptr, PG::RecoveryCtx*)+0x384) [0x55b9bea7db14]
 7: (OSD::process_peering_events(std::__cxx11::list > const&, ThreadPool::TPHandle&)+0x263) [0x55b9be9d1723]
 8: (ThreadPool::BatchWorkQueue::_void_process(void*, ThreadPool::TPHandle&)+0x2a) [0x55b9bea1274a]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xeb0) [0x55b9bf076d40]
 10: (ThreadPool::WorkThread::entry()+0x10) [0x55b9bf077ef0]
 11: (()+0x7507) [0x7f14e2c96507]
 12: (clone()+0x3f) [0x7f14e0ca214f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
===
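
As an aside on the NOTE line above: it is asking for a disassembly of the
crashing binary so the raw addresses in the trace can be mapped back to
source. For a packaged OSD that would be something like the following, with
the path depending on the distribution:

  objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis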

Once I restarted the impacted OSDs, which brought them up to 12.2.7, everything 
recovered just fine and the cluster is healthy.  The only rub is that losing 
that many OSDs simultaneously caused a significant I/O disruption to the 
production servers for several minutes while I brought up the remaining OSDs.  
I have been trying to duplicate this issue in a lab again before continuing the 
upgrades on the other three clusters, but am coming up short.  Has anyone seen 
anything like this and am I missing something obvious?

Given how quickly the issue happened and the fact that I’m having a hard time
reproducing it, I am limited in the amount of logging and debug
information I have available, unfortunately.  If it helps, all ceph-mon, 
ceph-mds, radosgw, and ceph-mgr daemons were running 12.2.7, while 30 of the 50 
total ceph-osd daemons were also on 12.2.7 when the remaining 20 ceph-osd 
daemons (on 10.2.10) crashed.
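
As a side note, a mixed-version state like this is easy to confirm from the
cluster itself; with the mons already on Luminous (as they were here),
something like the following will show which daemons are still on Jewel:

  ceph versions            # summary of running versions per daemon type (needs Luminous mons)
  ceph tell osd.* version  # ask each OSD for its version directly

Neither command comes from the thread; they are just the usual way to
double-check before continuing the OSD restarts.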

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com 
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com