Re: [ceph-users] OSDs going down/up at random

2018-01-10 Thread Brad Hubbard
On Wed, Jan 10, 2018 at 8:32 PM, Mike O'Connor  wrote:
> On 10/01/2018 4:48 PM, Mike O'Connor wrote:
>> On 10/01/2018 4:24 PM, Sam Huracan wrote:
>>> Hi Mike,
>>>
>>> Could you show the system log from the moment the OSDs go down and up?
> So now I know it's a crash, what's my next step? As soon as I put the
> system under write load, OSDs start crashing.

Could be this issue (or at least related).

http://tracker.ceph.com/issues/22102

You can start by adding information about your configuration, how/when
you see the crash, and the stack trace to that tracker.

I'd also look at the more detailed log for osd12, which should give
more information.
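
With default logging that is normally /var/log/ceph/ceph-osd.12.log on
the node hosting that OSD. If it turns out to be too sparse, something
along these lines (adjust the OSD id as needed) should raise the
relevant debug levels before reproducing the write load:

  ceph tell osd.12 injectargs '--debug_osd 20 --debug_bluestore 20 --debug_rocksdb 20'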

>
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs going down/up at random

2018-01-10 Thread Mike O'Connor
On 10/01/2018 4:48 PM, Mike O'Connor wrote:
> On 10/01/2018 4:24 PM, Sam Huracan wrote:
>> Hi Mike,
>>
>> Could you show the system log from the moment the OSDs go down and up?
So now I know it's a crash, what's my next step? As soon as I put the
system under write load, OSDs start crashing.
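
If anyone wants to generate a similar write load against a test pool,
something like rados bench should do it (the pool name here is just an
example):

  rados bench -p rbd 60 write -t 16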

Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs going down/up at random

2018-01-09 Thread Mike O'Connor
On 10/01/2018 4:24 PM, Sam Huracan wrote:
> Hi Mike,
>
> Could you show the system log from the moment the OSDs go down and up?
OK, so I have no idea how I missed this each time I looked, but the
syslog does show a problem.

I've created the dump file mentioned in the log; it's 29M compressed, so
I'll have to send it directly to anyone who wants it.

Mike

--
Jan 10 15:56:31 pve ceph-osd[2722]: 2018-01-10 15:56:31.338068
7efe5eac1700 -1 abort: Corruption: block checksum mismatch
Jan 10 15:56:31 pve ceph-osd[2722]: *** Caught signal (Aborted) **
Jan 10 15:56:31 pve ceph-osd[2722]:  in thread 7efe5eac1700
thread_name:tp_osd_tp
Jan 10 15:56:31 pve ceph-osd[2722]:  ceph version 12.2.2
(215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 10 15:56:31 pve ceph-osd[2722]:  1: (()+0xa16664) [0x55a8b396b664]
Jan 10 15:56:31 pve ceph-osd[2722]:  2: (()+0x110c0) [0x7efe796b70c0]
Jan 10 15:56:31 pve ceph-osd[2722]:  3: (gsignal()+0xcf) [0x7efe7867efcf]
Jan 10 15:56:31 pve ceph-osd[2722]:  4: (abort()+0x16a) [0x7efe786803fa]
Jan 10 15:56:31 pve ceph-osd[2722]:  5:
(RocksDBStore::get(std::__cxx11::basic_string const&, char const*,
unsigned long, ceph::buffer::list*)+0x29f) [0x55a8b38a995f]
Jan 10 15:56:31 pve ceph-osd[2722]:  6:
(BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae)
[0x55a8b382d2ae]
Jan 10 15:56:31 pve ceph-osd[2722]:  7:
(BlueStore::getattr(boost::intrusive_ptr&,
ghobject_t const&, char const*, ceph::buffer::ptr&)+0xf6) [0x55a8b382e326]
Jan 10 15:56:31 pve ceph-osd[2722]:  8:
(PGBackend::objects_get_attr(hobject_t const&,
std::__cxx11::basic_string const&, ceph::buffer::list*)+0x106) [0x55a8b35bde26]
Jan 10 15:56:31 pve ceph-osd[2722]:  9:
(PrimaryLogPG::get_snapset_context(hobject_t const&, bool,
std::map, ceph::buffer::list,
std::less >,
std::allocator const,
ceph::buffer::list> > > const*, bool)+0x3fb) [0x55a8b35081db]
Jan 10 15:56:31 pve ceph-osd[2722]:  10:
(PrimaryLogPG::get_object_context(hobject_t const&, bool,
std::map, ceph::buffer::list,
std::less >,
std::allocator const,
ceph::buffer::list> > > const*)+0xc39) [0x55a8b352fec9]
Jan 10 15:56:31 pve ceph-osd[2722]:  11:
(PrimaryLogPG::find_object_context(hobject_t const&,
std::shared_ptr*, bool, bool, hobject_t*)+0x387)
[0x55a8b3533687]
Jan 10 15:56:31 pve ceph-osd[2722]:  12:
(PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2214)
[0x55a8b3571694]
Jan 10 15:56:31 pve ceph-osd[2722]:  13:
(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xec6) [0x55a8b352c436]
Jan 10 15:56:31 pve ceph-osd[2722]:  14:
(OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3ab)
[0x55a8b33a99eb]
Jan 10 15:56:31 pve ceph-osd[2722]:  15:
(PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x5a) [0x55a8b3647eba]
Jan 10 15:56:31 pve ceph-osd[2722]:  16:
(OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x103d) [0x55a8b33d0f4d]
Jan 10 15:56:31 pve ceph-osd[2722]:  17:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef)
[0x55a8b39b806f]
Jan 10 15:56:31 pve ceph-osd[2722]:  18:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a8b39bb370]
Jan 10 15:56:31 pve ceph-osd[2722]:  19: (()+0x7494) [0x7efe796ad494]
Jan 10 15:56:31 pve ceph-osd[2722]:  20: (clone()+0x3f) [0x7efe78734aff]
Jan 10 15:56:31 pve ceph-osd[2722]: 2018-01-10 15:56:31.343532
7efe5eac1700 -1 *** Caught signal (Aborted) **
Jan 10 15:56:31 pve ceph-osd[2722]:  in thread 7efe5eac1700
thread_name:tp_osd_tp
Jan 10 15:56:31 pve ceph-osd[2722]:  ceph version 12.2.2
(215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)
Jan 10 15:56:31 pve ceph-osd[2722]:  1: (()+0xa16664) [0x55a8b396b664]
Jan 10 15:56:31 pve ceph-osd[2722]:  2: (()+0x110c0) [0x7efe796b70c0]
Jan 10 15:56:31 pve ceph-osd[2722]:  3: (gsignal()+0xcf) [0x7efe7867efcf]
Jan 10 15:56:31 pve ceph-osd[2722]:  4: (abort()+0x16a) [0x7efe786803fa]
Jan 10 15:56:31 pve ceph-osd[2722]:  5:
(RocksDBStore::get(std::__cxx11::basic_string const&, char const*,
unsigned long, ceph::buffer::list*)+0x29f) [0x55a8b38a995f]
Jan 10 15:56:31 pve ceph-osd[2722]:  6:
(BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x5ae)
[0x55a8b382d2ae]
Jan 10 15:56:31 pve ceph-osd[2722]:  7:
(BlueStore::getattr(boost::intrusive_ptr&,
ghobject_t const&, char const*, ceph::buffer::ptr&)+0xf6) [0x55a8b382e326]
Jan 10 15:56:31 pve ceph-osd[2722]:  8:
(PGBackend::objects_get_attr(hobject_t const&,
std::__cxx11::basic_string

Re: [ceph-users] OSDs going down/up at random

2018-01-09 Thread Sam Huracan
Hi Mike,

Could you show the system log from the moment the OSDs go down and up?

On Jan 10, 2018 12:52, "Mike O'Connor"  wrote:

> On 10/01/2018 3:52 PM, Linh Vu wrote:
> >
> > Have you checked your firewall?
> >
> There are no iptables rules at this time, but connection tracking is
> enabled. I would expect errors about running out of table space if that
> was an issue.
>
> Thanks
> Mike
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs going down/up at random

2018-01-09 Thread Mike O'Connor
On 10/01/2018 3:52 PM, Linh Vu wrote:
>
> Have you checked your firewall?
>
There are no iptables rules at this time, but connection tracking is
enabled. I would expect errors about running out of table space if that
was an issue.
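
For what it's worth, conntrack usage versus the limit can be checked
with something like:

  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
  dmesg | grep -i conntrack

Table exhaustion normally shows up as "nf_conntrack: table full,
dropping packet" in dmesg.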

Thanks
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs going down/up at random

2018-01-09 Thread Linh Vu
Have you checked your firewall?
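
For example, assuming iptables is what's in use, something like this
shows whether any rules or drops are in play, and whether the OSDs are
listening on the usual port range (6800-7300/tcp by default):

  iptables -L -n -v
  ss -tlnp | grep ceph-osd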


From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Mike O'Connor 
<m...@oeg.com.au>
Sent: Wednesday, 10 January 2018 3:40:30 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] OSDs going down/up at random

Hi All

I have a Ceph host (12.2.2) with 14 OSDs which seem to go down and then
come back up at random. What should I look at to try to identify the issue?
The system has three LSI SAS9201-8i cards, which are connected to 14
drives at this time (with the option of 24 drives).
I have three of these chassis but only one is running right now, so I
have Ceph set up as a single node.

I have very carefully looked at the log files and not found anything
which indicates any issues with the controller or the drives.

dmesg has these messages.
---
[78752.708932] libceph: osd3 10.1.6.2:6834 socket closed (con state OPEN)
[78752.710319] libceph: osd3 10.1.6.2:6834 socket closed (con state
CONNECTING)
[78753.426244] libceph: osd3 down
[78753.426640] libceph: osd3 down
[78776.496962] libceph: osd5 10.1.6.2:6810 socket closed (con state OPEN)
[78776.498626] libceph: osd5 10.1.6.2:6810 socket closed (con state
CONNECTING)
[78777.446384] libceph: osd5 down
[78777.446720] libceph: osd5 down
[78806.466973] libceph: osd3 up
[78806.467429] libceph: osd3 up
[78855.565098] libceph: osd10 10.1.6.2:6801 socket closed (con state OPEN)
[78855.567062] libceph: osd10 10.1.6.2:6801 socket closed (con state
CONNECTING)
[78856.554209] libceph: osd10 down
[78856.554357] libceph: osd10 down
[78868.265665] libceph: osd1 10.1.6.2:6830 socket closed (con state OPEN)
[78868.266347] libceph: osd1 10.1.6.2:6830 socket closed (con state
CONNECTING)
[78868.529575] libceph: osd1 down
[78869.469264] libceph: osd1 down
[78899.538533] libceph: osd10 up
[78899.538808] libceph: osd10 up
[78903.556418] libceph: osd5 up
[78905.309401] libceph: osd5 up
[78909.755499] libceph: osd1 up
[78912.008581] libceph: osd1 up
[78912.040872] libceph: osd4 10.1.6.2:6850 socket error on write
[78924.736964] libceph: osd8 10.1.6.2:6809 socket closed (con state OPEN)
[78924.738402] libceph: osd8 10.1.6.2:6809 socket closed (con state
CONNECTING)
[78925.602597] libceph: osd8 down
[78925.602942] libceph: osd8 down
[78988.648108] libceph: osd8 up
[78988.648462] libceph: osd8 up
[79010.808917] libceph: osd4 10.1.6.2:6850 socket closed (con state OPEN)
[79010.810722] libceph: osd4 10.1.6.2:6850 socket closed (con state
CONNECTING)
[79011.617598] libceph: osd4 down
[79011.617861] libceph: osd4 down
[79072.772966] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.773434] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.774219] libceph: osd14 10.1.6.2:6854 socket closed (con state
CONNECTING)
[79073.657383] libceph: osd14 down
[79073.657552] libceph: osd14 down
[79082.565025] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.565814] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.566279] libceph: osd13 10.1.6.2:6846 socket closed (con state
CONNECTING)
[79082.670861] libceph: osd13 down
[79082.671023] libceph: osd13 down
[79115.435180] libceph: osd14 up
[79115.435989] libceph: osd14 up
[79117.603991] libceph: osd13 up
[79118.557601] libceph: osd13 up
[79154.719547] libceph: osd4 up
[79154.720232] libceph: osd4 up
[79175.900935] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[79175.902922] libceph: osd12 10.1.6.2:6822 socket closed (con state
CONNECTING)
[79176.650847] libceph: osd12 down
[79176.651138] libceph: osd12 down
[79219.762665] libceph: osd12 up
[79219.763090] libceph: osd12 up
[79252.405666] libceph: osd11 10.1.6.2:6805 socket closed (con state OPEN)
[79252.406349] libceph: osd11 10.1.6.2:6805 socket closed (con state
CONNECTING)
[79252.462748] libceph: osd11 down
[79252.462855] libceph: osd11 down
[79285.656850] libceph: osd11 up
[79285.657341] libceph: osd11 up
[80558.024975] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.025751] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.026341] libceph: osd13 10.1.6.2:6854 socket closed (con state
CONNECTING)
[80558.652903] libceph: osd13 10.1.6.2:6854 socket error on write
[80558.734330] libceph: osd13 down
[80558.734501] libceph: osd13 down
[80590.753493] libceph: osd13 up
[80592.884936] libceph: osd13 up
[80592.897062] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[90351.841800] libceph: osd1 down
[90371.299988] libceph: osd1 down
[90391.238370] libceph: osd1 up
[90391.778979] libceph: osd1 up

Thanks for any help/ideas
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] OSDs going down/up at random

2018-01-09 Thread Mike O'Connor
Hi All

I have a Ceph host (12.2.2) with 14 OSDs which seem to go down and then
come back up at random. What should I look at to try to identify the issue?
The system has three LSI SAS9201-8i cards, which are connected to 14
drives at this time (with the option of 24 drives).
I have three of these chassis but only one is running right now, so I
have Ceph set up as a single node.

I have very carefully looked at the log files and not found anything
which indicates any issues with the controller or the drives.

dmesg has these messages.
---
[78752.708932] libceph: osd3 10.1.6.2:6834 socket closed (con state OPEN)
[78752.710319] libceph: osd3 10.1.6.2:6834 socket closed (con state
CONNECTING)
[78753.426244] libceph: osd3 down
[78753.426640] libceph: osd3 down
[78776.496962] libceph: osd5 10.1.6.2:6810 socket closed (con state OPEN)
[78776.498626] libceph: osd5 10.1.6.2:6810 socket closed (con state
CONNECTING)
[78777.446384] libceph: osd5 down
[78777.446720] libceph: osd5 down
[78806.466973] libceph: osd3 up
[78806.467429] libceph: osd3 up
[78855.565098] libceph: osd10 10.1.6.2:6801 socket closed (con state OPEN)
[78855.567062] libceph: osd10 10.1.6.2:6801 socket closed (con state
CONNECTING)
[78856.554209] libceph: osd10 down
[78856.554357] libceph: osd10 down
[78868.265665] libceph: osd1 10.1.6.2:6830 socket closed (con state OPEN)
[78868.266347] libceph: osd1 10.1.6.2:6830 socket closed (con state
CONNECTING)
[78868.529575] libceph: osd1 down
[78869.469264] libceph: osd1 down
[78899.538533] libceph: osd10 up
[78899.538808] libceph: osd10 up
[78903.556418] libceph: osd5 up
[78905.309401] libceph: osd5 up
[78909.755499] libceph: osd1 up
[78912.008581] libceph: osd1 up
[78912.040872] libceph: osd4 10.1.6.2:6850 socket error on write
[78924.736964] libceph: osd8 10.1.6.2:6809 socket closed (con state OPEN)
[78924.738402] libceph: osd8 10.1.6.2:6809 socket closed (con state
CONNECTING)
[78925.602597] libceph: osd8 down
[78925.602942] libceph: osd8 down
[78988.648108] libceph: osd8 up
[78988.648462] libceph: osd8 up
[79010.808917] libceph: osd4 10.1.6.2:6850 socket closed (con state OPEN)
[79010.810722] libceph: osd4 10.1.6.2:6850 socket closed (con state
CONNECTING)
[79011.617598] libceph: osd4 down
[79011.617861] libceph: osd4 down
[79072.772966] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.773434] libceph: osd14 10.1.6.2:6854 socket closed (con state OPEN)
[79072.774219] libceph: osd14 10.1.6.2:6854 socket closed (con state
CONNECTING)
[79073.657383] libceph: osd14 down
[79073.657552] libceph: osd14 down
[79082.565025] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.565814] libceph: osd13 10.1.6.2:6846 socket closed (con state OPEN)
[79082.566279] libceph: osd13 10.1.6.2:6846 socket closed (con state
CONNECTING)
[79082.670861] libceph: osd13 down
[79082.671023] libceph: osd13 down
[79115.435180] libceph: osd14 up
[79115.435989] libceph: osd14 up
[79117.603991] libceph: osd13 up
[79118.557601] libceph: osd13 up
[79154.719547] libceph: osd4 up
[79154.720232] libceph: osd4 up
[79175.900935] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[79175.902922] libceph: osd12 10.1.6.2:6822 socket closed (con state
CONNECTING)
[79176.650847] libceph: osd12 down
[79176.651138] libceph: osd12 down
[79219.762665] libceph: osd12 up
[79219.763090] libceph: osd12 up
[79252.405666] libceph: osd11 10.1.6.2:6805 socket closed (con state OPEN)
[79252.406349] libceph: osd11 10.1.6.2:6805 socket closed (con state
CONNECTING)
[79252.462748] libceph: osd11 down
[79252.462855] libceph: osd11 down
[79285.656850] libceph: osd11 up
[79285.657341] libceph: osd11 up
[80558.024975] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.025751] libceph: osd13 10.1.6.2:6854 socket closed (con state OPEN)
[80558.026341] libceph: osd13 10.1.6.2:6854 socket closed (con state
CONNECTING)
[80558.652903] libceph: osd13 10.1.6.2:6854 socket error on write
[80558.734330] libceph: osd13 down
[80558.734501] libceph: osd13 down
[80590.753493] libceph: osd13 up
[80592.884936] libceph: osd13 up
[80592.897062] libceph: osd12 10.1.6.2:6822 socket closed (con state OPEN)
[90351.841800] libceph: osd1 down
[90371.299988] libceph: osd1 down
[90391.238370] libceph: osd1 up
[90391.778979] libceph: osd1 up

Thanks for any help/ideas
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com