[ceph-users] Mon crashes virtual void LogMonitor::update_from_paxos(bool*)

2020-01-15 Thread Kevin Hrpcek
f0946674a00 10 mon.sephmon5@-1(probing).log 
v86521000 update_from_paxos version 86521000 summary v 0
  -261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log 
v86521000 update_from_paxos latest full 86520999
  -261> 2020-01-15 16:36:46.084 7f0946674a00  7 mon.sephmon5@-1(probing).log 
v86521000 update_from_paxos loading summary e86520999
  -261> 2020-01-15 16:36:46.084 7f0946674a00  7 mon.sephmon5@-1(probing).log 
v86521000 update_from_paxos loaded summary e86520999
  -261> 2020-01-15 16:36:46.085 7f0946674a00 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc:
 In function 'virtual void LogMonitor::update_from_paxos(bool*)' thread 
7f0946674a00 time 2020-01-15 16:36:46.084573
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc:
 103: FAILED assert(err == 0)

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x7f093da5087b]
 2: (()+0x26fa07) [0x7f093da50a07]
 3: (LogMonitor::update_from_paxos(bool*)+0x19d3) [0x55e5a0378de3]
 4: (PaxosService::refresh(bool*)+0x22b) [0x55e5a0427e1b]
 5: (Monitor::refresh_from_paxos(bool*)+0xd3) [0x55e5a0325943]
 6: (Monitor::preinit()+0xac4) [0x55e5a0326784]
 7: (main()+0x2611) [0x55e5a021b2b1]
 8: (__libc_start_main()+0xf5) [0x7f09397c6505]
 9: (()+0x24ad40) [0x55e5a02fad40]

  -261> 2020-01-15 16:36:46.086 7f0946674a00 -1 *** Caught signal (Aborted) **
 in thread 7f0946674a00 thread_name:ceph-mon


--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] January Ceph Science Group Virtual Meeting

2020-01-13 Thread Kevin Hrpcek
Hello,

We will be having a Ceph science/research/big cluster call on Wednesday January 
22nd. If anyone wants to discuss something specific they can add it to the pad 
linked below. If you have questions or comments you can contact me.

This is an informal open call of community members mostly from hpc/htc/research 
environments where we discuss whatever is on our minds regarding ceph. Updates, 
outages, features, maintenance, etc...there is no set presenter but I do 
attempt to keep the conversation lively.

https://pad.ceph.com/p/Ceph_Science_User_Group_20200122

Ceph calendar event details:

January 22, 2020
9am US Central
4pm Central European

We try to keep it to an hour or less.

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph YouTube channel.
To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111


Kevin


--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
Hello,

This Wednesday we'll have a ceph science user group call. This is an informal 
conversation focused on using ceph in htc/hpc and scientific research 
environments.

Call details copied from the event:

Wednesday October 23rd
14:00 UTC
4:00PM Central European
10:00AM Eastern American

Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph YouTube channel.
To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
The first ceph + htc/hpc/science virtual user group meeting is tomorrow, 
Wednesday August 28th, at 10:30am US Eastern / 4:30pm EU Central time. Duration 
will be kept to <= 1 hour.

I'd like this to be conducted as a user group and not only one person 
talking/presenting. For this first meeting I'd like to get input from everyone 
on the call regarding what field they are in and how ceph is used as a solution 
for their implementation. We'll see where it goes from there. Use the pad link 
below to get to a url for live meeting notes.

Meeting connection details from the ceph community calendar:

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph YouTube channel.

To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.

Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #

Want to test your video connection? https://bluejeans.com/111

Kevin


On 8/2/19 12:08 PM, Mike Perez wrote:
We have scheduled the next meeting on the community calendar for August 28 at 
14:30 UTC. Each meeting will then take place on the last Wednesday of each 
month.

Here's the pad to collect agenda/notes: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

--
Mike Perez (thingee)


On Tue, Jul 23, 2019 at 10:40 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:
Update

We're going to hold off until August for this so we can promote it on the Ceph 
twitter with more notice. Sorry for the inconvenience if you were planning on 
the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for 
updates.

Kevin


On 7/5/19 11:15 PM, Kevin Hrpcek wrote:
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together a monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin









Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I change the crush weights. My 4 second sleep doesn't let peering finish for 
each one before continuing. I'd test with some small steps to get an idea of 
how much data remaps when increasing the weight by $x. I've found my cluster is 
comfortable with +1 increases...it also takes a while to get to a weight of 11 if 
I do anything smaller.

for i in {264..311}; do ceph osd crush reweight osd.${i} 11.0;sleep 4;done

Kevin

On 7/24/19 12:33 PM, Xavier Trilla wrote:
Hi Kevin,

Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by 
one. What do you change, the crush weight? Or the reweight? (I guess you change 
the crush weight, am I right?)

Thanks!



On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; 
you can obviously change the weight increase steps to what you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do the steps too quickly, but I don't run any mission-critical, 
must-be-up-100% workloads and I usually don't notice if a PG takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure the cluster is healthy.
Use a for loop to increase the weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous two steps, increasing the weight by +0.5 or +1.0, until I am 
at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current > 300 OSD cluster, and I would like 
to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.





Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; 
you can obviously change the weight increase steps to what you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do the steps too quickly, but I don't run any mission-critical, 
must-be-up-100% workloads and I usually don't notice if a PG takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure the cluster is healthy.
Use a for loop to increase the weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous two steps, increasing the weight by +0.5 or +1.0, until I am 
at the desired weight.
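
For illustration, here is a minimal Python sketch of that ramp-up, shelling out to
the ceph CLI. The OSD id range, step size, target weight, and the very rough
"healthy or close to healthy" check are just example placeholders; adjust them (and
add error handling) for a real cluster.

import json
import subprocess
import time

NEW_OSDS = range(264, 312)   # example: osd.264 .. osd.311
STEP = 1.0                   # crush weight increase per round
TARGET = 11.0                # final crush weight

def cluster_settled():
    # very rough "healthy or close to healthy" check via `ceph status`
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)["health"]["status"] in ("HEALTH_OK", "HEALTH_WARN")

weight = 0.0
while weight < TARGET:
    weight = min(weight + STEP, TARGET)
    for osd in NEW_OSDS:
        subprocess.check_call(
            ["ceph", "osd", "crush", "reweight", "osd.%d" % osd, str(weight)])
        time.sleep(4)        # generous pause so peering can keep up
    # let backfill/recovery calm down before the next weight bump
    while not cluster_settled():
        time.sleep(60)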

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current > 300 OSD cluster, and I would like 
to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.





Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-23 Thread Kevin Hrpcek
Update

We're going to hold off until August for this so we can promote it on the Ceph 
twitter with more notice. Sorry for the inconvenience if you were planning on 
the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for 
updates.

Kevin


On 7/5/19 11:15 PM, Kevin Hrpcek wrote:
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together a monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin





Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-05 Thread Kevin Hrpcek
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together a monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin





[ceph-users] Ceph Scientific Computing User Group

2019-06-17 Thread Kevin Hrpcek
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together a monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin


Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Kevin Hrpcek
   3.03%  libceph-common.so.0   [.] ceph::buffer::list::append
3.02%  libceph-common.so.0   [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo
2.92%  libceph-common.so.0   [.] ceph::buffer::ptr::release
2.65%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance
2.57%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo
2.27%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
1.99%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
1.93%  libc-2.17.so  [.] __memcpy_ssse3_back
1.91%  libceph-common.so.0   [.] ceph::buffer::ptr::append
1.87%  libceph-common.so.0   [.] crush_hash32_3@plt
1.84%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy
1.75%  libtcmalloc.so.4.4.5  [.] 
tcmalloc::CentralFreeList::FetchFromOneSpans
1.63%  libceph-common.so.0   [.] ceph::encode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
1.57%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
1.55%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
1.47%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length
1.33%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
1.09%  libceph-common.so.0   [.] ceph::decode >, denc_traits >, void> >
1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
0.85%  ceph-mon  [.] ceph::buffer::ptr::release@plt
0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
0.74%  libceph-common.so.0   [.] crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", 
{st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429)  = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", 
{st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429)  = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin


On 09/19/2018 06:50 AM, Sage Weil wrote:

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Sage,

Unfortunately the mon election problem came back yesterday and it makes
it really hard to get a cluster to stay healthy. A brief unexpected
network outage occurred and sent the cluster into a frenzy and when I
had it 95% healthy the mons started their nonstop reelections. In the
previous logs I sent were you able to identify why the mons are
constantly electing? The elections seem to be triggered by the below
paxos message but do you know which lease timeout is being reached or
why the lease isn't renewed instead of calling for an election?

One thing I tried was to shut down the entire cluster and bring up only
the mon and mgr. The mons weren't able to hold their quorum with no osds
running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
time.



This is odd... with no other dameons running I'm not sure what would be
eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
similar) on the machine to see what the process is doing?  You might need
to install ceph-mon-dbg or ceph-debuginfo to get better symbols.



2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
active c 133382665..133383355) lease_timeout -- calling new election



A workaround is probably to increase the lease timeout.  Try setting
mon_lease = 15 (default is 5... could also go higher than 15) in the
ceph.conf for all of the mons.  This is a bit of a band-aid but should
help you keep the mons in quorum until we sort out what is going on.
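
In ceph.conf terms (following the same [mon] section style used elsewhere in this
thread), that band-aid would look something like:

[mon]
mon lease = 15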

sage





Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on the
failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds 

Re: [ceph-users] Mimic upgrade failure

2018-09-12 Thread Kevin Hrpcek
I couldn't find any sign of a networking issue at the OS or switches. No 
changes have been made in those to get the cluster stable again. I 
looked through a couple OSD logs and here is a selection of some of the most 
frequent errors they were getting. Maybe something below is more obvious 
to you.


2018-09-09 18:17:33.245 7feb92079700  2 osd.84 991324 ms_handle_refused 
con 0x560e428b9800 session 0x560eb26b0060
2018-09-09 18:17:33.245 7feb9307b700  2 osd.84 991324 ms_handle_refused 
con 0x560ea639f000 session 0x560eb26b0060


2018-09-09 18:18:55.919 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424a3600 for osd.20, reopening
2018-09-09 18:18:55.919 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e447df600 session 0x560e9ec37680
2018-09-09 18:18:55.919 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e427a5600 session 0x560e9ec37680
2018-09-09 18:18:55.935 7feb92079700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e40afcc00 for osd.18, reopening
2018-09-09 18:18:55.935 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e44398c00 session 0x560e6a3a0620
2018-09-09 18:18:55.935 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e42f4ea00 session 0x560e6a3a0620
2018-09-09 18:18:55.939 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424c1e00 for osd.9, reopening
2018-09-09 18:18:55.940 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560ea4d09600 session 0x560e115e8120
2018-09-09 18:18:55.940 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e424a3600 session 0x560e115e8120
2018-09-09 18:18:55.956 7febadf54700 20 osd.84 991337 share_map_peer 
0x560e411ca600 already has epoch 991337


2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  new session 
0x560e40b5ce00 con=0x560e42471800 addr=10.1.9.13:6836/2276068
2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  session 
0x560e40b5ce00 osd.376 has caps osdcap[grant(*)] 'allow *'
2018-09-09 18:24:59.596 7feb9407d700  2 osd.84 991362 ms_handle_reset 
con 0x560e42471800 session 0x560e40b5ce00
2018-09-09 18:24:59.606 7feb9407d700  2 osd.84 991362 ms_handle_refused 
con 0x560e42d04600 session 0x560e10dfd000
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 
OSD::ms_get_authorizer type=osd
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 ms_get_authorizer 
bailing, we are shutting down
2018-09-09 18:24:59.633 7febad753700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e42326a00 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=18630 cs=1 
l=0).handle_connect_reply connect got BADAUTHORIZER


2018-09-09 18:22:56.434 7febadf54700  0 cephx: verify_authorizer could 
not decrypt ticket info: error: bad magic in decode_decrypt, 
3995972256093848467 != 18374858748799134293


2018-09-09 18:22:56.434 7febadf54700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e41fad600 :6848 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: got bad authorizer


2018-09-10 03:30:17.324 7ff0ab678700 -1 osd.84 992286 heartbeat_check: 
no reply from 10.1.9.28:6843 osd.578 since back 2018-09-10 
03:15:35.358240 front 2018-09-10 03:15:47.879015 (cutoff 2018-09-10 
03:29:17.326329)


Kevin


On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on the
failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage
  




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:


Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot more easily after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osds, or maybe I've missed some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:

Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought

Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Kevin Hrpcek

Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a 
fluctuating state of up and down. The cluster did start to normalize a 
lot more easily after everything was on mimic since the random mass OSD 
heartbeat failures stopped and the constant mon election problem went 
away. I'm still battling with the cluster reacting poorly to host 
reboots or small map changes, but I feel like my current pg:osd ratio 
may be a factor in that since we are 2x normal pg count while 
migrating data to new EC pools.


I'm not sure of the root cause but it seems like the mix of luminous and 
mimic did not play well together for some reason. Maybe it has to do 
with the scale of my cluster, 871 osds, or maybe I've missed some 
tuning as my cluster has scaled to this size.


Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
Nothing too crazy for non default settings. Some of those osd settings 
were in place while I was testing recovery speeds and need to be 
brought back closer to defaults. I was setting nodown before but it 
seems to mask the problem. While it's good to stop the osdmap changes, 
OSDs would come up, get marked up, but at some point go down again 
(but the process is still running) and still stay up in the map. Then 
when I'd unset nodown the cluster would immediately mark 250+ OSDs down 
again and I'd be back where I started.


This morning I went ahead and finished the osd upgrades to mimic to 
remove that variable. I've looked for networking problems but haven't 
found any. 2 of the mons are on the same switch. I've also tried 
combinations of shutting down a mon to see if a single one was the 
problem, but they keep electing no matter the mix of them that are up. 
Part of it feels like a networking problem but I haven't been able to 
find a culprit yet as everything was working normally before starting 
the upgrade. Other than the constant mon elections, yesterday I had 
the cluster 95% healthy 3 or 4 times, but it doesn't last long since 
at some point the OSDs start trying to fail each other through their 
heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 
10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] 
: osd.39 10.1.9.2:6802/168438 reported failed by osd.49 
10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 
10.1.9.13:6801/275806 is reporting failure:1


I'm working on getting things mostly good again with everything on 
mimic and will see if it behaves better.


Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 20
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60


On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It 
might be worth utilizing nodown to stop osds from marking each other 
down and finish the upgrade to be able to set the minimum osd version 
to mimic. Stop the osds in a node, manually mark them down, start 
them back up in mimic. Depending on how bad things are, setting pause 
on the cluster to just finish the upgrade faster might not be a bad 
idea either.


This should be a simple question, have you confirmed that there are 
no networking problems between the MONs while the elections are 
happening?


On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:


Hey Sage,

I've posted the file with my email address for the user. It is
with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
mons are calling for elections about every minute so I let this
run for a few elections and saw this node become the leader a
couple times. Debug logs start around 23:27:30. I had managed to
get about 850/857 osds up, but it seems that within the last 30
min it has all gone bad again due to the OSDs repor

Re: [ceph-users] Mimic upgrade failure

2018-09-09 Thread Kevin Hrpcek
Nothing too crazy for non default settings. Some of those osd settings 
were in place while I was testing recovery speeds and need to be brought 
back closer to defaults. I was setting nodown before but it seems to 
mask the problem. While it's good to stop the osdmap changes, OSDs would 
come up, get marked up, but at some point go down again (but the process 
is still running) and still stay up in the map. Then when I'd unset 
nodown the cluster would immediately mark 250+ OSDs down again and I'd be 
back where I started.


This morning I went ahead and finished the osd upgrades to mimic to 
remove that variable. I've looked for networking problems but haven't 
found any. 2 of the mons are on the same switch. I've also tried 
combinations of shutting down a mon to see if a single one was the 
problem, but they keep electing no matter the mix of them that are up. 
Part of it feels like a networking problem but I haven't been able to 
find a culprit yet as everything was working normally before starting 
the upgrade. Other than the constant mon elections, yesterday I had the 
cluster 95% healthy 3 or 4 times, but it doesn't last long since at some 
point the OSDs start trying to fail each other through their heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 
10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] : 
osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 
10.1.9.13:6801/275806 is reporting failure:1


I'm working on getting things mostly good again with everything on mimic 
and will see if it behaves better.


Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 20
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60


On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It 
might be worth utilizing nodown to stop osds from marking each other 
down and finish the upgrade to be able to set the minimum osd version 
to mimic. Stop the osds in a node, manually mark them down, start them 
back up in mimic. Depending on how bad things are, setting pause on 
the cluster to just finish the upgrade faster might not be a bad idea 
either.


This should be a simple question, have you confirmed that there are no 
networking problems between the MONs while the elections are happening?


On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:


Hey Sage,

I've posted the file with my email address for the user. It is
with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
mons are calling for elections about every minute so I let this
run for a few elections and saw this node become the leader a
couple times. Debug logs start around 23:27:30. I had managed to
get about 850/857 osds up, but it seems that within the last 30
min it has all gone bad again due to the OSDs reporting each other
as failed. We relaxed the osd_heartbeat_interval to 30 and
osd_heartbeat_grace to 60 in an attempt to slow down how quickly
OSDs are trying to fail each other. I'll put in the
rocksdb_cache_size setting.

Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:

Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle.  Can you ceph-post-file the full log from one of your mons
(preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

  rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:


Hello,

I've had a Lumin

Re: [ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek

Hey Sage,

I've posted the file with my email address for the user. It is with 
debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are 
calling for elections about every minute so I let this run for a few 
elections and saw this node become the leader a couple times. Debug logs 
start around 23:27:30. I had managed to get about 850/857 osds up, but 
it seems that within the last 30 min it has all gone bad again due to 
the OSDs reporting each other as failed. We relaxed the 
osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt 
to slow down how quickly OSDs are trying to fail each other. I'll put in 
the rocksdb_cache_size setting.


Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:

Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle.  Can you ceph-post-file the full log from one of your mons
(preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

  rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:


Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck
with almost all pgs down. One problem is that the mons have started to
re-elect a new quorum leader almost every minute. This is making it difficult
to monitor the cluster and even run any commands on it since at least half the
time a ceph command times out or takes over a minute to return results. I've
looked at the debug logs and it appears there is some timeout occurring with
paxos of about a minute. The msg_dispatch thread of the mons is often running
a core at 100% for about a minute(user time, no iowait). Running strace on it
shows the process is going through all of the mon db files (about 6gb in
store.db/*.sst). Does anyone have an idea of what this timeout is or why my
mons are always reelecting? One theory I have is that the msg_dispatch can't
process the SST's fast enough and hits some timeout for a health check and the
mon drops itself from the quorum since it thinks it isn't healthy. I've been
thinking of introducing a new mon to the cluster on hardware with a better cpu
to see if that can process the SSTs within this timeout.

My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, and 11/41 osd
servers on luminous. The original problem started when I restarted the osds on
one of the hosts. The cluster reacted poorly to them going down and went into
a frenzy of taking down other osds and remapping. I eventually got that stable
and the PGs were 90% good with the finish line in sight and then the mons
started their issue of re-electing every minute. Now I can't keep any decent
amount of PGs up for more than a few hours. This started on Wednesday.

Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242
maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds e14242
tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s
seconds
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to
13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700  5 mon.sephmon2@1(leader).paxos(paxos
updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos
updating c 132917556..132918214) trigger_propose not active, will propose
later
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr e2234 tick:
resetting beacon timeouts due to mon delay (slow election?) of 59.8844s
seconds
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734
would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health
check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> -
conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0
l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17
ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health
check_m

[ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek

Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is 
stuck with almost all pgs down. One problem is that the mons have 
started to re-elect a new quorum leader almost every minute. This is 
making it difficult to monitor the cluster and even run any commands on 
it since at least half the time a ceph command times out or takes over a 
minute to return results. I've looked at the debug logs and it appears 
there is some timeout occurring with paxos of about a minute. The 
msg_dispatch thread of the mons is often running a core at 100% for 
about a minute(user time, no iowait). Running strace on it shows the 
process is going through all of the mon db files (about 6gb in 
store.db/*.sst). Does anyone have an idea of what this timeout is or why 
my mons are always reelecting? One theory I have is that the 
msg_dispatch can't process the SST's fast enough and hits some timeout 
for a health check and the mon drops itself from the quorum since it 
thinks it isn't healthy. I've been thinking of introducing a new mon to 
the cluster on hardware with a better cpu to see if that can process the 
SSTs within this timeout.


My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, and 
11/41 osd servers on luminous. The original problem started when I 
restarted the osds on one of the hosts. The cluster reacted poorly to 
them going down and went into a frenzy of taking down other osds and 
remapping. I eventually got that stable and the PGs were 90% good with 
the finish line in sight and then the mons started their issue of 
re-electing every minute. Now I can't keep any decent amount of PGs up 
for more than a few hours. This started on Wednesday.


Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds 
e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds 
e14242 tick: resetting beacon timeouts due to mon delay (slow election?) 
of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim 
trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700  5 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr e2234 
tick: resetting beacon timeouts due to mon delay (slow election?) of 
59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 
1734 would only trim 221 < paxos_service_trim_min 250

2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health 
check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> - 
conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 
l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 
ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health 
check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health 
check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim 
trim_to 1715 would only trim 181 < paxos_service_trim_min 250

2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 
sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 
get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed 
timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700  1 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
accept timeout, calling fresh election

2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 
sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 
unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 

Re: [ceph-users] separate monitoring node

2018-06-19 Thread Kevin Hrpcek
I use icinga2 as well with a check_ceph.py that I wrote a couple years 
ago. The method I use is that icinga2 runs the check from the icinga2 
host itself. ceph-common is installed on the icinga2 host since the 
check_ceph script is a wrapper and parser for the ceph command output 
using python's subprocess. The script takes a conf, id, and keyring 
argument so it acts like a ceph client and only the conf and keyring 
need to be present. I added a cephx user for the icinga checks. I also 
use icinga2, nrpe, and check_proc to check that the correct number of 
osd/mon/mgr/mds processes are running on a host.


# ceph auth get client.icinga
exported keyring for client.icinga
[client.icinga]
    key = 
    caps mgr = "allow r"
    caps mon = "allow r"
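
For illustration, a stripped-down sketch of that kind of wrapper is below. This is
not the actual check_ceph.py, just the general idea: run the ceph CLI as the icinga
client via subprocess and map the health status to Nagios/Icinga exit codes. The
conf/keyring paths and the client id are example values.

import json
import subprocess
import sys

CONF = "/etc/ceph/ceph.conf"                         # example path
KEYRING = "/etc/ceph/ceph.client.icinga.keyring"     # example path
CLIENT_ID = "icinga"

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3          # Nagios/Icinga exit codes

def main():
    try:
        out = subprocess.check_output(
            ["ceph", "--conf", CONF, "--id", CLIENT_ID, "--keyring", KEYRING,
             "status", "--format", "json"])
    except (OSError, subprocess.CalledProcessError) as exc:
        print("UNKNOWN: could not run ceph: %s" % exc)
        return UNKNOWN
    health = json.loads(out)["health"]["status"]
    if health == "HEALTH_OK":
        print("OK: %s" % health)
        return OK
    if health == "HEALTH_WARN":
        print("WARNING: %s" % health)
        return WARNING
    print("CRITICAL: %s" % health)
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())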


I just realized my script on github is the first or second result when 
googling for icinga2 ceph checks so there is a chance you are trying to 
use the same thing as me.


Kevin


On 06/19/2018 07:17 AM, Denny Fuchs wrote:

Hi,

at the moment, we use Icinga2, check_ceph* and Telegraf with the Ceph 
plugin. I'm asking what I need in order to have a separate host which knows 
all about the Ceph cluster health. The reason is that each OSD node 
has mostly the exact same data, which is transmitted into our database 
(like InfluxDB or MySQL) and wastes space. Also, if something is going 
on, we get alerts for each OSD.


So my idea is to have a separate VM (on an external host) and use 
only this host for monitoring the global cluster state and measurements.
Is it correct that I only need to have mon and mgr as services? Or 
should I do monitoring in a different way?


cu denny


[ceph-users] Reweighting causes whole cluster to peer/activate

2018-06-14 Thread Kevin Hrpcek

Hello,

I'm seeing something that seems to be odd behavior when reweighting 
OSDs. I've just upgraded to 12.2.5 and am adding in a new osd server to 
the cluster. I gradually weight the 10TB OSDs into the cluster by doing 
a +1, letting things backfill for a while, then +1 until I reach my 
desired weight. This hasn't been a problem in the past, a proportionate 
amount of PGs would get remapped, peer and activate across this cluster. 
Now on 12.2.5 when I do this, almost all PGs peer and reactivate. 
Sometimes it recovers within a minute, other times it takes longer, this 
last time actually saw some OSDs on the new node crash and caused a 
longer time for peering/activating. Regardless of recovery time, this is 
a fairly violent reaction to reweighting.


Has anyone else seen behavior like this or have any ideas what's going on?

For example...

[root@sephmon1 ~]# ceph -s
  cluster:
    id: bc2a1488-74f8-4d87-b2f6-615ae26bf7c9
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum sephmon1,sephmon2,sephmon3
    mgr: sephmon2(active), standbys: sephmon1, sephmon3
    mds: cephfs1-1/1/1 up  {0=sephmon1=up:active}, 1 up:standby
    osd: 789 osds: 789 up, 789 in

  data:
    pools:   7 pools, 39168 pgs
    objects: 74046k objects, 2989 TB
    usage:   3756 TB used, 1517 TB / 5273 TB avail
    pgs: 39137 active+clean
 26    active+clean+scrubbing+deep
 5 active+clean+scrubbing

  io:
    client:   3522 MB/s rd, 118 MB/s wr, 1295 op/s rd, 833 op/s wr

[root@sephmon1 ~]# for i in {771..779}; do ceph osd crush reweight 
osd.${i} 6.5; done

reweighted item id 771 name 'osd.771' to 6.5 in crush map
reweighted item id 772 name 'osd.772' to 6.5 in crush map
reweighted item id 773 name 'osd.773' to 6.5 in crush map
reweighted item id 774 name 'osd.774' to 6.5 in crush map
reweighted item id 775 name 'osd.775' to 6.5 in crush map
reweighted item id 776 name 'osd.776' to 6.5 in crush map
reweighted item id 777 name 'osd.777' to 6.5 in crush map
reweighted item id 778 name 'osd.778' to 6.5 in crush map
reweighted item id 779 name 'osd.779' to 6.5 in crush map
[root@sephmon1 ~]# ceph -s
  cluster:
    id: bc2a1488-74f8-4d87-b2f6-615ae26bf7c9
    health: HEALTH_WARN
    2 osds down
    78219/355096089 objects misplaced (0.022%)
    Reduced data availability: 668 pgs inactive, 1920 pgs down, 
551 pgs peering, 29 pgs incomplete
    Degraded data redundancy: 803425/355096089 objects degraded 
(0.226%), 204 pgs degraded

    3 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum sephmon1,sephmon2,sephmon3
    mgr: sephmon2(active), standbys: sephmon1, sephmon3
    mds: cephfs1-1/1/1 up  {0=sephmon1=up:active}, 1 up:standby
    osd: 789 osds: 787 up, 789 in; 257 remapped pgs

  data:
    pools:   7 pools, 39168 pgs
    objects: 73964k objects, 2985 TB
    usage:   3756 TB used, 1517 TB / 5273 TB avail
    pgs:     0.028% pgs unknown
             94.904% pgs not active
             803425/355096089 objects degraded (0.226%)
             78219/355096089 objects misplaced (0.022%)
             20215 peering
             14335 activating
             1882  active+clean
             1788  down
             205   remapped+peering
             167   stale+peering
             142   activating+undersized+degraded
             127   activating+undersized
             126   stale+down
             57    active+undersized+degraded
             39    stale+active+clean
             27    incomplete
             17    activating+remapped
             11    unknown
             7     stale+activating
             6     down+remapped
             3     stale+activating+undersized
             2     stale+incomplete
             2     active+undersized
             2     stale+activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     stale+active+undersized+degraded
             1     remapped
             1     active+clean+scrubbing
             1     active+undersized+degraded+remapped+backfill_wait
             1     stale+remapped+peering
             1     active+clean+remapped
             1     active+remapped+backfilling

  io:
    client:   3896 GB/s rd, 339 GB/s wr, 8004 kop/s rd, 320 kop/s wr
    recovery: 726 GB/s, 11172 objects/s


Thanks,
Kevin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados python pool alignment size write failures

2018-04-03 Thread Kevin Hrpcek
Thanks for the input Greg, we've submitted the patch to the ceph github 
repo https://github.com/ceph/ceph/pull/21222


Kevin

On 04/02/2018 01:10 PM, Gregory Farnum wrote:
On Mon, Apr 2, 2018 at 8:21 AM Kevin Hrpcek
<kevin.hrp...@ssec.wisc.edu> wrote:


Hello,

We use python librados bindings for object operations on our
cluster. For a long time we've been using 2 ec pools with k=4 m=1
and a fixed 4MB read/write size with the python bindings. During
preparations for migrating all of our data to a k=6 m=2 pool we've
discovered that ec pool alignment size is dynamic and the librados
bindings for python and go fail to write objects because they are
not aware of the the pool alignment size and therefore cannot
adjust the write block size to be a multiple of that. The ec pool
alignment size seems to be (k value * 4K) on new pools, but is
only 4K on old pools from the hammer days. We haven't been able to
find much useful documentation for this pool alignment setting
other than the librados docs
(http://docs.ceph.com/docs/master/rados/api/librados)
rados_ioctx_pool_requires_alignment,
rados_ioctx_pool_requires_alignment2,
rados_ioctx_pool_required_alignment,
rados_ioctx_pool_required_alignment2. After going through the
rados binary source we found that the binary is rounding the write
op size for an ec pool to a multiple of the pool alignment size
(line ~1945
https://github.com/ceph/ceph/blob/master/src/tools/rados/rados.cc#L1945).
The min write op size can be figured out by writing to an ec pool
like this to get the binary to round up and print it out `rados -b
1k -p $pool put .`. All of the support for being alignment
aware is obviously there but simply isn't exposed in the
bindings; we've only tested python and go.

We've gone ahead and submitted a patch and pull request to the
pycradox project which seems to be what was merged into the ceph
project for python bindings
https://github.com/sileht/pycradox/pull/4. It replicates getting
the alignment size of the pool in the python bindings so that we
can then calculate the proper op sizes for writing to a pool

We find it hard to believe that we're the only ones to have run
into this problem when using the bindings. Have we missed
something obvious for cluster configuration? Or maybe we're just
doing things differently compared to most users... Any insight would
be appreciated as we'd prefer to use an official solution rather
than our bindings fix for long term use.


It's not impossible you're the only user both using the python 
bindings and targeting EC pools. Even now with overwrites they're 
limited in terms of object class and omap support, and I think all the 
direct-access users I've heard about required at least one of omap or 
overwrites.


Just submit the patch to the Ceph github repo and it'll get fixed up! :)
-Greg


Tested on Luminous 12.2.2 and 12.2.4.

Thanks,
Kevin

-- 
Kevin Hrpcek

Linux Systems Administrator
NASA SNPP Atmospheric SIPS
Space Science & Engineering Center
University of Wisconsin-Madison




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados python pool alignment size write failures

2018-04-02 Thread Kevin Hrpcek

Hello,

We use python librados bindings for object operations on our cluster. 
For a long time we've been using 2 ec pools with k=4 m=1 and a fixed 4MB 
read/write size with the python bindings. During preparations for 
migrating all of our data to a k=6 m=2 pool we've discovered that ec 
pool alignment size is dynamic and the librados bindings for python and 
go fail to write objects because they are not aware of the pool
alignment size and therefore cannot adjust the write block size to be a 
multiple of that. The ec pool alignment size seems to be (k value * 4K) 
on new pools, but is only 4K on old pools from the hammer days. We 
haven't been able to find much useful documentation for this pool 
alignment setting other than the librados docs 
(http://docs.ceph.com/docs/master/rados/api/librados) 
rados_ioctx_pool_requires_alignment,
rados_ioctx_pool_requires_alignment2, 
rados_ioctx_pool_required_alignment, 
rados_ioctx_pool_required_alignment2. After going through the rados 
binary source we found that the binary is rounding the write op size for 
an ec pool to a multiple of the pool alignment size (line ~1945 
https://github.com/ceph/ceph/blob/master/src/tools/rados/rados.cc#L1945). 
The min write op size can be figured out by writing to an ec pool like 
this to get the binary to round up and print it out `rados -b 1k -p 
$pool put .`. All of the support for being alignment aware is
obviously there but simply isn't exposed in the bindings; we've
only tested python and go.


We've gone ahead and submitted a patch and pull request to the pycradox 
project which seems to be what was merged into the ceph project for 
python bindings https://github.com/sileht/pycradox/pull/4. It replicates 
getting the alignment size of the pool in the python bindings so that we 
can then calculate the proper op sizes for writing to a pool
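
To make that op-size calculation concrete, here is a minimal sketch (not
the submitted patch itself). The 24K alignment is just the k=6 figure
implied above (k * 4K), hard-coded because the current bindings can't
query it, and 'ecpool' and the object name are made-up placeholders:

    import rados

    ALIGNMENT = 6 * 4096          # assumed alignment of a k=6 EC pool (k * 4K)
    DESIRED   = 4 * 1024 * 1024   # the ~4MB write size we want to use

    def aligned_op_size(desired, alignment):
        # round the write block size up to a multiple of the pool
        # alignment, mirroring what the rados binary does for EC pools
        if alignment == 0:
            return desired
        return ((desired + alignment - 1) // alignment) * alignment

    def put_object(ioctx, name, data, alignment=ALIGNMENT):
        # write aligned blocks at aligned offsets; only the tail may be short
        block = aligned_op_size(DESIRED, alignment)
        for off in range(0, len(data), block):
            ioctx.write(name, data[off:off + block], off)

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('ecpool')
    put_object(ioctx, 'test-object', b'x' * (10 * 1024 * 1024))
    ioctx.close()
    cluster.shutdown()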


We find it hard to believe that we're the only ones to have run into 
this problem when using the bindings. Have we missed something obvious 
for cluster configuration? Or maybe we're just doing things differently
compared to most users... Any insight would be appreciated as we'd 
prefer to use an official solution rather than our bindings fix for long 
term use.


Tested on Luminous 12.2.2 and 12.2.4.

Thanks,
Kevin

--
Kevin Hrpcek
Linux Systems Administrator
NASA SNPP Atmospheric SIPS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph luminous - throughput performance issue

2018-01-31 Thread Kevin Hrpcek

Steven,

I've recently done some performance testing on Dell hardware. Here are
some of my messy results. I was mainly testing the effects of the R0
stripe sizing on the PERC card. Each disk has its own R0 so that write-back
is enabled. VDs were created like this, but with different stripesizes:
`omconfig storage controller controller=1 action=createvdisk
raid=r0 size=max pdisk=0:0:0 name=sdb readpolicy=ra writepolicy=wb
stripesize=1mb`.
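
That per-disk VD creation is easy to script across an enclosure; a
sketch (the controller/enclosure IDs, slot count, and VD names are
placeholders for whatever omreport shows on your hardware):

    # one single-disk R0 per physical disk so the controller's write-back cache is used
    for slot in $(seq 0 11); do
        omconfig storage controller controller=1 action=createvdisk raid=r0 size=max \
            pdisk=0:0:$slot name=disk$slot readpolicy=ra writepolicy=wb stripesize=1mb
    done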


I have a few generations of PERC cards in my cluster and it seems to me
that a single-disk R0 with at least a 64k stripesize works well. R0 is
better for writes than the non-RAID JBOD option of some PERC cards
because it uses the write-back cache, especially in my situation where
there are no SSD journals in place. The stripesize does make a
difference; larger seems better up to a point for mixed cluster use.
There are a ton of different configurations to test, but I only did a
few focused on writes.


Kevin


R440, PERC H840 with 2 MD1400s attached and 12 10TB NLSAS drives per
MD1400. XFS filestore with a 10GB journal LV on each 10TB disk. Ceph
cluster set up as a single mon/mgr/osd server for testing. These tables
pasted well in my email client; hopefully they stay that way.


rados bench options                 result                                           stripe
120 write -b 4M -t 16               avg 1109 MB/s, avg lat 0.057676                  512 bytes
120 write -b 4M -t 16 --no-cleanup  avg 1098 MB/s, avg lat 0.0582565                 512 bytes
120 seq -t 16                       avg 993 MB/s,  avg lat 0.0634972, avg iops 248   512 bytes
120 rand -t 16                      avg 1089 MB/s, avg lat 0.05789,   avg iops 272   512 bytes
120 write -b 4M -t 16               avg 1012 MB/s, avg lat 0.0631924, avg iops 252   128
120 write -b 4M -t 16 --no-cleanup  avg 923 MB/s,  avg lat 0.069259,  avg iops 230   128
120 seq -t 16                       avg 930 MB/s,  avg lat 0.0678104, avg iops 232   128
120 rand -t 16                      avg 1076 MB/s, avg lat 0.0585474, avg iops 269   128



rados bench options                                   stripe  MB/s     iops   latency
120 write -b 4M -t 16                                 1m      1121.9   272    0.0570402
                                                      64k     1121.84  280    0.0570439
                                                      256k    1122     285    0.0570363
bench 120 write -b 64K -t 16                          256k    909.451  14551  0.00109852
                                                      64k     726.114  11617  0.00137608
                                                      1m      879.748  14075  0.00113562
120 rand -t 16 --run-name seph34                      1m      731      182    0.0863446
120 seq -t 16 --run-name seph34                       1m      587      146    0.10759
120 seq -t 16 --run-name seph35                       1m      806      200    0.157       2 hosts same time
120 write -b 4M -t 16 --run-name seph34 --no-cleanup  64k     1179     294    0.10848     2 hosts same time




Another set of testing using an R740xd, PERC H740P, 24 1.2TB 10K SAS.
Filestore and bluestore testing; filestore has a 10GB journal LV. The
cluster is a single-node mon/mgr/osd server. This hardware was being
tested for a small rbd pool, so rbd bench was used.



filestore, rbd bench --io-type write --io-pattern seq --io-total 100G benchmark/img1
(io-size, io-threads, and VD stripe varied per row):

io-size  io-threads  stripe  iops      bytes/s       seconds
8K       16          128k    69972.04  573210992.44  187
8K       32          128k    70382.53  576573665.28  186
8K       16          512k    79604.55  652120481.6   164
8K       32          512k    75002.82  614423091.87  174
8K       16          1m      71811.46  588279455.86  182
8K       32          1m      87000.07  712704574.26  150
4K       16          128k    86682.94  355053334.01  302
4K       32          128k    97065.03  397578370.73  270
4K       16          512k    87254.94  357396223.51  300
4K       32          512k    87607.66  358840973.73  299
4K       16          1m      78349.87  320921084.1   334
4K       32          1m      95970.79  393096346.89  273




 

Re: [ceph-users] Pool shard/stripe settings for file too large files?

2017-11-09 Thread Kevin Hrpcek

Marc,

If you're running luminous you may need to increase osd_max_object_size. 
This snippet is from the Luminous change log.


"The default maximum size for a single RADOS object has been reduced 
from 100GB to 128MB. The 100GB limit was completely impractical in 
practice while the 128MB limit is a bit high but not unreasonable. If 
you have an application written directly to librados that is using 
objects larger than 128MB you may need to adjust osd_max_object_size"
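
For reference, raising that limit is just a config change on the OSDs; a
sketch (the 4 GiB value is only an example to comfortably cover the 2G
object below, and very large RADOS objects are still discouraged):

    [osd]
    # example value, 4 GiB; the Luminous default is 128 MiB
    osd max object size = 4294967296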


Kevin

On 11/09/2017 02:01 PM, Marc Roos wrote:
  
I would like store objects with


rados -p ec32 put test2G.img test2G.img

error putting ec32/test2G.img: (27) File too large

Changing the pool application from custom to rgw did not help









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-06 Thread Kevin Hrpcek
An update for the list archive and if people have similar issues in the 
future.


My cluster took about 18 hours after re-setting noup for all of the OSDs
to get to the current epoch. In the end there were 5 that took a few
hours longer than the others. Other small issues came up during the
process, such as ceph logs filling up /var and memory/swap filling up,
which probably caused this all to take longer than it should have. Simply
restarting the OSDs when memory/swap was filling up allowed them to
catch up faster. The daemons probably generated a bit under 1TB of logs
throughout the whole process, so /var got expanded.
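
One way to watch that catch-up (a sketch, assuming jq and the default
admin socket paths; each OSD's admin socket reports its newest_map,
which can be compared to the current cluster epoch):

    epoch=$(ceph osd dump -f json | jq .epoch)
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        newest=$(ceph daemon "$sock" status | jq .newest_map)
        echo "$sock: newest_map=$newest (cluster epoch $epoch)"
    done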


Once the OSDs all had the current epoch I unset noup and let the cluster
peer/activate PGs. This took another ~6 hours and was likely slowed by
some of the oldest undersized OSD servers not having enough cpu/memory to
handle it. Throughout the peering/activating I periodically unset nodown
briefly as a way to see if there were OSDs having problems, and then
addressed those.


In the end everything came back, the cluster is healthy, and there are
no remaining PG problems. How the reweight triggered a problem this
severe is still unknown.


A couple of takeaways:
- CPU and memory may not be highly utilized in daily operations but are
very important for large recovery operations. A bit more memory and
cores would probably have saved hours of recovery time and may have
prevented my problem altogether.
- Slowing the map changes by quickly setting nodown,noout,noup when
everything is already down helps as well.
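
The flag juggling described above boils down to something like this (a
sketch of the commands, not a transcript of what was actually run):

    # freeze map churn while everything is down
    ceph osd set noup; ceph osd set nodown; ceph osd set noout
    # keep recovery/backfill load off while OSDs catch up and peer
    ceph osd set nobackfill; ceph osd set norecover
    # once all OSDs are on the current osdmap epoch, let them come up and peer
    ceph osd unset noup
    # when peering/activating has settled, re-enable the rest
    ceph osd unset nodown; ceph osd unset noout
    ceph osd unset nobackfill; ceph osd unset norecover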


Sage, thanks again for your input and advice.

Kevin



On 11/04/2017 11:54 PM, Sage Weil wrote:

On Sat, 4 Nov 2017, Kevin Hrpcek wrote:

Hey Sage,

Thanks for getting back to me this late on a weekend.

Do you know why the OSDs were going down?  Are there any crash dumps in the
osd logs, or is the OOM killer getting them?

That's a part I can't nail down yet. OSDs didn't crash; after the
reweight-by-utilization, OSDs on some of our earlier gen servers started
spinning at 100% cpu and were overwhelmed. Admittedly these early gen osd
servers are undersized on cpu, which is probably why they got
overwhelmed, but it hasn't escalated like this before. Heartbeats among
the cluster's OSDs started failing on those OSDs first, and then the osd
100% cpu problem seemed to snowball to all hosts. I'm still trying to
figure out why the relatively small reweighting caused this problem.

The usual strategy here is to set 'noup' and get all of the OSDs to catch
up on osdmaps (you can check progress via the above status command).  Once
they are all caught up, unset noup and let them all peer at once.

I tried having noup set for a few hours earlier to see if stopping the
moving osdmap target would help, but I eventually unset it while doing
more troubleshooting. I'll set it again and let it go overnight. Patience
is probably needed with a cluster this size. I saw a similar situation
and was trying your previous solution:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040030.html


The problem that has come up here in the past is when the cluster has been
unhealthy for a long time and the past intervals use too much memory.  I
don't see anything in your description about memory usage, though.  If
that does rear its head there's a patch we can apply to kraken to work
around it (this is fixed in luminous).

Memory usage doesn't seem too bad, a little tight on some of those early
gen servers, but I haven't seen OOM killing things off yet. I think I saw
mention of that patch and luminous handling this type of situation better
while googling the issue... larger osdmap increments or something
similar, if I recall correctly. My cluster is a few weeks away from a
luminous upgrade.

That's good.  You might also try setting nobackfill and norecover just to
keep the load off the cluster while it's peering.

s


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek

Hey Sage,

Thanks for getting back to me this late on a weekend.

Do you know why the OSDs were going down?  Are there any crash dumps in the
osd logs, or is the OOM killer getting them?
That's a part I can't nail down yet. OSDs didn't crash; after the
reweight-by-utilization, OSDs on some of our earlier gen servers started
spinning at 100% cpu and were overwhelmed. Admittedly these early gen osd
servers are undersized on cpu, which is probably why they got
overwhelmed, but it hasn't escalated like this before. Heartbeats among
the cluster's OSDs started failing on those OSDs first, and then the osd
100% cpu problem seemed to snowball to all hosts. I'm still trying to
figure out why the relatively small reweighting caused this problem.

The usual strategy here is to set 'noup' and get all of the OSDs to catch
up on osdmaps (you can check progress via the above status command).  Once
they are all caught up, unset noup and let them all peer at once.
I tried having noup set for a few hours earlier to see if stopping the
moving osdmap target would help, but I eventually unset it while doing
more troubleshooting. I'll set it again and let it go overnight.
Patience is probably needed with a cluster this size. I saw a similar
situation and was trying your previous solution:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040030.html


The problem that has come up here in the past is when the cluster has been
unhealthy for a long time and the past intervals use too much memory.  I
don't see anything in your description about memory usage, though.  If
that does rear its head there's a patch we can apply to kraken to work
around it (this is fixed in luminous).
Memory usage doesn't seem too bad, a little tight on some of those early 
gen servers, but I haven't seen OOM killing things off yet. I think I 
saw mention of that patch and luminous handling this type of situation 
better while googling the issue... larger osdmap increments or something
similar, if I recall correctly. My cluster is a few weeks away from a
luminous upgrade.


Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster Down from reweight-by-utilization

2017-11-04 Thread Kevin Hrpcek
ed
 742 stale+down
 691 stale+peering
 494 remapped+peering
 397 active+clean
 347 stale+active+clean
 187 peering
  58 stale+activating+undersized+degraded
  36 stale+active+undersized+degraded
  14 stale+remapped
  12 activating+undersized+degraded
  11 stale+activating+undersized+degraded+remapped
   7 remapped
   6 stale
   5 activating
   4 active+remapped
   3 active+undersized+degraded+remapped
   2 stale+activating+remapped
   2 stale+active+remapped+backfill_wait
   2 stale+activating
   1 stale+active+clean+scrubbing
   1 active+recovering+undersized+degraded
   1 stale+active+remapped+backfilling
   1 inactive
   1 active+clean+scrubbing
   1 stale+active+clean+scrubbing+deep
   1 active+undersized+degraded+remapped+backfilling


--
Kevin Hrpcek
Linux Systems Administrator
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com