[ceph-users] Mon crashes virtual void LogMonitor::update_from_paxos(bool*)

2020-01-15 Thread Kevin Hrpcek
Hey all,

One of my mons has been having a rough time for the last day or so. It started 
with a crash and restart that I didn't notice about a day ago, and now it won't 
start at all. Where it crashes has changed over time, but it is now stuck on the 
last error below. I've tried to get more information out of it with debug 
logging and gdb, but I haven't seen anything that makes the root cause of this 
obvious.

Right now it is crashing at line 103 in 
https://github.com/ceph/ceph/blob/mimic/src/mon/LogMonitor.cc#L103, which is 
part of the mon preinit step. As best I can tell right now, it is having a 
problem with a map version. I'm considering rebuilding the mon's store, 
though I don't see any clear signs of corruption.
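
If it comes to that, the rebuild I have in mind is just the usual wipe-and-resync from the surviving quorum; a rough sketch (host name and paths are mine, adjust as needed):

  systemctl stop ceph-mon@sephmon5
  mv /var/lib/ceph/mon/ceph-sephmon5 /var/lib/ceph/mon/ceph-sephmon5.bak
  ceph mon getmap -o /tmp/monmap              # run these two from a healthy mon/admin node
  ceph auth get mon. -o /tmp/mon.keyring
  ceph-mon -i sephmon5 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  chown -R ceph:ceph /var/lib/ceph/mon/ceph-sephmon5
  systemctl start ceph-mon@sephmon5           # the empty mon should then sync its store from the peers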

It bails at assert(err == 0);

  // walk through incrementals
  while (version > summary.version) {
bufferlist bl;
int err = get_version(summary.version+1, bl);
assert(err == 0);
assert(bl.length());
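
For what it's worth, with the mon stopped one can peek at what the store actually holds for the logm service and see whether there is a hole in the version range that get_version() trips over; a hedged sketch (assumes the default mon data dir and a rocksdb-backed store):

  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-sephmon5/store.db get logm first_committed
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-sephmon5/store.db get logm last_committed
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-sephmon5/store.db list logm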

Has anyone seen similar or have any ideas?

ceph 13.2.8

Thanks!
Kevin


The first crash/restart

Jan 14 20:47:11 sephmon5 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc:
 In function 'bool Monitor::_scrub(ScrubResult*, std::pair<std::string, std::string>*, int*)' thread 
7f5b54680700 time 2020-01-14 20:47:11.618368
Jan 14 20:47:11 sephmon5 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/Monitor.cc:
 5225: FAILED assert(err == 0)
Jan 14 20:47:11 sephmon5 ceph-mon: ceph version 13.2.8 
(5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
Jan 14 20:47:11 sephmon5 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x14b) [0x7f5b6440b87b]
Jan 14 20:47:11 sephmon5 ceph-mon: 2: (()+0x26fa07) [0x7f5b6440ba07]
Jan 14 20:47:11 sephmon5 ceph-mon: 3: (Monitor::_scrub(ScrubResult*, 
std::pair<std::string, std::string>*, int*)+0xfa6) [0x55c3230a1896]
Jan 14 20:47:11 sephmon5 ceph-mon: 4: 
(Monitor::handle_scrub(boost::intrusive_ptr<MonOpRequest>)+0x25e) 
[0x55c3230aa01e]
Jan 14 20:47:11 sephmon5 ceph-mon: 5: 
(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xcaf) 
[0x55c3230c73ff]
Jan 14 20:47:11 sephmon5 ceph-mon: 6: (Monitor::_ms_dispatch(Message*)+0x732) 
[0x55c3230c8152]
Jan 14 20:47:11 sephmon5 ceph-mon: 7: (Monitor::ms_dispatch(Message*)+0x23) 
[0x55c3230edcc3]
Jan 14 20:47:11 sephmon5 ceph-mon: 8: (DispatchQueue::entry()+0xb7a) 
[0x7f5b644ca24a]
Jan 14 20:47:11 sephmon5 ceph-mon: 9: 
(DispatchQueue::DispatchThread::entry()+0xd) [0x7f5b645684bd]
Jan 14 20:47:11 sephmon5 ceph-mon: 10: (()+0x7e65) [0x7f5b63749e65]
Jan 14 20:47:11 sephmon5 ceph-mon: 11: (clone()+0x6d) [0x7f5b6025d88d]

Then a couple more crashes/restarts about 11 hours later with this trace

-10001> 2020-01-15 09:36:35.796 7f9600fc7700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc:
 In function 'void LogMonitor::_create_sub_incremental(MLog*, int, version_t)' 
thread 7f9600fc7700 time 2020-01-15 09:36:35.796354
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8/rpm/el7/BUILD/ceph-13.2.8/src/mon/LogMonitor.cc:
 673: FAILED assert(err == 0)

 ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14b) [0x7f9610d5287b]
 2: (()+0x26fa07) [0x7f9610d52a07]
 3: (LogMonitor::_create_sub_incremental(MLog*, int, unsigned long)+0xb54) 
[0x55aeb09e2f94]
 4: (LogMonitor::check_sub(Subscription*)+0x506) [0x55aeb09e3806]
 5: (Monitor::handle_subscribe(boost::intrusive_ptr<MonOpRequest>)+0x10ed) 
[0x55aeb098973d]
 6: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x3cd) 
[0x55aeb09b0b1d]
 7: (Monitor::_ms_dispatch(Message*)+0x732) [0x55aeb09b2152]
 8: (Monitor::ms_dispatch(Message*)+0x23) [0x55aeb09d7cc3]
 9: (DispatchQueue::entry()+0xb7a) [0x7f9610e1124a]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9610eaf4bd]
 11: (()+0x7e65) [0x7f9610090e65]
 12: (clone()+0x6d) [0x7f960cba488d]

-10001> 2020-01-15 09:36:35.797 7f95fffc5700  1 -- 10.1.9.205:6789/0 >> - 
conn(0x55aec5dd0600 :6789 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection 
sd=47 -
-10001> 2020-01-15 09:36:35.798 7f9600fc7700 -1 *** Caught signal (Aborted) **
 in thread 7f9600fc7700 thread_name:ms_dispatch


And now the mon no longer starts with this trace

  -261> 2020-01-15 16:36:46.084 7f0946674a00 10 
mon.sephmon5@-1(probing).paxosservice(logm 0..86521000) refresh
  -261> 2020-01-15 16:36:46.084 7f0946674a00 10 mon.sephmon5@-1(probing).log 
v86521000 update_from_paxos
  -261> 2020-01-15 16:36:46.084 7

[ceph-users] January Ceph Science Group Virtual Meeting

2020-01-13 Thread Kevin Hrpcek
Hello,

We will be having a Ceph science/research/big cluster call on Wednesday January 
22nd. If anyone wants to discuss something specific they can add it to the pad 
linked below. If you have questions or comments you can contact me.

This is an informal open call of community members mostly from hpc/htc/research 
environments where we discuss whatever is on our minds regarding ceph. Updates, 
outages, features, maintenance, etc...there is no set presenter but I do 
attempt to keep the conversation lively.

https://pad.ceph.com/p/Ceph_Science_User_Group_20200122

Ceph calendar event details:

January 22, 2020
9am US Central
4pm Central European

We try to keep it to an hour or less.

Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111


Kevin


--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
Hello,

This Wednesday we'll have a ceph science user group call. This is an informal 
conversation focused on using ceph in htc/hpc and scientific research 
environments.

Call details copied from the event:

Wednesday October 23rd
14:00 UTC
4:00PM Central European
10:00AM Eastern American

Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


Re: [ceph-users] slow ops for mon slowly increasing

2019-09-20 Thread Kevin Olbrich
OK, it looks like clock skew was the problem. I thought it was caused by the
reboot, but it did not fix itself after a few minutes (mon3 was 6 seconds
ahead).
After forcing a time sync from the same server, it seems to be resolved now.
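
For anyone who hits the same thing, a quick way to confirm mon clock skew before and after the fix; only a sketch, and the exact output varies a bit by release:

  ceph time-sync-status              # per-mon skew/latency as seen by the leader
  ceph health detail | grep -i skew
  chronyc makestep                   # on the affected mon host, assuming chronyd is the time source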

Kevin

On Fri, 20 Sep 2019 at 07:33, Kevin Olbrich wrote:

> Hi!
>
> Today some OSDs went down, a temporary problem that was solved easily.
> The mimic cluster is working and all OSDs are complete, all active+clean.
>
> Completely new for me is this:
> > 25 slow ops, oldest one blocked for 219 sec, mon.mon03 has slow ops
>
> The cluster itself looks fine, monitoring for the VMs that use RBD are
> fine.
>
> I thought this might be https://tracker.ceph.com/issues/24531, but I've
> restarted the mon service (and the node as a whole) and neither helped.
> The slow ops slowly increase.
>
> Example:
>
> {
> "description": "auth(proto 0 30 bytes epoch 0)",
> "initiated_at": "2019-09-20 05:31:52.295858",
> "age": 7.851164,
> "duration": 7.900068,
> "type_data": {
> "events": [
> {
> "time": "2019-09-20 05:31:52.295858",
> "event": "initiated"
> },
> {
> "time": "2019-09-20 05:31:52.295858",
> "event": "header_read"
> },
> {
> "time": "2019-09-20 05:31:52.295864",
> "event": "throttled"
> },
> {
> "time": "2019-09-20 05:31:52.295875",
> "event": "all_read"
> },
> {
> "time": "2019-09-20 05:31:52.296075",
> "event": "dispatched"
> },
> {
> "time": "2019-09-20 05:31:52.296089",
> "event": "mon:_ms_dispatch"
> },
> {
> "time": "2019-09-20 05:31:52.296097",
> "event": "mon:dispatch_op"
> },
> {
> "time": "2019-09-20 05:31:52.296098",
> "event": "psvc:dispatch"
> },
> {
> "time": "2019-09-20 05:31:52.296172",
> "event": "auth:wait_for_readable"
> },
> {
> "time": "2019-09-20 05:31:52.296177",
> "event": "auth:wait_for_readable/paxos"
> },
> {
> "time": "2019-09-20 05:31:52.296232",
> "event": "paxos:wait_for_readable"
> }
> ],
> "info": {
> "seq": 1708,
> "src_is_mon": false,
> "source": "client.?
> [fd91:462b:4243:47e::1:3]:0/2365414961",
> "forwarded_to_leader": false
> }
> }
> },
> {
> "description": "auth(proto 0 30 bytes epoch 0)",
> "initiated_at": "2019-09-20 05:31:52.314892",
> "age": 7.832131,
> "duration": 7.881230,
> "type_data": {
> "events": [
> {
> "time": "2019-09-20 05:31:52.314892",
> "event": "initiated"
> },
> {
> "time": "2019-09-20 05:31:52.314892",
> "event": "header_read"
> },
> {
> "time": "2019-09-20 05:31:52.3

[ceph-users] slow ops for mon slowly increasing

2019-09-19 Thread Kevin Olbrich
"event": "mon:dispatch_op"
},
{
"time": "2019-09-20 05:31:52.315083",
    "event": "psvc:dispatch"
},
{
"time": "2019-09-20 05:31:52.315161",
"event": "auth:wait_for_readable"
},
{
"time": "2019-09-20 05:31:52.315167",
"event": "auth:wait_for_readable/paxos"
},
{
"time": "2019-09-20 05:31:52.315230",
"event": "paxos:wait_for_readable"
}
],
"info": {
"seq": 1709,
"src_is_mon": false,
"source": "client.?
[fd91:462b:4243:47e::1:3]:0/997594187",
"forwarded_to_leader": false
}
}
}



This is a new situation for me. What am I supposed to do in this case?

Thank you!

Kind regards
Kevin


Re: [ceph-users] Ceph Scientific Computing User Group

2019-08-27 Thread Kevin Hrpcek
The first ceph + htc/hpc/science virtual user group meeting is tomorrow 
Wednesday August 28th at 10:30am us eastern/4:30pm eu central time. Duration 
will be kept to <= 1 hour.

I'd like this to be conducted as a user group and not only one person 
talking/presenting. For this first meeting I'd like to get input from everyone 
on the call regarding what field they are in and how ceph is used as a solution 
for their implementation. We'll see where it goes from there. Use the pad link 
below to get to a url for live meeting notes.

Meeting connection details from the ceph community calendar:

Description: Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.

To join the meeting on a computer or mobile phone: https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.

Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #

Want to test your video connection? https://bluejeans.com/111

Kevin


On 8/2/19 12:08 PM, Mike Perez wrote:
We have scheduled the next meeting on the community calendar for August 28 at 
14:30 UTC. Each meeting will then take place on the last Wednesday of each 
month.

Here's the pad to collect agenda/notes: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

--
Mike Perez (thingee)


On Tue, Jul 23, 2019 at 10:40 AM Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:
Update

We're going to hold off until August for this so we can promote it on the Ceph 
twitter with more notice. Sorry for the inconvenience if you were planning on 
the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for 
updates.

Kevin


On 7/5/19 11:15 PM, Kevin Hrpcek wrote:
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin




Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I change the crush weights. My 4-second sleep doesn't let peering finish for 
each one before continuing. I'd test with some small steps to get an idea of 
how much data remaps when increasing the weight by $x. I've found my cluster is 
comfortable with +1 increases... also, it takes a while to get to a weight of 11 
if I do anything smaller.

for i in {264..311}; do ceph osd crush reweight osd.${i} 11.0;sleep 4;done
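
A slightly more cautious variant of the same loop (just a sketch, not what I actually run) waits for peering to settle before the next step:

for i in {264..311}; do
  ceph osd crush reweight osd.${i} 11.0
  # crude settle check: wait until 'ceph -s' no longer reports peering/activating pgs
  while ceph -s | grep -qE 'peering|activating'; do sleep 5; done
done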

Kevin

On 7/24/19 12:33 PM, Xavier Trilla wrote:
Hi Kevin,

Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by 
one. What do you change, the crush weight or the reweight? (I guess you change 
the crush weight, am I right?)

Thanks!



On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; 
you can obviously change the weight-increase steps to whatever you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do steps too quickly, but I don't run any mission-critical, 
has-to-be-up-100% workloads, and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am 
at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current >300 OSD cluster, and I would like 
to know how you do it.

Usually we add them quite slowly. Our cluster is a pure SSD/NVMe one and it 
can handle plenty of load, but for the sake of safety (it hosts thousands of 
VMs via RBD) we usually add them one by one, waiting a long time between 
each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.





Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; 
you can obviously change the weight-increase steps to whatever you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do steps too quickly, but I don't run any mission-critical, 
has-to-be-up-100% workloads, and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am 
at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current >300 OSD cluster, and I would like 
to know how you do it.

Usually we add them quite slowly. Our cluster is a pure SSD/NVMe one and it 
can handle plenty of load, but for the sake of safety (it hosts thousands of 
VMs via RBD) we usually add them one by one, waiting a long time between 
each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.





Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-23 Thread Kevin Hrpcek
Update

We're going to hold off until August for this so we can promote it on the Ceph 
twitter with more notice. Sorry for the inconvenience if you were planning on 
the meeting tomorrow. Keep a watch on the list, twitter, or ceph calendar for 
updates.

Kevin


On 7/5/19 11:15 PM, Kevin Hrpcek wrote:
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin





Re: [ceph-users] Ceph Scientific Computing User Group

2019-07-05 Thread Kevin Hrpcek
We've had some positive feedback and will be moving forward with this user 
group. The first virtual user group meeting is planned for July 24th at 4:30pm 
central European time/10:30am American eastern time. We will keep it to an hour 
in length. The plan is to use the ceph bluejeans video conferencing and it will 
be put on the ceph community calendar. I will send out links when it is closer 
to the 24th.

The goal of this user group is to promote conversations and sharing ideas for 
how ceph is used in the the scientific/hpc/htc communities. Please be willing 
to discuss your use cases, cluster configs, problems you've had, shortcomings 
in ceph, etc... Not everyone pays attention to the ceph lists so feel free to 
share the meeting information with others you know that may be interested in 
joining in.

Contact me if you have questions, comments, suggestions, or want to volunteer a 
topic for meetings. I will be brainstorming some conversation starters but it 
would also be interesting to have people give a deep dive into their use of 
ceph and what they have built around it to support the science being done at 
their facility.

Kevin



On 6/17/19 10:43 AM, Kevin Hrpcek wrote:
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin





[ceph-users] Ceph Scientific Computing User Group

2019-06-17 Thread Kevin Hrpcek
Hey all,

At cephalocon some of us who work in scientific computing got together for a 
BoF and had a good conversation. There was some interest in finding a way to 
continue the conversation focused on ceph in scientific computing and htc/hpc 
environments. We are considering putting together monthly video conference user 
group meeting to facilitate sharing thoughts and ideas for this part of the 
ceph community. At cephalocon we mostly had teams present from the EU so I'm 
interested in hearing how much community interest there is in a 
ceph+science/HPC/HTC user group meeting. It will be impossible to pick a time 
that works well for everyone but initially we considered something later in the 
work day for EU countries.

Reply to me if you're interested and please include your timezone.

Kevin


Re: [ceph-users] QEMU/KVM client compatibility

2019-05-28 Thread Kevin Olbrich
On Tue, 28 May 2019 at 10:20, Wido den Hollander wrote:

>
>
> On 5/28/19 10:04 AM, Kevin Olbrich wrote:
> > Hi Wido,
> >
> > thanks for your reply!
> >
> > For CentOS 7, this means I can switch over to the "rpm-nautilus/el7"
> > repository and Qemu uses a nautilus compatible client?
> > I just want to make sure, I understand correctly.
> >
>
> Yes, that is correct. Keep in mind though that you will need to
> Stop/Start the VMs or (Live) Migrate them to a different hypervisor for
> the new packages to be loaded.
>
>
Actually the hosts are Fedora 29, which I need to re-deploy with Fedora 30
to get Nautilus on the clients.
I just wanted to understand how this works. I always reboot the whole
machine after such a large change to make sure it works.

Thank you for your time!
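
As an aside, a rough way to check which librbd/librados a hypervisor and its running VMs actually use; just a sketch, and the package and process names assume an RPM-based host, so they may differ:

  rpm -q librbd1 librados2
  # confirm what a running qemu process has mapped (process name varies:
  # qemu-kvm, qemu-system-x86_64, ...)
  grep -E 'librbd|librados' /proc/$(pidof qemu-kvm | awk '{print $1}')/maps | sort -u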


> Wido
>
> > Thank you very much!
> >
> > Kevin
> >
> > On Tue, 28 May 2019 at 09:46, Wido den Hollander <w...@42on.com> wrote:
> >
> >
> >
> > On 5/28/19 7:52 AM, Kevin Olbrich wrote:
> > > Hi!
> > >
> > > How can I determine which client compatibility level (luminous,
> mimic,
> > > nautilus, etc.) is supported in Qemu/KVM?
> > > Does it depend on the version of ceph packages on the system? Or
> do I
> > > need a recent version Qemu/KVM?
> >
> > This is mainly related to librados and librbd on your system. Qemu
> talks
> > to librbd which then talks to librados.
> >
> > Qemu -> librbd -> librados -> Ceph cluster
> >
> > So make sure you keep the librbd and librados packages updated on
> your
> > hypervisor.
> >
> > When upgrading them make sure you either Stop/Start or Live Migrate
> the
> > VMs to a different hypervisor so the VMs are initiated with the new
> > code.
> >
> > Wido
> >
> > > Which component defines, which client level will be supported?
> > >
> > > Thank you very much!
> > >
> > > Kind regards
> > > Kevin
> > >
> > >
> >
>


Re: [ceph-users] QEMU/KVM client compatibility

2019-05-28 Thread Kevin Olbrich
Hi Wido,

thanks for your reply!

For CentOS 7, this means I can switch over to the "rpm-nautilus/el7"
repository and Qemu will use a Nautilus-compatible client?
I just want to make sure I understand correctly.

Thank you very much!

Kevin

On Tue, 28 May 2019 at 09:46, Wido den Hollander wrote:

>
>
> On 5/28/19 7:52 AM, Kevin Olbrich wrote:
> > Hi!
> >
> > How can I determine which client compatibility level (luminous, mimic,
> > nautilus, etc.) is supported in Qemu/KVM?
> > Does it depend on the version of ceph packages on the system? Or do I
> > need a recent version Qemu/KVM?
>
> This is mainly related to librados and librbd on your system. Qemu talks
> to librbd which then talks to librados.
>
> Qemu -> librbd -> librados -> Ceph cluster
>
> So make sure you keep the librbd and librados packages updated on your
> hypervisor.
>
> When upgrading them make sure you either Stop/Start or Live Migrate the
> VMs to a different hypervisor so the VMs are initiated with the new code.
>
> Wido
>
> > Which component defines, which client level will be supported?
> >
> > Thank you very much!
> >
> > Kind regards
> > Kevin
> >
> >
>


[ceph-users] QEMU/KVM client compatibility

2019-05-27 Thread Kevin Olbrich
Hi!

How can I determine which client compatibility level (Luminous, Mimic,
Nautilus, etc.) is supported in Qemu/KVM?
Does it depend on the version of the ceph packages on the system? Or do I need
a recent version of Qemu/KVM?
Which component determines which client level is supported?

Thank you very much!

Kind regards
Kevin


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh

ok this just gives me:

error getting xattr ec31/10004dfce92./parent: (2) No such file 
or directory


Does this mean that the lost object isn't even a file that appears in 
the ceph directory? Maybe it is a leftover of a file that was not deleted 
properly? In that case it wouldn't be an issue to mark the object as lost.


On 24.05.19 5:08 nachm., Robert LeBlanc wrote:
You need to use the first stripe of the object as that is the only one 
with the metadata.


Try "rados -p ec31 getxattr 10004dfce92. parent" instead.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 4:42 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:


Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d
parent" but this is just hanging forever if we are looking for
unfound objects. It works fine for all other objects.

We also tried scanning the ceph directory with find -inum
1099593404050 (decimal of 10004dfce92) and found nothing. This is
also working for non unfound objects.

Is there another way to find the corresponding file?

On 24.05.19 11:12 vorm., Burkhard Linke wrote:


Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:


We got the object ids of the missing objects with `ceph pg 1.24c list_missing`:

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

We want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a way to
map the object id to the corresponding file?



The object name is composed of the file inode id and the chunk
within the file. The first chunk has some metadata you can use to
retrieve the filename. See the 'CephFS object mapping' thread on
the mailing list for more information.


Regards,

Burkhard





Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh

Hi,

we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" 
but this is just hanging forever if we are looking for unfound objects. 
It works fine for all other objects.


We also tried scanning the ceph directory with find -inum 1099593404050 
(decimal of 10004dfce92) and found nothing. This is also working for non 
unfound objects.


Is there another way to find the corresponding file?
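
For reference, the mapping we are using, as a small sketch (the mount point here is only an example):

  # the object name prefix is the file's inode number in hex
  printf '%d\n' 0x10004dfce92        # -> 1099593404050
  find /path/to/cephfs-mount -inum 1099593404050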

On 24.05.19 11:12 vorm., Burkhard Linke wrote:


Hi,

On 5/24/19 9:48 AM, Kevin Flöh wrote:


We got the object ids of the missing objects with `ceph pg 1.24c list_missing`:

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

We want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a way to
map the object id to the corresponding file?



The object name is composed of the file inode id and the chunk within 
the file. The first chunk has some metadata you can use to retrieve 
the filename. See the 'CephFS object mapping' thread on the mailing 
list for more information.



Regards,

Burkhard





Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Kevin Flöh
We got the object ids of the missing objects with `ceph pg 1.24c list_missing`:

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "10004dfce92.003d",
    "key": "",
    "snapid": -2,
    "hash": 90219084,
    "max": 0,
    "pool": 1,
    "namespace": ""
    },
    "need": "46950'195355",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "36(3)",
    "61(2)"
    ]
    }
    ],
    "more": false
}

We want to give up those objects with:

ceph pg 1.24c mark_unfound_lost revert

But first we would like to know which file(s) are affected. Is there a way to
map the object id to the corresponding file?



On 23.05.19 3:52 nachm., Alexandre Marangone wrote:
The PGs will stay active+recovery_wait+degraded until you solve the 
unfound objects issue.
You can follow this doc to look at which objects are unfound[1]  and 
if no other recourse mark them lost


[1] 
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects. 



On Thu, May 23, 2019 at 5:47 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:


thank you for this idea, it has improved the situation. Nevertheless,
there are still 2 PGs in recovery_wait. ceph -s gives me:

   cluster:
 id: 23e72372-0d44-4cad-b24f-3641b14b86f4
 health: HEALTH_WARN
 3/125481112 objects unfound (0.000%)
 Degraded data redundancy: 3/497011315 objects degraded
(0.000%), 2 pgs degraded

   services:
 mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3 up:standby
 osd: 96 osds: 96 up, 96 in

   data:
 pools:   2 pools, 4096 pgs
 objects: 125.48M objects, 259TiB
 usage:   370TiB used, 154TiB / 524TiB avail
 pgs: 3/497011315 objects degraded (0.000%)
  3/125481112 objects unfound (0.000%)
  4083 active+clean
  10   active+clean+scrubbing+deep
  2    active+recovery_wait+degraded
  1    active+clean+scrubbing

   io:
 client:   318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr


and ceph health detail:

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data
redundancy: 3/497011315 objects degraded (0.000%), 2 p
gs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
 pg 1.24c has 1 unfound objects
 pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded
(0.000%), 2 pgs degraded
 pg 1.24c is active+recovery_wait+degraded, acting
[32,4,61,36], 1
unfound
 pg 1.779 is active+recovery_wait+degraded, acting
[50,4,77,62], 2
unfound


also the status changed form HEALTH_ERR to HEALTH_WARN. We also
did ceph
osd down for all OSDs of the degraded PGs. Do you have any further
suggestions on how to proceed?

On 23.05.19 11:08 vorm., Dan van der Ster wrote:
> I think those osds (1, 11, 21, 32, ...) need a little kick to
re-peer
> their degraded PGs.
>
> Open a window with `watch ceph -s`, then in another window slowly do
>
>      ceph osd down 1
>      # then wait a minute or so for that osd.1 to re-peer fully.
>      ceph osd down 11
>      ...
>
> Continue that for each of the osds with stuck requests, or until
there
> are no more recovery_wait/degraded PGs.
>
> After each `ceph osd down...`, you should expect to see several PGs
> re-peer, and then ideally the slow requests will disappear and the
> degraded PGs will become active+clean.
> If anything else happens, you should stop and let us know.
>
>
> -- dan
>
> On Thu, May 23, 2019 at 10:59 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:
>> This is the current status of ceph:
>>
>>
>>     cluster:
>>       id:     23e72372-0d

Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Kevin Flöh
thank you for this idea, it has improved the situation. Nevertheless, 
there are still 2 PGs in recovery_wait. ceph -s gives me:


  cluster:
    id: 23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_WARN
    3/125481112 objects unfound (0.000%)
    Degraded data redundancy: 3/497011315 objects degraded 
(0.000%), 2 pgs degraded


  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3 
up:standby

    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs: 3/497011315 objects degraded (0.000%)
 3/125481112 objects unfound (0.000%)
 4083 active+clean
 10   active+clean+scrubbing+deep
 2    active+recovery_wait+degraded
 1    active+clean+scrubbing

  io:
    client:   318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr


and ceph health detail:

HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data 
redundancy: 3/497011315 objects degraded (0.000%), 2 p

gs degraded
OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
    pg 1.24c has 1 unfound objects
    pg 1.779 has 2 unfound objects
PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded 
(0.000%), 2 pgs degraded
    pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1 
unfound
    pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2 
unfound



also the status changed form HEALTH_ERR to HEALTH_WARN. We also did ceph 
osd down for all OSDs of the degraded PGs. Do you have any further 
suggestions on how to proceed?
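
For the record, the re-peering kick amounted to something like this sketch, one OSD at a time with a pause in between (OSD ids taken from the acting sets above):

  for osd in 32 4 61 36 50 77 62; do
      ceph osd down $osd
      sleep 60    # give the OSD time to fully re-peer before kicking the next one
  done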


On 23.05.19 11:08 vorm., Dan van der Ster wrote:

I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
their degraded PGs.

Open a window with `watch ceph -s`, then in another window slowly do

 ceph osd down 1
 # then wait a minute or so for that osd.1 to re-peer fully.
 ceph osd down 11
 ...

Continue that for each of the osds with stuck requests, or until there
are no more recovery_wait/degraded PGs.

After each `ceph osd down...`, you should expect to see several PGs
re-peer, and then ideally the slow requests will disappear and the
degraded PGs will become active+clean.
If anything else happens, you should stop and let us know.


-- dan

On Thu, May 23, 2019 at 10:59 AM Kevin Flöh  wrote:

This is the current status of ceph:


cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  9/125481144 objects unfound (0.000%)
  Degraded data redundancy: 9/497011417 objects degraded
(0.000%), 7 pgs degraded
  9 stuck requests are blocked > 4096 sec. Implicated osds
1,11,21,32,43,50,65

services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

data:
  pools:   2 pools, 4096 pgs
  objects: 125.48M objects, 259TiB
  usage:   370TiB used, 154TiB / 524TiB avail
  pgs: 9/497011417 objects degraded (0.000%)
   9/125481144 objects unfound (0.000%)
   4078 active+clean
   11   active+clean+scrubbing+deep
   7active+recovery_wait+degraded

io:
  client:   211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 vorm., Dan van der Ster wrote:

What's the full ceph status?
Normally recovery_wait just means that the relevant osd's are busy
recovering/backfilling another PG.

On Thu, May 23, 2019 at 10:53 AM Kevin Flöh  wrote:

Hi,

we have set the PGs to recover and now they are stuck in 
active+recovery_wait+degraded and instructing them to deep-scrub does not 
change anything. Hence, the rados report is empty. Is there a way to stop the 
recovery wait to start the deep-scrub and get the output? I guess the 
recovery_wait might be caused by missing objects. Do we need to delete them 
first to get the recovery going?

Kevin

On 22.05.19 6:03 nachm., Robert LeBlanc wrote:

On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have
another problem, there are 7 PGs inconsistent and a ceph pg repair is
not doing anything. I just get "instructing pg 1.5dd on osd.24 to
repair" and nothing happens. Does somebody know how we can get the PGs
to repair?

Regards,

Kevin

Kevin,

I just fixed an inconsistent PG yesterday. You will need to figure out why they 
are inconsistent. Do these steps and then we can figure out how to proceed.
1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them)
2. Print out the inconsistent report for each inconsistent PG. `rados 
list-inconsistent-obj  --forma

Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Kevin Flöh

This is the current status of ceph:


  cluster:
    id: 23e72372-0d44-4cad-b24f-3641b14b86f4
    health: HEALTH_ERR
    9/125481144 objects unfound (0.000%)
    Degraded data redundancy: 9/497011417 objects degraded 
(0.000%), 7 pgs degraded
    9 stuck requests are blocked > 4096 sec. Implicated osds 
1,11,21,32,43,50,65


  services:
    mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
    mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
    mds: cephfs-1/1/1 up  {0=ceph-node03.etp.kit.edu=up:active}, 3 
up:standby

    osd: 96 osds: 96 up, 96 in

  data:
    pools:   2 pools, 4096 pgs
    objects: 125.48M objects, 259TiB
    usage:   370TiB used, 154TiB / 524TiB avail
    pgs: 9/497011417 objects degraded (0.000%)
 9/125481144 objects unfound (0.000%)
 4078 active+clean
 11   active+clean+scrubbing+deep
 7    active+recovery_wait+degraded

  io:
    client:   211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr

On 23.05.19 10:54 vorm., Dan van der Ster wrote:

What's the full ceph status?
Normally recovery_wait just means that the relevant osd's are busy
recovering/backfilling another PG.

On Thu, May 23, 2019 at 10:53 AM Kevin Flöh  wrote:

Hi,

we have set the PGs to recover and now they are stuck in 
active+recovery_wait+degraded and instructing them to deep-scrub does not 
change anything. Hence, the rados report is empty. Is there a way to stop the 
recovery wait to start the deep-scrub and get the output? I guess the 
recovery_wait might be caused by missing objects. Do we need to delete them 
first to get the recovery going?

Kevin

On 22.05.19 6:03 nachm., Robert LeBlanc wrote:

On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have
another problem, there are 7 PGs inconsistent and a ceph pg repair is 
not doing anything. I just get "instructing pg 1.5dd on osd.24 to
repair" and nothing happens. Does somebody know how we can get the PGs
to repair?

Regards,

Kevin


Kevin,

I just fixed an inconsistent PG yesterday. You will need to figure out why they 
are inconsistent. Do these steps and then we can figure out how to proceed.
1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of them)
2. Print out the inconsistent report for each inconsistent PG. `rados 
list-inconsistent-obj <pgid> --format=json-pretty`
3. You will want to look at the error messages and see if all the shards have 
the same data.

Robert LeBlanc
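
A hedged sketch of steps 1 and 2 for pool "ec31" (assumes jq is available; only an illustration, not the exact commands used here):

  for pg in $(rados list-inconsistent-pg ec31 | jq -r '.[]'); do
      ceph pg deep-scrub "$pg"
  done
  # once the deep-scrubs have finished:
  for pg in $(rados list-inconsistent-pg ec31 | jq -r '.[]'); do
      rados list-inconsistent-obj "$pg" --format=json-pretty
  done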




Re: [ceph-users] Major ceph disaster

2019-05-23 Thread Kevin Flöh

Hi,

we have set the PGs to recover and now they are stuck in 
active+recovery_wait+degraded and instructing them to deep-scrub does 
not change anything. Hence, the rados report is empty. Is there a way to 
stop the recovery wait to start the deep-scrub and get the output? I 
guess the recovery_wait might be caused by missing objects. Do we need 
to delete them first to get the recovery going?


Kevin

On 22.05.19 6:03 nachm., Robert LeBlanc wrote:
On Wed, May 22, 2019 at 4:31 AM Kevin Flöh <kevin.fl...@kit.edu> wrote:


Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we
have
another problem, there are 7 PGs inconsistent and a ceph pg repair is
not doing anything. I just get "instructing pg 1.5dd on osd.24 to
repair" and nothing happens. Does somebody know how we can get the
PGs
to repair?

Regards,

Kevin


Kevin,

I just fixed an inconsistent PG yesterday. You will need to figure out 
why they are inconsistent. Do these steps and then we can figure out 
how to proceed.
1. Do a deep-scrub on each PG that is inconsistent. (This may fix some 
of them)
2. Print out the inconsistent report for each inconsistent PG. `rados 
list-inconsistent-obj <pgid> --format=json-pretty`
3. You will want to look at the error messages and see if all the 
shards have the same data.


Robert LeBlanc


Re: [ceph-users] Major ceph disaster

2019-05-22 Thread Kevin Flöh

Hi,

thank you, it worked. The PGs are not incomplete anymore. Still we have 
another problem, there are 7 PGs inconsistent and a ceph pg repair is 
not doing anything. I just get "instructing pg 1.5dd on osd.24 to 
repair" and nothing happens. Does somebody know how we can get the PGs 
to repair?


Regards,

Kevin

On 21.05.19 4:52 nachm., Wido den Hollander wrote:


On 5/21/19 4:48 PM, Kevin Flöh wrote:

Hi,

we gave up on the incomplete pgs since we do not have enough complete
shards to restore them. What is the procedure to get rid of these pgs?


You need to start with marking the OSDs as 'lost' and then you can
force_create_pg to get the PGs back (empty).

Wido
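
A hedged sketch of that procedure (osd.4 and osd.23 are the two lost OSDs mentioned later in this thread; the PG id is a placeholder):

  ceph osd lost 4 --yes-i-really-mean-it
  ceph osd lost 23 --yes-i-really-mean-it
  ceph osd force-create-pg <pgid>        # once per incomplete PG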


regards,

Kevin

On 20.05.19 9:22 vorm., Kevin Flöh wrote:

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining
shards of the two pgs but we are only left with two shards (of
reasonable size) per pg. The rest of the shards displayed by ceph pg
query are empty. I guess marking the OSD as complete doesn't make
sense then.

Best,
Kevin

On 17.05.19 2:36 nachm., Frédéric Nass wrote:


Le 14/05/2019 à 10:04, Kevin Flöh a écrit :

On 13.05.19 11:21 nachm., Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those
incomplete PGs?
It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)

yes, but as written in my other mail, we still have enough shards,
at least I think so.


If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.
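
For completeness, the export/import Dan mentions would look roughly like this; only a sketch, run with the source OSD stopped, and the ids/paths are examples (EC shards carry the sN suffix in the pgid):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --pgid 1.24cs1 --op export --file /tmp/1.24cs1.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-99 \
      --op import --file /tmp/1.24cs1.export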

Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on
them, they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on
a healthy OSD) seems to be the only way to recover your lost data, as
with 4 hosts and 2 OSDs lost, you're left with 2 chunks of
data/parity when you actually need 3 to access it. Reducing min_size
to 3 will not help.

Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html


This is probably the best way for you to follow from now on.

Regards,
Frédéric.


If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan

would this let us recover at least some of the data on the pgs? If
not we would just set up a new ceph directly without fixing the old
one and copy whatever is left.

Best regards,

Kevin




On Mon, May 13, 2019 at 4:20 PM Kevin Flöh 
wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster,
let me
first show you the current ceph status:

     cluster:
   id: 23e72372-0d44-4cad-b24f-3641b14b86f4
   health: HEALTH_ERR
   1 MDSs report slow metadata IOs
   1 MDSs report slow requests
   1 MDSs behind on trimming
   1/126319678 objects unfound (0.000%)
   19 scrub errors
   Reduced data availability: 2 pgs inactive, 2 pgs
incomplete
   Possible data damage: 7 pgs inconsistent
   Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
   118 stuck requests are blocked > 4096 sec.
Implicated osds
24,32,91

     services:
   mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
   mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
   mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
   osd: 96 osds: 96 up, 96 in

     data:
   pools:   2 pools, 4096 pgs
   objects: 126.32M objects, 260TiB
   usage:   372TiB used, 152TiB / 524TiB avail
   pgs: 0.049% pgs not active
    1/500333881 objects degraded (0.000%)
    1/126319678 objects unfound (0.000%)
    4076 active+clean
    10   active+clean+scrubbing+deep
    7    active+clean+inconsistent
    2    incomplete
    1    active+recovery_wait+degraded

     io:
   client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow
requests;
1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data dam

Re: [ceph-users] Major ceph disaster

2019-05-21 Thread Kevin Flöh

Hi,

we gave up on the incomplete pgs since we do not have enough complete 
shards to restore them. What is the procedure to get rid of these pgs?


regards,

Kevin

On 20.05.19 9:22 vorm., Kevin Flöh wrote:

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining 
shards of the two pgs but we are only left with two shards (of 
reasonable size) per pg. The rest of the shards displayed by ceph pg 
query are empty. I guess marking the OSD as complete doesn't make 
sense then.


Best,
Kevin

On 17.05.19 2:36 nachm., Frédéric Nass wrote:



Le 14/05/2019 à 10:04, Kevin Flöh a écrit :


On 13.05.19 11:21 nachm., Dan van der Ster wrote:
Presumably the 2 OSDs you marked as lost were hosting those 
incomplete PGs?

It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, 
at least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.


Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on 
them, they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on 
a healthy OSD) seems to be the only way to recover your lost data, as 
with 4 hosts and 2 OSDs lost, you're left with 2 chunks of 
data/parity when you actually need 3 to access it. Reducing min_size 
to 3 will not help.


Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html 

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html 



This is probably the best path for you to follow from now on.

Regards,
Frédéric.



If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.
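Roughly like this (pg id is a placeholder, and the full procedure is in the
CephFS disaster recovery docs, so treat this only as a pointer):

  ceph osd force-create-pg 1.5dd      # recreates the PG empty, its data is lost;
                                      # some releases also ask for --yes-i-really-mean-it
  cephfs-data-scan scan_extents <data pool>
  cephfs-data-scan scan_inodes <data pool>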

-- dan


would this let us recover at least some of the data on the pgs? If 
not we would just set up a new ceph directly without fixing the old 
one and copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  
wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, 
let me

first show you the current ceph status:

    cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  1 MDSs report slow metadata IOs
  1 MDSs report slow requests
  1 MDSs behind on trimming
  1/126319678 objects unfound (0.000%)
  19 scrub errors
  Reduced data availability: 2 pgs inactive, 2 pgs 
incomplete

  Possible data damage: 7 pgs inconsistent
  Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
  118 stuck requests are blocked > 4096 sec. 
Implicated osds

24,32,91

    services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

    data:
  pools:   2 pools, 4096 pgs
  objects: 126.32M objects, 260TiB
  usage:   372TiB used, 152TiB / 524TiB avail
  pgs: 0.049% pgs not active
   1/500333881 objects degraded (0.000%)
   1/126319678 objects unfound (0.000%)
   4076 active+clean
   10   active+clean+scrubbing+deep
   7    active+clean+inconsistent
   2    incomplete
   1    active+recovery_wait+degraded

    io:
  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow 
requests;

1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are 
blocked > 30 sec

MDS_TRIM 1 MDSs behind on trimming
  mdsceph-no

Re: [ceph-users] Major ceph disaster

2019-05-20 Thread Kevin Flöh

Hi Frederic,

we do not have access to the original OSDs. We exported the remaining 
shards of the two pgs but we are only left with two shards (of 
reasonable size) per pg. The rest of the shards displayed by ceph pg 
query are empty. I guess marking the OSD as complete doesn't make sense 
then.


Best,
Kevin

On 17.05.19 2:36 nachm., Frédéric Nass wrote:



On 14/05/2019 at 10:04, Kevin Flöh wrote:


On 13.05.19 11:21 nachm., Dan van der Ster wrote:
Presumably the 2 OSDs you marked as lost were hosting those 
incomplete PGs?

It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, at 
least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.


Hi Kevin,

You want to make sure of this.

Unless you recreated the OSDs 4 and 23 and had new data written on 
them, they should still host the data you need.
What Dan suggested (export the 7 inconsistent PGs and import them on a 
healthy OSD) seems to be the only way to recover your lost data, as 
with 4 hosts and 2 OSDs lost, you're left with 2 chunks of data/parity 
when you actually need 3 to access it. Reducing min_size to 3 will not 
help.


Have a look here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019673.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023736.html 



This is probably the best path for you to follow from now on.

Regards,
Frédéric.



If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan


would this let us recover at least some of the data on the pgs? If 
not we would just set up a new ceph directly without fixing the old 
one and copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me
first show you the current ceph status:

    cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  1 MDSs report slow metadata IOs
  1 MDSs report slow requests
  1 MDSs behind on trimming
  1/126319678 objects unfound (0.000%)
  19 scrub errors
  Reduced data availability: 2 pgs inactive, 2 pgs 
incomplete

  Possible data damage: 7 pgs inconsistent
  Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
  118 stuck requests are blocked > 4096 sec. Implicated 
osds

24,32,91

    services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

    data:
  pools:   2 pools, 4096 pgs
  objects: 126.32M objects, 260TiB
  usage:   372TiB used, 152TiB / 524TiB avail
  pgs: 0.049% pgs not active
   1/500333881 objects degraded (0.000%)
   1/126319678 objects unfound (0.000%)
   4076 active+clean
   10   active+clean+scrubbing+deep
   7    active+clean+inconsistent
   2    incomplete
   1    active+recovery_wait+degraded

    io:
  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow 
requests;

1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are 
blocked > 30 sec

MDS_TRIM 1 MDSs behind on trimming
  mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming 
(46034/128)

max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
  pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
P

Re: [ceph-users] Major ceph disaster

2019-05-17 Thread Kevin Flöh
We tried to export the shards from the OSDs but there are only two 
shards left for each of the pgs, so we decided to give up these pgs. 
Will the files of these pgs be deleted from the mds or do we have to 
delete them manually? Is this the correct command to mark the pgs as lost:


ceph pg {pg-id} mark_unfound_lost revert|delete
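For example, for the pg with the unfound object shown in health detail earlier
in this thread (hedged, I'm only going by that output):

  ceph pg 1.24c mark_unfound_lost delete

"revert" rolls back to an older copy where one exists, while "delete" forgets
the object entirely; as far as I know revert is not available for
erasure-coded pools, so delete is likely the only choice here.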

Cheers,
Kevin

On 15.05.19 8:55 vorm., Kevin Flöh wrote:
The hdds of OSDs 4 and 23 are completely lost, we cannot access them 
in any way. Is it possible to use the shards which are maybe stored on 
working OSDs as shown in the all_participants list?


On 14.05.19 5:24 nachm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 5:13 PM Kevin Flöh  wrote:

OK, so now we see at least a difference in the recovery state:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-14 14:15:15.650517",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-14 14:15:15.243756",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59580",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  },
  {
  "osd": 24,
  "shard": 0
  },
  {
  "osd": 72,
  "shard": 1
  },
  {
  "osd": 79,
  "shard": 3
  }
  ],
  "intervals": [
  {
  "first": "59562",
  "last": "59563",
  "acting": "4(1),24(0),79(3)"
  },
  {
  "first": "59564",
  "last": "59567",
  "acting": "23(2),24(0),79(3)"
  },
  {
  "first": "59570",
  "last": "59574",
  "acting": "4(1),23(2),79(3)"
  },
  {
  "first": "59577",
  "last": "59580",
  "acting": "4(1),23(2),24(0)"
  }
  ]
  }
  ],
  "probing_osds": [
  "2(0)",
  "4(1)",
  "23(2)",
  "24(0)",
  "72(1)",
  "79(3)"
      ],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": []
  },
  {
  "name": "Started",
  "enter_time": "2019-05-14 14:15:15.243663"
  }
  ],

the peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?

What is the state of the hdd's which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import to another operable OSD.

-- dan





On 14.05.19 11:02 vorm., Dan van der Ster wrote:
On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  
wrote:

On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  
wrote:


On 13.05.19 10:51 nachm., Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure 
coding. [...]

Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost a

Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Kevin Flöh

ceph osd pool get ec31 min_size
min_size: 3

On 15.05.19 9:09 vorm., Konstantin Shalygin wrote:

ceph osd pool get ec31 min_size

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Kevin Flöh
The hdds of OSDs 4 and 23 are completely lost, we cannot access them in 
any way. Is it possible to use the shards which are maybe stored on 
working OSDs as shown in the all_participants list?


On 14.05.19 5:24 nachm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 5:13 PM Kevin Flöh  wrote:

OK, so now we see at least a difference in the recovery state:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-14 14:15:15.650517",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-14 14:15:15.243756",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59580",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  },
  {
  "osd": 24,
  "shard": 0
  },
  {
  "osd": 72,
  "shard": 1
  },
  {
  "osd": 79,
  "shard": 3
  }
  ],
  "intervals": [
  {
  "first": "59562",
  "last": "59563",
  "acting": "4(1),24(0),79(3)"
  },
  {
  "first": "59564",
  "last": "59567",
  "acting": "23(2),24(0),79(3)"
  },
  {
  "first": "59570",
  "last": "59574",
  "acting": "4(1),23(2),79(3)"
  },
  {
  "first": "59577",
  "last": "59580",
  "acting": "4(1),23(2),24(0)"
  }
  ]
  }
  ],
  "probing_osds": [
  "2(0)",
  "4(1)",
  "23(2)",
  "24(0)",
  "72(1)",
  "79(3)"
  ],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": []
  },
  {
  "name": "Started",
      "enter_time": "2019-05-14 14:15:15.243663"
  }
  ],

the peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?

What is the state of the hdd's which had osds 4 & 23?
You may be able to use ceph-objectstore-tool to export those PG shards
and import to another operable OSD.

-- dan





On 14.05.19 11:02 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:

On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those

Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Kevin Flöh

Hi,

since we have 3+1 ec I didn't try before. But when I run the command you 
suggested I get the following error:


ceph osd pool set ec31 min_size 2
Error EINVAL: pool min_size must be between 3 and 4
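Which makes sense for a 3+1 erasure-coded pool: min_size can only be set
between k and k+m. To double-check k and m (hedged example, pool name taken
from the output above):

  ceph osd pool get ec31 erasure_code_profile
  ceph osd erasure-code-profile get <profile name from the first command>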

On 14.05.19 6:18 nachm., Konstantin Shalygin wrote:



  peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?



Try to reduce min_size for problem pool as 'health detail' suggested: 
`ceph osd pool set ec31 min_size 2`.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh

OK, so now we see at least a difference in the recovery state:

    "recovery_state": [
    {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-14 14:15:15.650517",
    "comment": "not enough complete instances of this PG"
    },
    {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-14 14:15:15.243756",
    "past_intervals": [
    {
    "first": "49767",
    "last": "59580",
    "all_participants": [
    {
    "osd": 2,
    "shard": 0
    },
    {
    "osd": 4,
    "shard": 1
    },
    {
    "osd": 23,
    "shard": 2
    },
    {
    "osd": 24,
    "shard": 0
    },
    {
    "osd": 72,
    "shard": 1
    },
    {
    "osd": 79,
    "shard": 3
    }
    ],
    "intervals": [
    {
    "first": "59562",
    "last": "59563",
    "acting": "4(1),24(0),79(3)"
    },
    {
    "first": "59564",
    "last": "59567",
    "acting": "23(2),24(0),79(3)"
    },
    {
    "first": "59570",
    "last": "59574",
    "acting": "4(1),23(2),79(3)"
    },
    {
    "first": "59577",
    "last": "59580",
    "acting": "4(1),23(2),24(0)"
    }
    ]
    }
    ],
    "probing_osds": [
    "2(0)",
    "4(1)",
    "23(2)",
    "24(0)",
    "72(1)",
    "79(3)"
    ],
        "down_osds_we_would_probe": [],
    "peering_blocked_by": []
    },
    {
    "name": "Started",
    "enter_time": "2019-05-14 14:15:15.243663"
    }
    ],

the peering does not seem to be blocked anymore. But still there is no 
recovery going on. Is there anything else we can try?



On 14.05.19 11:02 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:59 AM Kevin Flöh  wrote:


On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSD (the ones not fully
recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think that 
we still have enough shards left. For one of the pgs, the recovery state
looks like this:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-09 16:11:48.625966",
  "comment": "n

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh


On 14.05.19 10:08 vorm., Dan van der Ster wrote:

On Tue, May 14, 2019 at 10:02 AM Kevin Flöh  wrote:

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and
therefore we decided to mark the osd as lost and set it up from
scratch. Ceph started recovering and then we lost another osd with
the same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time.
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence
for the data on some pgs using both of those OSD (the ones not fully
recovered before the second failure).

OK, so the 2 OSDs (4,23) failed shortly one after the other but we think
that the recovery of the first was finished before the second failed.
Nonetheless, both problematic pgs have been on both OSDs. We think that
we still have enough shards left. For one of the pgs, the recovery state
looks like this:

  "recovery_state": [
  {
  "name": "Started/Primary/Peering/Incomplete",
  "enter_time": "2019-05-09 16:11:48.625966",
  "comment": "not enough complete instances of this PG"
  },
  {
  "name": "Started/Primary/Peering",
  "enter_time": "2019-05-09 16:11:48.611171",
  "past_intervals": [
  {
  "first": "49767",
  "last": "59313",
  "all_participants": [
  {
  "osd": 2,
  "shard": 0
  },
  {
  "osd": 4,
  "shard": 1
  },
  {
  "osd": 23,
  "shard": 2
  },
  {
  "osd": 24,
  "shard": 0
  },
  {
  "osd": 72,
  "shard": 1
  },
  {
  "osd": 79,
  "shard": 3
  }
  ],
  "intervals": [
  {
  "first": "58860",
  "last": "58861",
  "acting": "4(1),24(0),79(3)"
  },
  {
  "first": "58875",
  "last": "58877",
  "acting": "4(1),23(2),24(0)"
  },
  {
  "first": "59002",
  "last": "59009",
  "acting": "4(1),23(2),79(3)"
  },
  {
  "first": "59010",
  "last": "59012",
  "acting": "2(0),4(1),23(2),79(3)"
  },
  {
  "first": "59197",
  "last": "59233",
  "acting": "23(2),24(0),79(3)"
  },
  {
  "first": "59234",
  "last": "59313",
  "acting": "23(2),24(0),72(1),79(3)"
  }
  ]
  }
  ],
  "probing_osds": [
  "2(0)",
  "4(1)",
  "23(2)",
  "24(0)",
  "72(1)",
  "79(3)"
  ],
  "down_osds_we_would_probe": [],
  "peering_blocked_by": [],
  &q

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh


On 13.05.19 11:21 nachm., Dan van der Ster wrote:

Presumably the 2 OSDs you marked as lost were hosting those incomplete PGs?
It would be useful to double confirm that: check with `ceph pg 
query` and `ceph pg dump`.
(If so, this is why the ignore history les thing isn't helping; you
don't have the minimum 3 stripes up for those 3+1 PGs.)


yes, but as written in my other mail, we still have enough shards, at 
least I think so.




If those "lost" OSDs by some miracle still have the PG data, you might
be able to export the relevant PG stripes with the
ceph-objectstore-tool. I've never tried this myself, but there have
been threads in the past where people export a PG from a nearly dead
hdd, import to another OSD, then backfilling works.

guess that is not possible.


If OTOH those PGs are really lost forever, and someone else should
confirm what I say here, I think the next step would be to force
recreate the incomplete PGs then run a set of cephfs scrub/repair
disaster recovery cmds to recover what you can from the cephfs.

-- dan


would this let us recover at least some of the data on the pgs? If not 
we would just set up a new ceph directly without fixing the old one and 
copy whatever is left.


Best regards,

Kevin





On Mon, May 13, 2019 at 4:20 PM Kevin Flöh  wrote:

Dear ceph experts,

we have several (maybe related) problems with our ceph cluster, let me
first show you the current ceph status:

cluster:
  id: 23e72372-0d44-4cad-b24f-3641b14b86f4
  health: HEALTH_ERR
  1 MDSs report slow metadata IOs
  1 MDSs report slow requests
  1 MDSs behind on trimming
  1/126319678 objects unfound (0.000%)
  19 scrub errors
  Reduced data availability: 2 pgs inactive, 2 pgs incomplete
  Possible data damage: 7 pgs inconsistent
  Degraded data redundancy: 1/500333881 objects degraded
(0.000%), 1 pg degraded
  118 stuck requests are blocked > 4096 sec. Implicated osds
24,32,91

services:
  mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
  mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
  mds: cephfs-1/1/1 up  {0=ceph-node02.etp.kit.edu=up:active}, 3
up:standby
  osd: 96 osds: 96 up, 96 in

data:
  pools:   2 pools, 4096 pgs
  objects: 126.32M objects, 260TiB
  usage:   372TiB used, 152TiB / 524TiB avail
  pgs: 0.049% pgs not active
   1/500333881 objects degraded (0.000%)
   1/126319678 objects unfound (0.000%)
   4076 active+clean
   10   active+clean+scrubbing+deep
   7active+clean+inconsistent
   2incomplete
   1active+recovery_wait+degraded

io:
  client:   449KiB/s rd, 42.9KiB/s wr, 152op/s rd, 0op/s wr


and ceph health detail:


HEALTH_ERR 1 MDSs report slow metadata IOs; 1 MDSs report slow requests;
1 MDSs behind on trimming; 1/126319687 objects unfound (0.000%); 19
scrub errors; Reduced data availability: 2 pgs inactive, 2 pgs
incomplete; Possible data damage: 7 pgs inconsistent; Degraded data
redundancy: 1/500333908 objects degraded (0.000%), 1 pg degraded; 118
stuck requests are blocked > 4096 sec. Implicated osds 24,32,91
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
  mdsceph-node02.etp.kit.edu(mds.0): 100+ slow metadata IOs are
blocked > 30 secs, oldest blocked for 351193 secs
MDS_SLOW_REQUEST 1 MDSs report slow requests
  mdsceph-node02.etp.kit.edu(mds.0): 4 slow requests are blocked > 30 sec
MDS_TRIM 1 MDSs behind on trimming
  mdsceph-node02.etp.kit.edu(mds.0): Behind on trimming (46034/128)
max_segments: 128, num_segments: 46034
OBJECT_UNFOUND 1/126319687 objects unfound (0.000%)
  pg 1.24c has 1 unfound objects
OSD_SCRUB_ERRORS 19 scrub errors
PG_AVAILABILITY Reduced data availability: 2 pgs inactive, 2 pgs incomplete
  pg 1.5dd is incomplete, acting [24,4,23,79] (reducing pool ec31
min_size from 3 may help; search ceph.com/docs for 'incomplete')
  pg 1.619 is incomplete, acting [91,23,4,81] (reducing pool ec31
min_size from 3 may help; search ceph.com/docs for 'incomplete')
PG_DAMAGED Possible data damage: 7 pgs inconsistent
  pg 1.17f is active+clean+inconsistent, acting [65,49,25,4]
  pg 1.1e0 is active+clean+inconsistent, acting [11,32,4,81]
  pg 1.203 is active+clean+inconsistent, acting [43,49,4,72]
  pg 1.5d3 is active+clean+inconsistent, acting [37,27,85,4]
  pg 1.779 is active+clean+inconsistent, acting [50,4,77,62]
  pg 1.77c is active+clean+inconsistent, acting [21,49,40,4]
  pg 1.7c3 is active+clean+inconsistent, acting [1,14,68,4]
PG_DEGRADED Degraded data redundancy: 1/500333908 objects degraded
(0.000%), 1 pg degraded
  pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1
unfound
REQUEST_STUCK 118 stuck requests are blocked > 4096 sec. Implicated osds
24,32,91
  118 ops

Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Kevin Flöh

On 13.05.19 10:51 nachm., Lionel Bouton wrote:

On 13/05/2019 at 16:20, Kevin Flöh wrote:

Dear ceph experts,

[...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
Here is what happened: One osd daemon could not be started and 
therefore we decided to mark the osd as lost and set it up from 
scratch. Ceph started recovering and then we lost another osd with 
the same behavior. We did the same as for the first osd.


With 3+1 you only allow a single OSD failure per pg at a given time. 
You have 4096 pgs and 96 osds, having 2 OSD fail at the same time on 2 
separate servers (assuming standard crush rules) is a death sentence 
for the data on some pgs using both of those OSD (the ones not fully 
recovered before the second failure).


OK, so the 2 OSDs (4,23) failed shortly one after the other but we think 
that the recovery of the first was finished before the second failed. 
Nonetheless, both problematic pgs have been on both OSDs. We think that 
we still have enough shards left. For one of the pgs, the recovery state 
looks like this:


    "recovery_state": [
    {
    "name": "Started/Primary/Peering/Incomplete",
    "enter_time": "2019-05-09 16:11:48.625966",
    "comment": "not enough complete instances of this PG"
    },
    {
    "name": "Started/Primary/Peering",
    "enter_time": "2019-05-09 16:11:48.611171",
    "past_intervals": [
    {
    "first": "49767",
    "last": "59313",
    "all_participants": [
    {
    "osd": 2,
    "shard": 0
    },
    {
    "osd": 4,
    "shard": 1
    },
    {
    "osd": 23,
    "shard": 2
    },
    {
    "osd": 24,
    "shard": 0
    },
    {
    "osd": 72,
    "shard": 1
    },
    {
    "osd": 79,
    "shard": 3
    }
    ],
    "intervals": [
    {
    "first": "58860",
    "last": "58861",
    "acting": "4(1),24(0),79(3)"
    },
    {
    "first": "58875",
    "last": "58877",
    "acting": "4(1),23(2),24(0)"
    },
    {
    "first": "59002",
    "last": "59009",
    "acting": "4(1),23(2),79(3)"
    },
    {
    "first": "59010",
    "last": "59012",
    "acting": "2(0),4(1),23(2),79(3)"
    },
    {
    "first": "59197",
    "last": "59233",
    "acting": "23(2),24(0),79(3)"
    },
    {
    "first": "59234",
    "last": "59313",
    "acting": "23(2),24(0),72(1),79(3)"
    }
    ]
    }
    ],
    "probing_osds": [
    "2(0)",
    "4(1)",
    "23(2)",
    "24(0)",
    "72(1)",
    "79(3)"
    ],
    "down_osds_we_would_probe": [],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
    {
    "detail": "peering_blocked_by_history_les_bound"
    }
    ]
    },
    {
    "name": "Started",
   

[ceph-users] Major ceph disaster

2019-05-13 Thread Kevin Flöh
Here is what happened: One osd daemon could not be started and therefore 
we decided to mark the osd as lost and set it up from scratch. Ceph 
started recovering and then we lost another osd with the same behavior. 
We did the same as for the first osd. And now we are stuck with 2 pgs in 
incomplete. Ceph pg query gives the following problem:


    "down_osds_we_would_probe": [],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
    {
    "detail": "peering_blocked_by_history_les_bound"
    }

We already tried to set "osd_find_best_info_ignore_history_les": "true" 
for the affected osds, which had no effect. Furthermore, the cluster is 
behind on trimming by more than 40,000 segments and we have folders and 
files which cannot be deleted or moved. (which are not on the 2 
incomplete pgs). Is there any way to solve these problems?
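For reference, the way this flag is usually applied (as I understand it; the
osd id below is just a placeholder) is per affected OSD in ceph.conf plus an
OSD restart, removing it again once peering has been retried:

  [osd.24]
      osd_find_best_info_ignore_history_les = true
  # then restart that OSD, e.g.: systemctl restart ceph-osd@24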


Best regards,

Kevin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster is not stable

2019-03-12 Thread Kevin Olbrich
Are you sure that firewalld is stopped and disabled?
Looks exactly like that when I missed one host in a test cluster.

Kevin


On Tue, Mar 12, 2019 at 09:31, Zhenshi Zhou wrote:

> Hi,
>
> I deployed a ceph cluster with good performance. But the logs
> indicate that the cluster is not as stable as I think it should be.
>
> The log shows the monitors mark some osd as down periodly:
> [image: image.png]
>
> I didn't find any useful information in osd logs.
>
> ceph version 13.2.4 mimic (stable)
> OS version CentOS 7.6.1810
> kernel version 5.0.0-2.el7
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-26 Thread Kevin Olbrich
dd 0.90999  1.0  932GiB  335GiB  597GiB 35.96 0.79  91
12   hdd 0.90999  1.0  932GiB  357GiB  575GiB 38.28 0.84  96
35   hdd 0.90970  1.0  932GiB  318GiB  614GiB 34.14 0.75  86
 6   ssd 0.43700  1.0  447GiB  278GiB  170GiB 62.08 1.36  63
 7   ssd 0.43700  1.0  447GiB  256GiB  191GiB 57.17 1.25  60
 8   ssd 0.43700  1.0  447GiB  291GiB  156GiB 65.01 1.42  57
31   ssd 0.43660  1.0  447GiB  246GiB  201GiB 54.96 1.20  51
34   ssd 0.43660  1.0  447GiB  189GiB  258GiB 42.22 0.92  46
36   ssd 0.87329  1.0  894GiB  389GiB  506GiB 43.45 0.95  91
37   ssd 0.87329  1.0  894GiB  390GiB  504GiB 43.63 0.96  85
42   ssd 0.87329  1.0  894GiB  401GiB  493GiB 44.88 0.98  92
43   ssd 0.87329  1.0  894GiB  455GiB  439GiB 50.89 1.11  89
17   hdd 0.90999  1.0  932GiB  368GiB  563GiB 39.55 0.87 100
18   hdd 0.90999  1.0  932GiB  350GiB  582GiB 37.56 0.82  95
24   hdd 0.90999  1.0  932GiB  359GiB  572GiB 38.58 0.84  97
26   hdd 0.90999  1.0  932GiB  388GiB  544GiB 41.62 0.91 105
13   ssd 0.43700  1.0  447GiB  322GiB  125GiB 72.12 1.58  80
14   ssd 0.43700  1.0  447GiB  291GiB  156GiB 65.16 1.43  70
15   ssd 0.43700  1.0  447GiB  350GiB 96.9GiB 78.33 1.72  78 <--
16   ssd 0.43700  1.0  447GiB  268GiB  179GiB 60.05 1.31  71
23   hdd 0.90999  1.0  932GiB  364GiB  567GiB 39.08 0.86  98
25   hdd 0.90999  1.0  932GiB  391GiB  541GiB 41.92 0.92 106
27   hdd 0.90999  1.0  932GiB  393GiB  538GiB 42.21 0.92 106
28   hdd 0.90970  1.0  932GiB  467GiB  464GiB 50.14 1.10 126
19   ssd 0.43700  1.0  447GiB  310GiB  137GiB 69.36 1.52  76
20   ssd 0.43700  1.0  447GiB  316GiB  131GiB 70.66 1.55  76
21   ssd 0.43700  1.0  447GiB  323GiB  125GiB 72.13 1.58  80
22   ssd 0.43700  1.0  447GiB  283GiB  164GiB 63.39 1.39  69
38   ssd 0.43660  1.0  447GiB  146GiB  302GiB 32.55 0.71  46
39   ssd 0.43660  1.0  447GiB  142GiB  305GiB 31.84 0.70  43
40   ssd 0.87329  1.0  894GiB  407GiB  487GiB 45.53 1.00  98
41   ssd 0.87329  1.0  894GiB  353GiB  541GiB 39.51 0.87 102
TOTAL 29.9TiB 13.7TiB 16.3TiB 45.66
MIN/MAX VAR: 0.63/1.72  STDDEV: 13.59




Kevin

On Sun, Jan 6, 2019 at 07:34, Konstantin Shalygin wrote:
>
> On 1/5/19 4:17 PM, Kevin Olbrich wrote:
> > root@adminnode:~# ceph osd tree
> > ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
> >   -1   30.82903 root default
> > -16   30.82903 datacenter dc01
> > -19   30.82903 pod dc01-agg01
> > -10   17.43365 rack dc01-rack02
> >   -47.20665 host node1001
> >0   hdd  0.90999 osd.0 up  1.0 1.0
> >1   hdd  0.90999 osd.1 up  1.0 1.0
> >5   hdd  0.90999 osd.5 up  1.0 1.0
> >   29   hdd  0.90970 osd.29up  1.0 1.0
> >   32   hdd  0.90970 osd.32  down0 1.0
> >   33   hdd  0.90970 osd.33up  1.0 1.0
> >2   ssd  0.43700 osd.2 up  1.0 1.0
> >3   ssd  0.43700 osd.3 up  1.0 1.0
> >4   ssd  0.43700 osd.4 up  1.0 1.0
> >   30   ssd  0.43660 osd.30up  1.0 1.0
> >   -76.29724 host node1002
> >9   hdd  0.90999 osd.9 up  1.0 1.0
> >   10   hdd  0.90999 osd.10up  1.0 1.0
> >   11   hdd  0.90999 osd.11up  1.0 1.0
> >   12   hdd  0.90999 osd.12up  1.0 1.0
> >   35   hdd  0.90970 osd.35up  1.0 1.0
> >6   ssd  0.43700 osd.6 up  1.0 1.0
> >7   ssd  0.43700 osd.7 up  1.0 1.0
> >8   ssd  0.43700 osd.8 up  1.0 1.0
> >   31   ssd  0.43660 osd.31up  1.0 1.0
> > -282.18318 host node1005
> >   34   ssd  0.43660 osd.34up  1.0 1.0
> >   36   ssd  0.87329 osd.36up  1.0 1.0
> >   37   ssd  0.87329 osd.37up  1.0 1.0
> > -291.74658 host node1006
> >   42   ssd  0.87329 osd.42up  1.0 1.0
> >   43   ssd  0.87329 osd.43up  1.0 1.0
> > -11   13.39537 rack dc01-rack03
> > -225.38794 host node100

Re: [ceph-users] Rezising an online mounted ext4 on a rbd - failed

2019-01-26 Thread Kevin Olbrich
On Sat, Jan 26, 2019 at 13:43, Götz Reinicke wrote:
>
> Hi,
>
> I have a fileserver which mounted a 4TB rbd, which is ext4 formatted.
>
> I grow that rbd and ext4 starting with an 2TB rbd that way:
>
> rbd resize testpool/disk01--size 4194304
>
> resize2fs /dev/rbd0
>
> Today I wanted to extend that ext4 to 8 TB and did:
>
> rbd resize testpool/disk01--size 8388608
>
> resize2fs /dev/rbd0
>
> => which gives an error: The filesystem is already 1073741824 blocks. Nothing 
> to do.
>
>
> I bet I missed something very simple. Any hint? Thanks and regards . 
> Götz

Try "partprobe" to read device metrics again.

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread KEVIN MICHAEL HRPCEK


On 1/18/19 7:26 AM, Igor Fedotov wrote:

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that bluestore has a max 4GB file size limit based on 
the design of bluestore, not the osd_max_object_size setting. The bluestore 
source seems to suggest this by setting OBJECT_MAX_SIZE to the 32-bit max, 
giving an error if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing 
the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD 
object size can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, or 
maybe I'm reading all this wrong?

You're correct, BlueStore doesn't support object larger than 
OBJECT_MAX_SIZE(i.e. 4Gb)

Thanks for confirming that!


If bluestore has a hard 4GB object limit using radosstriper to break up an 
object would work, but does using an EC pool that breaks up the object to 
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get 
around a 4GB limit? We use rados directly and would like to move to bluestore 
but we have some large objects <= 13G that may need attention if this 4GB limit 
does exist and an ec pool doesn't get around it.
Theoretically, object splitting using EC might help. But I'm not sure whether one 
needs to set osd_max_object_size greater than 4GB to permit 13GB objects in an 
EC pool. If that's needed, then the osd_max_object_size <= OBJECT_MAX_SIZE 
constraint is violated and BlueStore wouldn't start.
In my experience I had to increase osd_max_object_size from the 128M default 
(which it changed to a couple of versions ago) to ~20G to be able to write our 
largest objects with some margin. Do you think there is another way to handle 
osd_max_object_size > OBJECT_MAX_SIZE so that bluestore will start, and EC pools 
or striping can be used to write objects that are greater than OBJECT_MAX_SIZE 
but where each stripe/shard ends up smaller than OBJECT_MAX_SIZE?
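If striping turns out to be the way to go, the rados CLI can exercise
libradosstriper directly -- a hedged example, with pool and object names made
up:

  rados -p <pool> --striper put <object> <file>    # writes the object as striped sub-objects
  rados -p <pool> --striper get <object> <file>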



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

 // sanity check(s)
  auto osd_max_object_size =
cct->_conf.get_val("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
  << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  
std::dec << dendl;
return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
r = -E2BIG;
  } else {
_assign_nid(txc, o);
r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
txc->write_onode(o);
  }

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Thanks,

Igor

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore 32bit max_object_size limit

2019-01-17 Thread KEVIN MICHAEL HRPCEK
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that bluestore has a max 4GB file size limit based on 
the design of bluestore, not the osd_max_object_size setting. The bluestore 
source seems to suggest this by setting OBJECT_MAX_SIZE to the 32-bit max, 
giving an error if osd_max_object_size is > OBJECT_MAX_SIZE, and not writing 
the data if offset+length >= OBJECT_MAX_SIZE. So it seems like the in-OSD 
object size can't exceed 32 bits, which is 4GB, like FAT32. Am I correct, or 
maybe I'm reading all this wrong?

If bluestore has a hard 4GB object limit using radosstriper to break up an 
object would work, but does using an EC pool that breaks up the object to 
shards smaller than OBJECT_MAX_SIZE have the same effect as radosstriper to get 
around a 4GB limit? We use rados directly and would like to move to bluestore 
but we have some large objects <= 13G that may need attention if this 4GB limit 
does exist and an ec pool doesn't get around it.


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

 // sanity check(s)
  auto osd_max_object_size =
cct->_conf.get_val("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
derr << __func__ << " osd_max_object_size >= 0x" << std::hex << 
OBJECT_MAX_SIZE
  << "; BlueStore has hard limit of 0x" << OBJECT_MAX_SIZE << "." <<  
std::dec << dendl;
return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
r = -E2BIG;
  } else {
_assign_nid(txc, o);
r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
txc->write_onode(o);
  }

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck in creating+peering state

2019-01-17 Thread Kevin Olbrich
Are you sure, no service like firewalld is running?
Did you check that all machines have the same MTU and jumbo frames are
enabled if needed?

I had this problem when I first started with ceph and forgot to
disable firewalld.
Replication worked perfectly fine but the OSD was kicked out every few seconds.
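Quick checks along those lines (interface name and MTU are just examples):

  systemctl status firewalld            # should be inactive/disabled on every node
  ip link show eth0 | grep mtu          # MTU must match across all nodes
  ping -M do -s 8972 <other node>       # tests a 9000-byte MTU path end to end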

Kevin

On Thu, Jan 17, 2019 at 11:57, Johan Thomsen wrote:
>
> Hi,
>
> I have a sad ceph cluster.
> All my osds complain about failed reply on heartbeat, like so:
>
> osd.10 635 heartbeat_check: no reply from 192.168.160.237:6810 osd.42
> ever on either front or back, first ping sent 2019-01-16
> 22:26:07.724336 (cutoff 2019-01-16 22:26:08.225353)
>
> .. I've checked the network sanity all I can, and all ceph ports are
> open between nodes both on the public network and the cluster network,
> and I have no problems sending traffic back and forth between nodes.
> I've tried tcpdump'ing and traffic is passing in both directions
> between the nodes, but unfortunately I don't natively speak the ceph
> protocol, so I can't figure out what's going wrong in the heartbeat
> conversation.
>
> Still:
>
> # ceph health detail
>
> HEALTH_WARN nodown,noout flag(s) set; Reduced data availability: 1072
> pgs inactive, 1072 pgs peering
> OSDMAP_FLAGS nodown,noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 1072 pgs inactive, 1072 pgs peering
> pg 7.3cd is stuck inactive for 245901.560813, current state
> creating+peering, last acting [13,41,1]
> pg 7.3ce is stuck peering for 245901.560813, current state
> creating+peering, last acting [1,40,7]
> pg 7.3cf is stuck peering for 245901.560813, current state
> creating+peering, last acting [0,42,9]
> pg 7.3d0 is stuck peering for 245901.560813, current state
> creating+peering, last acting [20,8,38]
> pg 7.3d1 is stuck peering for 245901.560813, current state
> creating+peering, last acting [10,20,42]
>()
>
>
> I've set "noout" and "nodown" to prevent all osd's from being removed
> from the cluster. They are all running and marked as "up".
>
> # ceph osd tree
>
> ID  CLASS WEIGHTTYPE NAME  STATUS REWEIGHT PRI-AFF
>  -1   249.73434 root default
> -25   166.48956 datacenter m1
> -2483.24478 pod kube1
> -3541.62239 rack 10
> -3441.62239 host ceph-sto-p102
>  40   hdd   7.27689 osd.40 up  1.0 1.0
>  41   hdd   7.27689 osd.41 up  1.0 1.0
>  42   hdd   7.27689 osd.42 up  1.0 1.0
>()
>
> I'm at a point where I don't know which options and what logs to check 
> anymore?
>
> Any debug hint would be very much appreciated.
>
> btw. I have no important data in the cluster (yet), so if the solution
> is to drop all osd and recreate them, it's ok for now. But I'd really
> like to know how the cluster ended in this state.
>
> /Johan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with CephFS - No space left on device

2019-01-08 Thread Kevin Olbrich
It would but you should not:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html

Kevin

On Tue, Jan 8, 2019 at 15:35, Rodrigo Embeita wrote:
>
> Thanks again Kevin.
> If I reduce the size flag to a value of 2, that should fix the problem?
>
> Regards
>
> On Tue, Jan 8, 2019 at 11:28 AM Kevin Olbrich  wrote:
>>
>> You use replication 3 failure-domain host.
>> OSDs 2 and 4 are full; that's why your pool is also full.
>> You need to add two disks to pf-us1-dfs3 or swap one from the larger
>> nodes to this one.
>>
>> Kevin
>>
>> On Tue, Jan 8, 2019 at 15:20, Rodrigo Embeita wrote:
>> >
>> > Hi Yoann, thanks for your response.
>> > Here are the results of the commands.
>> >
>> > root@pf-us1-dfs2:/var/log/ceph# ceph osd df
>> > ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
>> > 0   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 571 GiB 92.33 1.74 310
>> > 5   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.18 1.45 271
>> > 6   hdd 7.27739  1.0 7.3 TiB 609 GiB 6.7 TiB  8.17 0.15  49
>> > 8   hdd 7.27739  1.0 7.3 TiB 2.5 GiB 7.3 TiB  0.030  42
>> > 1   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.28 1.45 285
>> > 3   hdd 7.27739  1.0 7.3 TiB 6.9 TiB 371 GiB 95.02 1.79 296
>> > 7   hdd 7.27739  1.0 7.3 TiB 360 GiB 6.9 TiB  4.84 0.09  53
>> > 9   hdd 7.27739  1.0 7.3 TiB 4.1 GiB 7.3 TiB  0.06 0.00  38
>> > 2   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 576 GiB 92.27 1.74 321
>> > 4   hdd 7.27739  1.0 7.3 TiB 6.1 TiB 1.2 TiB 84.10 1.58 351
>> >TOTAL  73 TiB  39 TiB  34 TiB 53.13
>> > MIN/MAX VAR: 0/1.79  STDDEV: 41.15
>> >
>> >
>> > root@pf-us1-dfs2:/var/log/ceph# ceph osd pool ls detail
>> > pool 1 'poolcephfs' replicated size 3 min_size 2 crush_rule 0 object_hash 
>> > rjenkins pg_num 128 pgp_num 128 last_change 471 fla
>> > gs hashpspool,full stripe_width 0
>> > pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
>> > rjenkins pg_num 256 pgp_num 256 last_change 471 lf
>> > or 0/439 flags hashpspool,full stripe_width 0 application cephfs
>> > pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
>> > object_hash rjenkins pg_num 256 pgp_num 256 last_change 47
>> > 1 lfor 0/448 flags hashpspool,full stripe_width 0 application cephfs
>> > pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
>> > rjenkins pg_num 8 pgp_num 8 last_change 471 flags ha
>> > shpspool,full stripe_width 0 application rgw
>> > pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 
>> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 47
>> > 1 flags hashpspool,full stripe_width 0 application rgw
>> > pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 
>> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 f
>> > lags hashpspool,full stripe_width 0 application rgw
>> > pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 
>> > object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 fl
>> > ags hashpspool,full stripe_width 0 application rgw
>> >
>> >
>> > root@pf-us1-dfs2:/var/log/ceph# ceph osd tree
>> > ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
>> > -1   72.77390 root default
>> > -3   29.10956 host pf-us1-dfs1
>> > 0   hdd  7.27739 osd.0up  1.0 1.0
>> > 5   hdd  7.27739 osd.5up  1.0 1.0
>> > 6   hdd  7.27739 osd.6up  1.0 1.0
>> > 8   hdd  7.27739 osd.8up  1.0 1.0
>> > -5   29.10956 host pf-us1-dfs2
>> > 1   hdd  7.27739 osd.1up  1.0 1.0
>> > 3   hdd  7.27739 osd.3up  1.0 1.0
>> > 7   hdd  7.27739 osd.7up  1.0 1.0
>> > 9   hdd  7.27739 osd.9up  1.0 1.0
>> > -7   14.55478 host pf-us1-dfs3
>> > 2   hdd  7.27739 osd.2up  1.0 1.0
>> > 4   hdd  7.27739 osd.4up  1.0 1.0
>> >
>> >
>> > Thanks for your help guys.
>> >
>> >
>> > On Tue, Jan 8, 2019 at 10:36 AM Yoann Moulin  wrote:
>> >>
>> >> Hello,
>> >>
>> >> > Hi guys, I need your help.
>> >> > I'm new with Cephfs and we started using it 

Re: [ceph-users] Problem with CephFS - No space left on device

2019-01-08 Thread Kevin Olbrich
You use replication 3 failure-domain host.
OSDs 2 and 4 are full; that's why your pool is also full.
You need to add two disks to pf-us1-dfs3 or swap one from the larger
nodes to this one.

Kevin

On Tue, Jan 8, 2019 at 15:20, Rodrigo Embeita wrote:
>
> Hi Yoann, thanks for your response.
> Here are the results of the commands.
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
> 0   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 571 GiB 92.33 1.74 310
> 5   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.18 1.45 271
> 6   hdd 7.27739  1.0 7.3 TiB 609 GiB 6.7 TiB  8.17 0.15  49
> 8   hdd 7.27739  1.0 7.3 TiB 2.5 GiB 7.3 TiB  0.030  42
> 1   hdd 7.27739  1.0 7.3 TiB 5.6 TiB 1.7 TiB 77.28 1.45 285
> 3   hdd 7.27739  1.0 7.3 TiB 6.9 TiB 371 GiB 95.02 1.79 296
> 7   hdd 7.27739  1.0 7.3 TiB 360 GiB 6.9 TiB  4.84 0.09  53
> 9   hdd 7.27739  1.0 7.3 TiB 4.1 GiB 7.3 TiB  0.06 0.00  38
> 2   hdd 7.27739  1.0 7.3 TiB 6.7 TiB 576 GiB 92.27 1.74 321
> 4   hdd 7.27739  1.0 7.3 TiB 6.1 TiB 1.2 TiB 84.10 1.58 351
>TOTAL  73 TiB  39 TiB  34 TiB 53.13
> MIN/MAX VAR: 0/1.79  STDDEV: 41.15
>
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd pool ls detail
> pool 1 'poolcephfs' replicated size 3 min_size 2 crush_rule 0 object_hash 
> rjenkins pg_num 128 pgp_num 128 last_change 471 fla
> gs hashpspool,full stripe_width 0
> pool 2 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
> rjenkins pg_num 256 pgp_num 256 last_change 471 lf
> or 0/439 flags hashpspool,full stripe_width 0 application cephfs
> pool 3 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 
> object_hash rjenkins pg_num 256 pgp_num 256 last_change 47
> 1 lfor 0/448 flags hashpspool,full stripe_width 0 application cephfs
> pool 4 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
> rjenkins pg_num 8 pgp_num 8 last_change 471 flags ha
> shpspool,full stripe_width 0 application rgw
> pool 5 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 47
> 1 flags hashpspool,full stripe_width 0 application rgw
> pool 6 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 f
> lags hashpspool,full stripe_width 0 application rgw
> pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 471 fl
> ags hashpspool,full stripe_width 0 application rgw
>
>
> root@pf-us1-dfs2:/var/log/ceph# ceph osd tree
> ID CLASS WEIGHT   TYPE NAMESTATUS REWEIGHT PRI-AFF
> -1   72.77390 root default
> -3   29.10956 host pf-us1-dfs1
> 0   hdd  7.27739 osd.0up  1.0 1.0
> 5   hdd  7.27739 osd.5up  1.0 1.0
> 6   hdd  7.27739 osd.6up  1.0 1.0
> 8   hdd  7.27739 osd.8up  1.0 1.0
> -5   29.10956 host pf-us1-dfs2
> 1   hdd  7.27739 osd.1up  1.0 1.0
> 3   hdd  7.27739 osd.3up  1.0 1.0
> 7   hdd  7.27739 osd.7up  1.0 1.0
> 9   hdd  7.27739 osd.9up  1.0 1.0
> -7   14.55478 host pf-us1-dfs3
> 2   hdd  7.27739 osd.2up  1.0 1.0
> 4   hdd  7.27739 osd.4up  1.0 1.0
>
>
> Thanks for your help guys.
>
>
> On Tue, Jan 8, 2019 at 10:36 AM Yoann Moulin  wrote:
>>
>> Hello,
>>
>> > Hi guys, I need your help.
>> > I'm new with Cephfs and we started using it as file storage.
>> > Today we are getting no space left on device but I'm seeing that we have 
>> > plenty space on the filesystem.
>> > Filesystem  Size  Used Avail Use% Mounted on
>> > 192.168.51.8,192.168.51.6,192.168.51.118:6789:/pagefreezer/smhosts   73T   
>> > 39T   35T  54% /mnt/cephfs
>> >
>> > We have 35TB of disk space. I've added 2 additional OSD disks with 7TB 
>> > each but I'm getting the error "No space left on device" every time that
>> > I want to add a new file.
>> > After adding the 2 additional OSD disks I'm seeing that the load is being 
>> > distributed among the cluster.
>> > Please I need your help.
>>
>> Could you give us the output of
>>
>> ceph osd df
>> ceph osd pool ls detail
>> ceph osd tree
>>
>> Best regards,
>>
>> --
>> Yoann Moulin
>> EPFL IC-IT
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with CephFS - No space left on device

2019-01-08 Thread Kevin Olbrich
Looks like the same problem like mine:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-January/032054.html

The free space shown is the total; Ceph's usable space is limited by the OSD
with the least free space (the worst OSD).
Please check your (re-)weights.
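For example (dry-run first; the 110 threshold is just a common starting point):

  ceph osd df                                  # the %USE column shows the fullest OSD
  ceph osd test-reweight-by-utilization 110    # dry run
  ceph osd reweight-by-utilization 110         # lowers reweight of OSDs above 110% of the mean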

Kevin

On Tue, Jan 8, 2019 at 14:32, Rodrigo Embeita wrote:
>
> Hi guys, I need your help.
> I'm new with Cephfs and we started using it as file storage.
> Today we are getting no space left on device but I'm seeing that we have 
> plenty space on the filesystem.
> Filesystem  Size  Used Avail Use% Mounted on
> 192.168.51.8,192.168.51.6,192.168.51.118:6789:/pagefreezer/smhosts   73T   
> 39T   35T  54% /mnt/cephfs
>
> We have 35TB of disk space. I've added 2 additional OSD disks with 7TB each 
> but I'm getting the error "No space left on device" every time that I want to 
> add a new file.
> After adding the 2 additional OSD disks I'm seeing that the load is being 
> distributed among the cluster.
> Please I need your help.
>
> root@pf-us1-dfs1:/etc/ceph# ceph -s
>  cluster:
>id: 609e9313-bdd3-449e-a23f-3db8382e71fb
>health: HEALTH_ERR
>2 backfillfull osd(s)
>1 full osd(s)
>7 pool(s) full
>197313040/508449063 objects misplaced (38.807%)
>Degraded data redundancy: 2/508449063 objects degraded (0.000%), 2 
> pgs degraded
>Degraded data redundancy (low space): 16 pgs backfill_toofull, 3 
> pgs recovery_toofull
>
>  services:
>mon: 3 daemons, quorum pf-us1-dfs2,pf-us1-dfs1,pf-us1-dfs3
>mgr: pf-us1-dfs3(active), standbys: pf-us1-dfs2
>mds: pagefs-2/2/2 up  {0=pf-us1-dfs3=up:active,1=pf-us1-dfs1=up:active}, 1 
> up:standby
>osd: 10 osds: 10 up, 10 in; 189 remapped pgs
>rgw: 1 daemon active
>
>  data:
>pools:   7 pools, 416 pgs
>objects: 169.5 M objects, 3.6 TiB
>usage:   39 TiB used, 34 TiB / 73 TiB avail
>pgs: 2/508449063 objects degraded (0.000%)
> 197313040/508449063 objects misplaced (38.807%)
> 224 active+clean
> 168 active+remapped+backfill_wait
> 16  active+remapped+backfill_wait+backfill_toofull
> 5   active+remapped+backfilling
> 2   active+recovery_toofull+degraded
> 1   active+recovery_toofull
>
>  io:
>recovery: 1.1 MiB/s, 31 objects/s
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer=on with crush-compat mode

2019-01-05 Thread Kevin Olbrich
If I understand the balancer correctly, it balances PGs, not data.
This worked perfectly fine in your case.

I prefer a PG count of ~100 per OSD; you are at 30. Maybe it would
help to bump the PGs.

Kevin

On Sat, Jan 5, 2019 at 14:39, Marc Roos wrote:
>
>
> I have straw2, balancer=on, crush-compat and it gives worst spread over
> my ssd drives (4 only) being used by only 2 pools. One of these pools
> has pg 8. Should I increase this to 16 to create a better result, or
> will it never be any better.
>
> For now I like to stick to crush-compat, so I can use a default centos7
> kernel.
>
> Luminous 12.2.8, 3.10.0-862.14.4.el7.x86_64, CentOS Linux release
> 7.5.1804 (Core)
>
>
>
> [@c01 ~]# cat balancer-1-before.txt | egrep '^19|^20|^21|^30'
> 19   ssd 0.48000  1.0  447GiB  164GiB  283GiB 36.79 0.93  31
> 20   ssd 0.48000  1.0  447GiB  136GiB  311GiB 30.49 0.77  32
> 21   ssd 0.48000  1.0  447GiB  215GiB  232GiB 48.02 1.22  30
> 30   ssd 0.48000  1.0  447GiB  151GiB  296GiB 33.72 0.86  27
>
> [@c01 ~]# ceph osd df | egrep '^19|^20|^21|^30'
> 19   ssd 0.48000  1.0  447GiB  157GiB  290GiB 35.18 0.87  30
> 20   ssd 0.48000  1.0  447GiB  125GiB  322GiB 28.00 0.69  30
> 21   ssd 0.48000  1.0  447GiB  245GiB  202GiB 54.71 1.35  30
> 30   ssd 0.48000  1.0  447GiB  217GiB  230GiB 48.46 1.20  30
>
> [@c01 ~]# ceph osd pool ls detail | egrep 'fs_meta|rbd.ssd'
> pool 19 'fs_meta' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 16 pgp_num 16 last_change 22425 lfor 0/9035 flags
> hashpspool stripe_width 0 application cephfs
> pool 54 'rbd.ssd' replicated size 3 min_size 2 crush_rule 5 object_hash
> rjenkins pg_num 8 pgp_num 8 last_change 24666 flags hashpspool
> stripe_width 0 application rbd
>
> [@c01 ~]# ceph df |egrep 'ssd|fs_meta'
> fs_meta   19  170MiB  0.07
> 240GiB 2451382
> fs_data.ssd   33  0B 0
> 240GiB   0
> rbd.ssd   54  266GiB 52.57
> 240GiB   75902
> fs_data.ec21.ssd  55  0B 0
> 480GiB   0
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-05 Thread Kevin Olbrich
osd.33
  2   ssd  0.43700  1.0  447GiB  271GiB  176GiB 60.67 1.30  50
osd.2
  3   ssd  0.43700  1.0  447GiB  249GiB  198GiB 55.62 1.19  58
osd.3
  4   ssd  0.43700  1.0  447GiB  297GiB  150GiB 66.39 1.42  56
osd.4
 30   ssd  0.43660  1.0  447GiB  236GiB  211GiB 52.85 1.13  48
osd.30
 -76.29724- 6.29TiB 2.74TiB 3.55TiB 43.53 0.93   -
host node1002
  9   hdd  0.90999  1.0  932GiB  354GiB  578GiB 37.96 0.81  95
osd.9
 10   hdd  0.90999  1.0  932GiB  357GiB  575GiB 38.28 0.82  96
osd.10
 11   hdd  0.90999  1.0  932GiB  318GiB  613GiB 34.18 0.73  86
osd.11
 12   hdd  0.90999  1.0  932GiB  373GiB  558GiB 40.09 0.86 100
osd.12
 35   hdd  0.90970  1.0  932GiB  343GiB  588GiB 36.83 0.79  92
osd.35
  6   ssd  0.43700  1.0  447GiB  269GiB  178GiB 60.20 1.29  60
osd.6
  7   ssd  0.43700  1.0  447GiB  249GiB  198GiB 55.69 1.19  56
osd.7
  8   ssd  0.43700  1.0  447GiB  286GiB  161GiB 63.95 1.37  56
osd.8
 31   ssd  0.43660  1.0  447GiB  257GiB  190GiB 57.47 1.23  55
osd.31
-282.18318- 2.18TiB  968GiB 1.24TiB 43.29 0.93   -
host node1005
 34   ssd  0.43660  1.0  447GiB  202GiB  245GiB 45.14 0.97  47
osd.34
 36   ssd  0.87329  1.0  894GiB  405GiB  489GiB 45.28 0.97  91
osd.36
 37   ssd  0.87329  1.0  894GiB  361GiB  533GiB 40.38 0.87  79
osd.37
-291.74658- 1.75TiB  888GiB  900GiB 49.65 1.06   -
host node1006
 42   ssd  0.87329  1.0  894GiB  417GiB  477GiB 46.68 1.00  92
osd.42
 43   ssd  0.87329  1.0  894GiB  471GiB  424GiB 52.63 1.13  90
osd.43
-11   13.39537- 13.4TiB 6.64TiB 6.75TiB 49.60 1.06   -
rack dc01-rack03
-225.38794- 5.39TiB 2.70TiB 2.69TiB 50.14 1.07   -
host node1003
 17   hdd  0.90999  1.0  932GiB  371GiB  560GiB 39.83 0.85 100
osd.17
 18   hdd  0.90999  1.0  932GiB  390GiB  542GiB 41.82 0.90 105
osd.18
 24   hdd  0.90999  1.0  932GiB  352GiB  580GiB 37.77 0.81  94
osd.24
 26   hdd  0.90999  1.0  932GiB  387GiB  545GiB 41.54 0.89 104
osd.26
 13   ssd  0.43700  1.0  447GiB  319GiB  128GiB 71.32 1.53  77
osd.13
 14   ssd  0.43700  1.0  447GiB  303GiB  144GiB 67.76 1.45  70
osd.14
 15   ssd  0.43700  1.0  447GiB  361GiB 86.4GiB 80.67 1.73  77
osd.15
 16   ssd  0.43700  1.0  447GiB  283GiB  164GiB 63.29 1.36  71
osd.16
-255.38765- 5.39TiB 2.83TiB 2.56TiB 52.55 1.13   -
host node1004
 23   hdd  0.90999  1.0  932GiB  382GiB  549GiB 41.05 0.88 102
osd.23
 25   hdd  0.90999  1.0  932GiB  412GiB  520GiB 44.20 0.95 111
osd.25
 27   hdd  0.90999  1.0  932GiB  385GiB  546GiB 41.36 0.89 103
osd.27
 28   hdd  0.90970  1.0  932GiB  462GiB  469GiB 49.64 1.06 124
osd.28
 19   ssd  0.43700  1.0  447GiB  314GiB  133GiB 70.22 1.51  75
osd.19
 20   ssd  0.43700  1.0  447GiB  327GiB  120GiB 73.06 1.57  76
osd.20
 21   ssd  0.43700  1.0  447GiB  324GiB  123GiB 72.45 1.55  77
osd.21
 22   ssd  0.43700  1.0  447GiB  292GiB  156GiB 65.21 1.40  68
osd.22
-302.61978- 2.62TiB 1.11TiB 1.51TiB 42.43 0.91   -
host node1007
 38   ssd  0.43660  1.0  447GiB  165GiB  283GiB 36.82 0.79  46
osd.38
 39   ssd  0.43660  1.0  447GiB  156GiB  292GiB 34.79 0.75  42
osd.39
 40   ssd  0.87329  1.0  894GiB  429GiB  466GiB 47.94 1.03  98
osd.40
 41   ssd  0.87329  1.0  894GiB  389GiB  505GiB 43.55 0.93 103
osd.41
  TOTAL 29.9TiB 14.0TiB 16.0TiB 46.65
MIN/MAX VAR: 0.65/1.73  STDDEV: 13.30

=
root@adminnode:~# ceph df && ceph -v
GLOBAL:
SIZEAVAIL   RAW USED %RAW USED
29.9TiB 16.0TiB  14.0TiB 46.65
POOLS:
NAME  ID USED%USED MAX AVAIL OBJECTS
rbd_vms_ssd   2   986GiB 49.83993GiB  262606
rbd_vms_hdd   3  3.76TiB 48.94   3.92TiB  992255
rbd_vms_ssd_014   372KiB 0662GiB 148
rbd_vms_ssd_01_ec 6  2.85TiB 68.81   1.29TiB  770506

ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

Kevin

Am Sa., 5. Jan. 2019 um 05:12 Uhr schrieb Konstantin Shalygin :
>
> On 1/5/19 1:51 AM, Kevin Olbrich wrote:
> &

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
Hi Arun,

actually deleting was not a good idea; that's why I wrote that the OSDs
should be "out".
You have down PGs because the data is on OSDs that are
unavailable but still known by the cluster.
This can be checked by using "ceph pg 0.5 query" (change PG name).
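For example, with one of the PGs from your health detail (field names are
from memory, so double-check them in your own output):

ceph pg dump_stuck inactive
ceph pg 10.90e query | less
# look at "recovery_state" / "down_osds_we_would_probe" for OSD ids that no longer exist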

Because your PG count is so heavily oversized, the PG overdose limits get
hit on every recovery in your cluster.
I had the same problem on a medium cluster when I added too many new
disks at once.
You already got this info from Caspar earlier in this thread.

https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/
https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/

The second link shows one of the config params you need to inject to
all your OSDs like this:
ceph tell osd.* injectargs --mon_max_pg_per_osd 1

This might help you get these PGs into some sort of "active" state
(+recovery/+degraded/+inconsistent/etc.).
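To make this stick across restarts, I would do roughly the following (the
value is only a placeholder, derive it from your real PGs-per-OSD ratio;
osd_max_pg_per_osd_hard_ratio may need raising as well):

ceph tell osd.* injectargs '--mon_max_pg_per_osd 4000'
# repeat for each monitor: ceph tell mon.<id> injectargs '--mon_max_pg_per_osd 4000'
# and persist it in /etc/ceph/ceph.conf under [global]:
#   mon_max_pg_per_osd = 4000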

The down PGs will most likely never come back. I would bet you will
find OSD IDs in the acting set that are invalid, meaning that
non-existent OSDs hold your data.
I had a similar problem on a test cluster with erasure-coded pools
where too many disks failed at the same time; you will then see
negative values as OSD IDs.

Maybe this helps a little bit.

Kevin

Am Sa., 5. Jan. 2019 um 00:20 Uhr schrieb Arun POONIA
:
>
> Hi Kevin,
>
> I tried deleting newly added server from Ceph Cluster and looks like Ceph is 
> not recovering. I agree with unfound data but it doesn't say about unfound 
> data. It says inactive/down for PGs and I can't bring them up.
>
>
> [root@fre101 ~]# ceph health detail
> 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0) 
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
> bind the UNIX domain socket to 
> '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2) No 
> such file or directory
> HEALTH_ERR 3 pools have many more objects per pg than average; 
> 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation; 
> Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering, 
> 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded 
> (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are 
> blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs 
> per OSD (3003 > max 200)
> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
> pool glance-images objects per pg (10478) is more than 92.7257 times 
> cluster average (113)
> pool vms objects per pg (4722) is more than 41.7876 times cluster average 
> (113)
> pool volumes objects per pg (1220) is more than 10.7965 times cluster 
> average (113)
> OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
> PENDING_CREATING_PGS 6517 PGs pending on creation
> osds 
> [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
>  have pending PGs.
> PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs down, 
> 2 pgs peering, 2703 pgs stale
> pg 10.90e is stuck inactive for 94928.999109, current state activating, 
> last acting [2,6]
> pg 10.913 is stuck inactive for 95094.175400, current state activating, 
> last acting [9,5]
> pg 10.915 is stuck inactive for 94929.184177, current state activating, 
> last acting [30,26]
> pg 11.907 is stuck stale for 9612.906582, current state 
> stale+active+clean, last acting [38,24]
> pg 11.910 is stuck stale for 11822.359237, current state stale+down, last 
> acting [21]
> pg 11.915 is stuck stale for 9612.906604, current state 
> stale+active+clean, last acting [38,31]
> pg 11.919 is stuck inactive for 95636.716568, current state activating, 
> last acting [25,12]
> pg 12.902 is stuck stale for 10810.497213, current state 
> stale+activating, last acting [36,14]
> pg 13.901 is stuck stale for 94889.512234, current state 
> stale+active+clean, last acting [1,31]
> pg 13.904 is stuck stale for 10745.279158, current state 
> stale+active+clean, last acting [37,8]
> pg 13.908 is stuck stale for 10745.279176, current state 
> stale+active+clean, last acting [37,19]
> pg 13.909 is stuck inactive for 95370.129659, current state activating, 
> last acting [34,19]
> pg 13.90e is stuck inactive for 95370.379694, current state activating, 
> last acting [21,20]
> pg 13.911 is stuck inactive for 98449.317873, current state activating, 
> last acting [25,22]
> pg 13.914 is stuck stale for 11827.503651, current state sta

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
I don't think this will help you. Unfound means the cluster is unable
to find the data anywhere (it's lost).
It would be sufficient to shut down the new host - the OSDs will then be out.

You can also force-heal the cluster, something like "do your best possible":

ceph pg 2.5 mark_unfound_lost revert|delete

Src: http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/
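
In full, I would first check what is actually unfound and only then decide
between revert and delete (the PG id is a placeholder):

ceph health detail | grep unfound
ceph pg 2.5 list_unfound
ceph pg 2.5 mark_unfound_lost revert   # or "delete" if the objects are expendable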

Kevin

Am Fr., 4. Jan. 2019 um 20:47 Uhr schrieb Arun POONIA
:
>
> Hi Kevin,
>
> Can I remove newly added server from Cluster and see if it heals cluster ?
>
> When I check Hard Disk Iops on new server which are very low compared to 
> existing cluster server.
>
> Indeed this is a critical cluster but I don't have expertise to make it 
> flawless.
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich  wrote:
>>
>> If you realy created and destroyed OSDs before the cluster healed
>> itself, this data will be permanently lost (not found / inactive).
>> Also your PG count is so much oversized, the calculation for peering
>> will most likely break because this was never tested.
>>
>> If this is a critical cluster, I would start a new one and bring back
>> the backups (using a better PG count).
>>
>> Kevin
>>
>> Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA
>> :
>> >
>> > Can anyone comment on this issue please, I can't seem to bring my cluster 
>> > healthy.
>> >
>> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA  
>> > wrote:
>> >>
>> >> Hi Caspar,
>> >>
>> >> Number of IOPs are also quite low. It used be around 1K Plus on one of 
>> >> Pool (VMs) now its like close to 10-30 .
>> >>
>> >> Thansk
>> >> Arun
>> >>
>> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA 
>> >>  wrote:
>> >>>
>> >>> Hi Caspar,
>> >>>
>> >>> Yes and No, numbers are going up and down. If I run ceph -s command I 
>> >>> can see it decreases one time and later it increases again. I see there 
>> >>> are so many blocked/slow requests. Almost all the OSDs have slow 
>> >>> requests. Around 12% PGs are inactive not sure how to activate them 
>> >>> again.
>> >>>
>> >>>
>> >>> [root@fre101 ~]# ceph health detail
>> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) 
>> >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
>> >>> to bind the UNIX domain socket to 
>> >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': 
>> >>> (2) No such file or directory
>> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than 
>> >>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on 
>> >>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 
>> >>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 
>> >>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 
>> >>> slow requests are blocked > 32 sec; 551 stuck requests are blocked > 
>> >>> 4096 sec; too many PGs per OSD (2709 > max 200)
>> >>> OSD_DOWN 1 osds down
>> >>> osd.28 (root=default,host=fre119) is down
>> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>> >>> pool glance-images objects per pg (10478) is more than 92.7257 times 
>> >>> cluster average (113)
>> >>> pool vms objects per pg (4717) is more than 41.7434 times cluster 
>> >>> average (113)
>> >>> pool volumes objects per pg (1220) is more than 10.7965 times 
>> >>> cluster average (113)
>> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>> >>> PENDING_CREATING_PGS 3610 PGs pending on creation
>> >>> osds 
>> >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>> >>>  have pending PGs.
>> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs 
>> >>> down, 86 pgs peering, 850 pgs stale
>> >>> pg 10.900 is down, acting [18]
>> >>> pg 10.90e is stuck inactive for 60266.030164, current state 
>> >&g

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
If you realy created and destroyed OSDs before the cluster healed
itself, this data will be permanently lost (not found / inactive).
Also your PG count is so much oversized, the calculation for peering
will most likely break because this was never tested.

If this is a critical cluster, I would start a new one and bring back
the backups (using a better PG count).

Kevin

Am Fr., 4. Jan. 2019 um 20:25 Uhr schrieb Arun POONIA
:
>
> Can anyone comment on this issue please, I can't seem to bring my cluster 
> healthy.
>
> On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA  
> wrote:
>>
>> Hi Caspar,
>>
>> Number of IOPs are also quite low. It used be around 1K Plus on one of Pool 
>> (VMs) now its like close to 10-30 .
>>
>> Thansk
>> Arun
>>
>> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA  
>> wrote:
>>>
>>> Hi Caspar,
>>>
>>> Yes and No, numbers are going up and down. If I run ceph -s command I can 
>>> see it decreases one time and later it increases again. I see there are so 
>>> many blocked/slow requests. Almost all the OSDs have slow requests. Around 
>>> 12% PGs are inactive not sure how to activate them again.
>>>
>>>
>>> [root@fre101 ~]# ceph health detail
>>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) 
>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
>>> bind the UNIX domain socket to 
>>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2) 
>>> No such file or directory
>>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than average; 
>>> 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on creation; 
>>> Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 pgs 
>>> peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 objects 
>>> degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow 
>>> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec; 
>>> too many PGs per OSD (2709 > max 200)
>>> OSD_DOWN 1 osds down
>>> osd.28 (root=default,host=fre119) is down
>>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>>> pool glance-images objects per pg (10478) is more than 92.7257 times 
>>> cluster average (113)
>>> pool vms objects per pg (4717) is more than 41.7434 times cluster 
>>> average (113)
>>> pool volumes objects per pg (1220) is more than 10.7965 times cluster 
>>> average (113)
>>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>>> PENDING_CREATING_PGS 3610 PGs pending on creation
>>> osds 
>>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>>>  have pending PGs.
>>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs 
>>> down, 86 pgs peering, 850 pgs stale
>>> pg 10.900 is down, acting [18]
>>> pg 10.90e is stuck inactive for 60266.030164, current state activating, 
>>> last acting [2,38]
>>> pg 10.913 is stuck stale for 1887.552862, current state stale+down, 
>>> last acting [9]
>>> pg 10.915 is stuck inactive for 60266.215231, current state activating, 
>>> last acting [30,38]
>>> pg 11.903 is stuck inactive for 59294.465961, current state activating, 
>>> last acting [11,38]
>>> pg 11.910 is down, acting [21]
>>> pg 11.919 is down, acting [25]
>>> pg 12.902 is stuck inactive for 57118.544590, current state activating, 
>>> last acting [36,14]
>>> pg 13.8f8 is stuck inactive for 60707.167787, current state activating, 
>>> last acting [29,37]
>>> pg 13.901 is stuck stale for 60226.543289, current state 
>>> stale+active+clean, last acting [1,31]
>>> pg 13.905 is stuck inactive for 60266.050940, current state activating, 
>>> last acting [2,36]
>>> pg 13.909 is stuck inactive for 60707.160714, current state activating, 
>>> last acting [34,36]
>>> pg 13.90e is stuck inactive for 60707.410749, current state activating, 
>>> last acting [21,36]
>>> pg 13.911 is down, acting [25]
>>> pg 13.914 is stale+down, acting [29]
>>> pg 13.917 is stuck stale for 580.224688, current state stale+down, last 
>>> acting [16]
>>> pg 14.901 is stuck inactive for 60266.037762, current state 
>

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-04 Thread Kevin Olbrich
PS: Could be http://tracker.ceph.com/issues/36361
There is one HDD OSD that is out (which will not be replaced because
the SSD pool will get the images and the hdd pool will be deleted).

Kevin

Am Fr., 4. Jan. 2019 um 19:46 Uhr schrieb Kevin Olbrich :
>
> Hi!
>
> I did what you wrote but my MGRs started to crash again:
> root@adminnode:~# ceph -s
>   cluster:
> id: 086d9f80-6249-4594-92d0-e31b6a9c
> health: HEALTH_WARN
> no active mgr
> 105498/6277782 objects misplaced (1.680%)
>
>   services:
> mon: 3 daemons, quorum mon01,mon02,mon03
> mgr: no daemons active
> osd: 44 osds: 43 up, 43 in
>
>   data:
> pools:   4 pools, 1616 pgs
> objects: 1.88M objects, 7.07TiB
> usage:   13.2TiB used, 16.7TiB / 29.9TiB avail
> pgs: 105498/6277782 objects misplaced (1.680%)
>  1606 active+clean
>  8active+remapped+backfill_wait
>  2active+remapped+backfilling
>
>   io:
> client:   5.51MiB/s rd, 3.38MiB/s wr, 33op/s rd, 317op/s wr
> recovery: 60.3MiB/s, 15objects/s
>
>
> MON 1 log:
>-13> 2019-01-04 14:05:04.432186 7fec56a93700  4 mgr ms_dispatch
> active mgrdigest v1
>-12> 2019-01-04 14:05:04.432194 7fec56a93700  4 mgr ms_dispatch mgrdigest 
> v1
>-11> 2019-01-04 14:05:04.822041 7fec434e1700  4 mgr[balancer]
> Optimize plan auto_2019-01-04_14:05:04
>-10> 2019-01-04 14:05:04.822170 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/mode
> -9> 2019-01-04 14:05:04.822231 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/max_misplaced
> -8> 2019-01-04 14:05:04.822268 7fec434e1700  4 ceph_config_get
> max_misplaced not found
> -7> 2019-01-04 14:05:04.822444 7fec434e1700  4 mgr[balancer] Mode
> upmap, max misplaced 0.05
> -6> 2019-01-04 14:05:04.822849 7fec434e1700  4 mgr[balancer] do_upmap
> -5> 2019-01-04 14:05:04.822923 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/upmap_max_iterations
> -4> 2019-01-04 14:05:04.822964 7fec434e1700  4 ceph_config_get
> upmap_max_iterations not found
> -3> 2019-01-04 14:05:04.823013 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/upmap_max_deviation
> -2> 2019-01-04 14:05:04.823048 7fec434e1700  4 ceph_config_get
> upmap_max_deviation not found
> -1> 2019-01-04 14:05:04.823265 7fec434e1700  4 mgr[balancer] pools
> ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec']
>  0> 2019-01-04 14:05:04.836124 7fec434e1700 -1
> /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int
> OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04
> 14:05:04.832885
> /build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)
>
>  ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x558c3c0bb572]
>  2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set std::less, std::allocator > const&,
> OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
>  3: (()+0x2f3020) [0x558c3bf5d020]
>  4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
>  5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
>  6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
>  7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
>  8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
>  9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
>  10: (()+0x13e370) [0x7fec5e8be370]
>  11: (PyObject_Call()+0x43) [0x7fec5e891273]
>  12: (()+0x1853ac) [0x7fec5e9053ac]
>  13: (PyObject_Call()+0x43) [0x7fec5e891273]
>  14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
>  15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
>  16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
>  17: (()+0x76ba) [0x7fec5d74c6ba]
>  18: (clone()+0x6d) [0x7fec5c7b841d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_mirror
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>1/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>   

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-04 Thread Kevin Olbrich
3c07a5b4]
 2: (()+0x11390) [0x7fec5d756390]
 3: (gsignal()+0x38) [0x7fec5c6e6428]
 4: (abort()+0x16a) [0x7fec5c6e802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x558c3c0bb6fe]
 6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&,
OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
 7: (()+0x2f3020) [0x558c3bf5d020]
 8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
 11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 14: (()+0x13e370) [0x7fec5e8be370]
 15: (PyObject_Call()+0x43) [0x7fec5e891273]
 16: (()+0x1853ac) [0x7fec5e9053ac]
 17: (PyObject_Call()+0x43) [0x7fec5e891273]
 18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
 19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
 21: (()+0x76ba) [0x7fec5d74c6ba]
 22: (clone()+0x6d) [0x7fec5c7b841d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- begin dump of recent events ---
 0> 2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal
(Aborted) **
 in thread 7fec434e1700 thread_name:balancer

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)
 1: (()+0x4105b4) [0x558c3c07a5b4]
 2: (()+0x11390) [0x7fec5d756390]
 3: (gsignal()+0x38) [0x7fec5c6e6428]
 4: (abort()+0x16a) [0x7fec5c6e802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x558c3c0bb6fe]
 6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set, std::allocator > const&,
OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
 7: (()+0x2f3020) [0x558c3bf5d020]
 8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
 11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 14: (()+0x13e370) [0x7fec5e8be370]
 15: (PyObject_Call()+0x43) [0x7fec5e891273]
 16: (()+0x1853ac) [0x7fec5e9053ac]
 17: (PyObject_Call()+0x43) [0x7fec5e891273]
 18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
 19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
 20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
 21: (()+0x76ba) [0x7fec5d74c6ba]
 22: (clone()+0x6d) [0x7fec5c7b841d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log
--- end dump of recent events ---



Kevin


Am Mi., 2. Jan. 2019 um 17:35 Uhr schrieb Konstantin Shalygin :
>
> On a medium sized cluster with device-classes, I am experiencing a
> problem with the SSD pool:
>
> root at adminnode:~# ceph osd df | grep ssd
> ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
>  2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
>  3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
>  4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
> 30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
>  6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
>  7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
>  8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
> 31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
> 34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
> 36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
> 37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
> 42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
> 43   ssd 0.87329  1.0  89

[ceph-users] TCP qdisc + congestion control / BBR

2019-01-02 Thread Kevin Olbrich
Hi!

I wonder if changing qdisc and congestion_control (for example fq with
Google BBR) on Ceph servers / clients has positive effects during high
load.
Google BBR: 
https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

I am running a lot of VMs with BBR but the hypervisors run fq_codel +
cubic (OSDs run Ubuntu defaults).
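For reference, this is all I set inside the VMs (needs a kernel that ships
the tcp_bbr module, i.e. 4.9+):

cat <<'EOF' >/etc/sysctl.d/90-bbr.conf
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sysctl --system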

Did someone test qdisc and congestion control settings?

Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage of devices in SSD pool vary very much

2019-01-02 Thread Kevin Olbrich
Hi!

On a medium sized cluster with device-classes, I am experiencing a
problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
 2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
 3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
 4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
 6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
 7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
 8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
43   ssd 0.87329  1.0  894GiB  438GiB  456GiB 49.00 1.10  92
13   ssd 0.43700  1.0  447GiB  249GiB  198GiB 55.78 1.25  72
14   ssd 0.43700  1.0  447GiB  290GiB  158GiB 64.76 1.46  71
15   ssd 0.43700  1.0  447GiB  368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700  1.0  447GiB  253GiB  194GiB 56.66 1.27  70
19   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.21 1.35  70
20   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.81 1.57  77
21   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.77 1.57  77
22   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.10 1.35  67
38   ssd 0.43660  1.0  447GiB  153GiB  295GiB 34.11 0.77  46
39   ssd 0.43660  1.0  447GiB  127GiB  320GiB 28.37 0.64  38
40   ssd 0.87329  1.0  894GiB  386GiB  508GiB 43.17 0.97  97
41   ssd 0.87329  1.0  894GiB  375GiB  520GiB 41.88 0.94 113

This leads to just 1.2TB free space (some GBs away from NEAR_FULL pool).
Currently, the balancer plugin is off because it immediately crashed
the MGR in the past (on 12.2.5).
Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I
am unable to find the bugtracker ID]

Would the balancer plugin correct this situation?
What happens if all MGRs die like they did on 12.2.5 because of the plugin?
Will the balancer take data from the most-unbalanced OSDs first?
Otherwise an OSD may fill up beyond FULL, which would cause the
whole pool to freeze (because the smallest OSD is taken into account
for the free space calculation).
This would be the worst case, as over 100 VMs would freeze, causing a lot
of trouble. This is also the reason I did not try to enable the
balancer again.
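
If I do try it again, my plan would be a manual dry run first, so nothing
moves before I can review it (commands from the Luminous balancer module;
please verify against your release):

ceph balancer off
ceph config-key set mgr/balancer/max_misplaced 0.01
ceph balancer mode crush-compat      # or "upmap" if all clients are Luminous+
ceph balancer optimize myplan
ceph balancer show myplan            # review the proposed changes
ceph balancer eval myplan            # compare the score against the current state
# only then: ceph balancer execute myplan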

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Kevin Olbrich
> > Assuming everything is on LVM including the root filesystem, only moving
> > the boot partition will have to be done outside of LVM.
>
> Since the OP mentioned MS Exchange, I assume the VM is running windows.
> You can do the same LVM-like trick in Windows Server via Disk Manager
> though; add the new ceph RBD disk to the existing data volume as a
> mirror; wait for it to sync, then break the mirror and remove the
> original disk.

Mirrors only work on dynamic disks, which are a pain to revert and
cause lots of problems with backup solutions.
I will keep this in mind as this is still better than shutting down
the whole VM.

@all
Thank you very much for your inputs. I will try some less important
VMs and then start migration of the big one.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Kevin Olbrich
Hi!

Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes
and a 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph Luminous
cluster (which already holds lots of images).
The server has access to both local and cluster storage; I only need
to live-migrate the storage, not the machine.

I have never used live migration as it can cause more issues, and the
VMs that are already migrated had planned downtime.
Taking the VM offline and converting/importing it with qemu-img would take
some hours, but I would like to keep serving clients, even if it is
slower.
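For reference, the offline path I would otherwise use looks roughly like
this (source path and pool name are placeholders from my setup):

qemu-img convert -p -O raw /var/lib/libvirt/images/exchange.qcow2 rbd:rbd_vms_ssd_01/exchange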

The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
BBU). There are two HDDs bound as RAID1 which are constantly under 30%
- 60% load (this goes up to 100% during reboot, updates or login
prime-time).

What happens when either the local compute node or the Ceph cluster
fails (degraded)? Or when the network is unavailable?
Are all writes performed to both locations? Is this fail-safe? Or does
the VM crash in the worst case, which could lead to a dirty shutdown of
the MS-EX DBs?

The node currently has 4GB free RAM and 29GB listed as cache /
available. These numbers should be taken with caution because we have
"tuned" enabled, which causes de-duplication of RAM, and this host runs
about 10 Windows VMs.
During reboots or updates, RAM can get full again.

Maybe I am too cautious about live storage migration, maybe I am not.

What are your experiences or advices?

Thank you very much!

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for debian in Ceph repo

2018-11-15 Thread Kevin Olbrich
I now had the time to test and after installing this package, uploads to
rbd are working perfectly.
Thank you very much for sharing this!
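
For the archives, the working sequence on Debian Stretch was roughly:

apt-get install qemu-block-extra qemu-utils
qemu-img convert -p -O raw /target/test-vm.qcow2 rbd:rbd_vms_ssd_01/test_vm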

Kevin

Am Mi., 7. Nov. 2018 um 15:36 Uhr schrieb Kevin Olbrich :

> Am Mi., 7. Nov. 2018 um 07:40 Uhr schrieb Nicolas Huillard <
> nhuill...@dolomede.fr>:
>
>>
>> > It lists rbd but still fails with the exact same error.
>>
>> I stumbled upon the exact same error, and since there was no answer
>> anywhere, I figured it was a very simple problem: don't forget to
>> install the qemu-block-extra package (Debian stretch) along with qemu-
>> utils which contains the qemu-img command.
>> This command is actually compiled with rbd support (hence the output
>> above), but need this extra package to pull actual support-code and
>> dependencies...
>>
>
> I have not been able to test this yet but this package was indeed missing
> on my system!
> Thank you for this hint!
>
>
>> --
>> Nicolas Huillard
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling write cache on SATA HDDs reduces write latency 7 times

2018-11-13 Thread Kevin Olbrich
I read the whole thread and it looks like the write cache should always be
disabled, as in the worst case the performance stays the same(?).
This is based on this discussion.

I will test some WD4002FYYZ which don't mention "media cache".
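
For the test I will simply toggle the on-disk cache per device and compare
latencies (sdX is a placeholder):

hdparm -W /dev/sdX      # show the current write-cache setting
hdparm -W 0 /dev/sdX    # disable the volatile write cache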

Kevin

Am Di., 13. Nov. 2018 um 09:27 Uhr schrieb Виталий Филиппов <
vita...@yourcmc.ru>:

> This may be the explanation:
>
>
> https://serverfault.com/questions/857271/better-performance-when-hdd-write-cache-is-disabled-hgst-ultrastar-7k6000-and
>
> Other manufacturers may have started to do the same, I suppose.
> --
> With best regards,
> Vitaliy Filippov___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph or Gluster for implementing big NAS

2018-11-12 Thread Kevin Olbrich
Hi Dan,

ZFS without sync would be very much identical to ext2/ext4 without journals
or XFS with barriers disabled.
The ARC cache in ZFS is awesome, but disabling sync on ZFS is a very high
risk (using ext4 with the KVM cache mode "unsafe" would be similar, I think).

Also, ZFS only works as expected with scheduler set to noop as it is
optimized to consume whole, non-shared devices.
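Setting that looks like this (sdX is a placeholder; persist it via a udev
rule or the distro's usual mechanism):

cat /sys/block/sdX/queue/scheduler         # show available/active schedulers
echo noop > /sys/block/sdX/queue/scheduler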

Just my 2 cents ;-)

Kevin


Am Mo., 12. Nov. 2018 um 15:08 Uhr schrieb Dan van der Ster <
d...@vanderster.com>:

> We've done ZFS on RBD in a VM, exported via NFS, for a couple years.
> It's very stable and if your use-case permits you can set zfs
> sync=disabled to get very fast write performance that's tough to beat.
>
> But if you're building something new today and have *only* the NAS
> use-case then it would make better sense to try CephFS first and see
> if it works for you.
>
> -- Dan
>
> On Mon, Nov 12, 2018 at 3:01 PM Kevin Olbrich  wrote:
> >
> > Hi!
> >
> > ZFS won't play nice on ceph. Best would be to mount CephFS directly with
> the ceph-fuse driver on the endpoint.
> > If you definitely want to put a storage gateway between the data and the
> compute nodes, then go with nfs-ganesha which can export CephFS directly
> without local ("proxy") mount.
> >
> > I had such a setup with nfs and switched to mount CephFS directly. If
> using NFS with the same data, you must make sure your HA works well to
> avoid data corruption.
> > With ceph-fuse you directly connect to the cluster, one component less
> that breaks.
> >
> > Kevin
> >
> > Am Mo., 12. Nov. 2018 um 12:44 Uhr schrieb Premysl Kouril <
> premysl.kou...@gmail.com>:
> >>
> >> Hi,
> >>
> >>
> >> We are planning to build NAS solution which will be primarily used via
> NFS and CIFS and workloads ranging from various archival application to
> more “real-time processing”. The NAS will not be used as a block storage
> for virtual machines, so the access really will always be file oriented.
> >>
> >>
> >> We are considering primarily two designs and I’d like to kindly ask for
> any thoughts, views, insights, experiences.
> >>
> >>
> >> Both designs utilize “distributed storage software at some level”. Both
> designs would be built from commodity servers and should scale as we grow.
> Both designs involve virtualization for instantiating "access virtual
> machines" which will be serving the NFS and CIFS protocol - so in this
> sense the access layer is decoupled from the data layer itself.
> >>
> >>
> >> First design is based on a distributed filesystem like Gluster or
> CephFS. We would deploy this software on those commodity servers and mount
> the resultant filesystem on the “access virtual machines” and they would be
> serving the mounted filesystem via NFS/CIFS.
> >>
> >>
> >> Second design is based on distributed block storage using CEPH. So we
> would build distributed block storage on those commodity servers, and then,
> via virtualization (like OpenStack Cinder) we would allocate the block
> storage into the access VM. Inside the access VM we would deploy ZFS which
> would aggregate block storage into a single filesystem. And this filesystem
> would be served via NFS/CIFS from the very same VM.
> >>
> >>
> >> Any advices and insights highly appreciated
> >>
> >>
> >> Cheers,
> >>
> >> Prema
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph or Gluster for implementing big NAS

2018-11-12 Thread Kevin Olbrich
Hi!

ZFS won't play nice on ceph. Best would be to mount CephFS directly with
the ceph-fuse driver on the endpoint.
If you definitely want to put a storage gateway between the data and the
compute nodes, then go with nfs-ganesha which can export CephFS directly
without local ("proxy") mount.
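A minimal export sketch for the CephFS FSAL looks roughly like this (written
from memory, so verify the option names against your nfs-ganesha version):

EXPORT {
    Export_ID = 1;
    Path = /;
    Pseudo = /cephfs;
    Access_Type = RW;
    Protocols = 4;
    Transports = TCP;
    FSAL {
        Name = CEPH;
    }
}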

I had such a setup with nfs and switched to mount CephFS directly. If using
NFS with the same data, you must make sure your HA works well to avoid data
corruption.
With ceph-fuse you connect directly to the cluster: one component less that
can break.
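The mount itself is a one-liner (mon names and client id are placeholders):

ceph-fuse -n client.admin -m mon01,mon02,mon03 /mnt/cephfs
# or persistently via /etc/fstab:
# none  /mnt/cephfs  fuse.ceph  ceph.id=admin,_netdev,defaults  0 0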

Kevin

Am Mo., 12. Nov. 2018 um 12:44 Uhr schrieb Premysl Kouril <
premysl.kou...@gmail.com>:

> Hi,
>
> We are planning to build NAS solution which will be primarily used via NFS
> and CIFS and workloads ranging from various archival application to more
> “real-time processing”. The NAS will not be used as a block storage for
> virtual machines, so the access really will always be file oriented.
>
> We are considering primarily two designs and I’d like to kindly ask for
> any thoughts, views, insights, experiences.
>
> Both designs utilize “distributed storage software at some level”. Both
> designs would be built from commodity servers and should scale as we grow.
> Both designs involve virtualization for instantiating "access virtual
> machines" which will be serving the NFS and CIFS protocol - so in this
> sense the access layer is decoupled from the data layer itself.
>
> First design is based on a distributed filesystem like Gluster or CephFS.
> We would deploy this software on those commodity servers and mount the
> resultant filesystem on the “access virtual machines” and they would be
> serving the mounted filesystem via NFS/CIFS.
>
> Second design is based on distributed block storage using CEPH. So we
> would build distributed block storage on those commodity servers, and then,
> via virtualization (like OpenStack Cinder) we would allocate the block
> storage into the access VM. Inside the access VM we would deploy ZFS which
> would aggregate block storage into a single filesystem. And this filesystem
> would be served via NFS/CIFS from the very same VM.
>
>
> Any advices and insights highly appreciated
>
>
> Cheers,
>
> Prema
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 12.2.9 release

2018-11-07 Thread Kevin Olbrich
Am Mi., 7. Nov. 2018 um 16:40 Uhr schrieb Gregory Farnum :

> On Wed, Nov 7, 2018 at 5:58 AM Simon Ironside 
> wrote:
>
>>
>>
>> On 07/11/2018 10:59, Konstantin Shalygin wrote:
>> >> I wonder if there is any release announcement for ceph 12.2.9 that I
>> missed.
>> >> I just found the new packages on download.ceph.com, is this an
>> official
>> >> release?
>> >
>> > This is because 12.2.9 have a several bugs. You should avoid to use
>> this
>> > release and wait for 12.2.10
>>
>> Argh! What's it doing in the repos then?? I've just upgraded to it!
>> What are the bugs? Is there a thread about them?
>
>
> If you’ve already upgraded and have no issues then you won’t have any
> trouble going forward — except perhaps on the next upgrade, if you do it
> while the cluster is unhealthy.
>
> I agree that it’s annoying when these issues make it out. We’ve had
> ongoing discussions to try and improve the release process so it’s less
> drawn-out and to prevent these upgrade issues from making it through
> testing, but nobody has resolved it yet. If anybody has experience working
> with deb repositories and handling releases, the Ceph upstream could use
> some help... ;)
> -Greg
>
>>
>>
We solve this problem by hosting two repos: one for staging and QA and one
for production.
Every release goes to staging first (for example, directly after building
an SCM tag).

If QA passes, the staging repo is promoted to the production one.
Using symlinks, it would be possible to switch back if problems occur.
Example: https://incoming.debian.org/

Currently I would be unable to deploy new nodes if I used the official
mirrors, as apt is unable to install older versions (which does work with
yum/dnf).
That's why we are implementing "mirror-sync" / rsync with a copy of the repo
and the desired packages until such a solution is available.
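Roughly like this (mirror host and paths are placeholders; check that the
mirror you pick actually offers rsync):

rsync -avz --delete rsync://<mirror>/ceph/debian-luminous/ /srv/mirror/ceph/debian-luminous-12.2.8/
# serve /srv/mirror via any webserver and point sources.list at the snapshot:
# deb http://repo.example.com/ceph/debian-luminous-12.2.8 stretch main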

Kevin


>> Simon
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for debian in Ceph repo

2018-11-07 Thread Kevin Olbrich
Am Mi., 7. Nov. 2018 um 07:40 Uhr schrieb Nicolas Huillard <
nhuill...@dolomede.fr>:

>
> > It lists rbd but still fails with the exact same error.
>
> I stumbled upon the exact same error, and since there was no answer
> anywhere, I figured it was a very simple problem: don't forget to
> install the qemu-block-extra package (Debian stretch) along with qemu-
> utils which contains the qemu-img command.
> This command is actually compiled with rbd support (hence the output
> above), but need this extra package to pull actual support-code and
> dependencies...
>

I have not been able to test this yet but this package was indeed missing
on my system!
Thank you for this hint!


> --
> Nicolas Huillard
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy osd creation failed with multipath and dmcrypt

2018-11-06 Thread Kevin Olbrich
I met the same problem. I had to create a GPT table on each disk, create the
first partition over the full space and then feed these partitions to
ceph-volume (should be similar for ceph-deploy).
Also, I am not sure you can combine fs-type btrfs with bluestore (afaik that
option is for filestore).
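
What worked for me looked roughly like this (device name taken from your
mail; how the partition of a multipath device is named, e.g. mpathr1 vs.
mpathr-part1, depends on your setup):

sgdisk --zap-all /dev/mapper/mpathr
sgdisk -n 1:0:0 /dev/mapper/mpathr
ceph-volume lvm create --bluestore --dmcrypt --data /dev/mapper/mpathr1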

Kevin


Am Di., 6. Nov. 2018 um 14:41 Uhr schrieb Pavan, Krish <
krish.pa...@nuance.com>:

> Trying to created OSD with multipath with dmcrypt and it failed . Any
> suggestion please?.
>
> ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr
> --bluestore --dmcrypt  -- failed
>
> ceph-deploy --overwrite-conf osd create ceph-store1:/dev/mapper/mpathr
> --bluestore – worked
>
>
>
> the logs for fail
>
> [ceph-store12][WARNIN] command: Running command: /usr/sbin/restorecon -R
> /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp
>
> [ceph-store12][WARNIN] command: Running command: /usr/bin/chown -R
> ceph:ceph
> /var/lib/ceph/osd-lockbox/e15f1adc-feff-4890-a617-adc473e7331e/magic.68428.tmp
>
> [ceph-store12][WARNIN] Traceback (most recent call last):
>
> [ceph-store12][WARNIN]   File "/usr/sbin/ceph-disk", line 9, in 
>
> [ceph-store12][WARNIN] load_entry_point('ceph-disk==1.0.0',
> 'console_scripts', 'ceph-disk')()
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5736, in run
>
> [ceph-store12][WARNIN] main(sys.argv[1:])
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5687, in main
>
> [ceph-store12][WARNIN] args.func(args)
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2108, in main
>
> [ceph-store12][WARNIN] Prepare.factory(args).prepare()
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2097, in prepare
>
> [ceph-store12][WARNIN] self._prepare()
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2171, in _prepare
>
> [ceph-store12][WARNIN] self.lockbox.prepare()
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2861, in prepare
>
> [ceph-store12][WARNIN] self.populate()
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2818, in populate
>
> [ceph-store12][WARNIN] get_partition_base(self.partition.get_dev()),
>
> [ceph-store12][WARNIN]   File
> "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 844, in
> get_partition_base
>
> [ceph-store12][WARNIN] raise Error('not a partition', dev)
>
> [ceph-store12][WARNIN] ceph_disk.main.Error: Error: not a partition:
> /dev/dm-215
>
> [ceph-store12][ERROR ] RuntimeError: command returned non-zero exit
> status: 1
>
> [ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-disk
> -v prepare --dmcrypt --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore
> --cluster ceph --fs-type btrfs -- /dev/mapper/mpathr
>
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for debian in Ceph repo

2018-10-30 Thread Kevin Olbrich
Hi!

Proxmox has support for rbd as they ship additional packages as well as
ceph via their own repo.

I ran your command and got this:

> qemu-img version 2.8.1(Debian 1:2.8+dfsg-6+deb9u4)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
> Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp
> ftps gluster host_cdrom host_device http https iscsi iser luks nbd nfs
> null-aio null-co parallels qcow qcow2 qed quorum raw rbd replication
> sheepdog ssh vdi vhdx vmdk vpc vvfat


It lists rbd but still fails with the exact same error.

Kevin


Am Di., 30. Okt. 2018 um 17:14 Uhr schrieb David Turner <
drakonst...@gmail.com>:

> What version of qemu-img are you using?  I found [1] this when poking
> around on my qemu server when checking for rbd support.  This version (note
> it's proxmox) has rbd listed as a supported format.
>
> [1]
> # qemu-img -V; qemu-img --help|grep rbd
> qemu-img version 2.11.2pve-qemu-kvm_2.11.2-1
> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
> Supported formats: blkdebug blkreplay blkverify bochs cloop dmg file ftp
> ftps gluster host_cdrom host_device http https iscsi iser luks nbd null-aio
> null-co parallels qcow qcow2 qed quorum raw rbd replication sheepdog
> throttle vdi vhdx vmdk vpc vvfat zeroinit
> On Tue, Oct 30, 2018 at 12:08 PM Kevin Olbrich  wrote:
>
>> Is it possible to use qemu-img with rbd support on Debian Stretch?
>> I am on Luminous and try to connect my image-buildserver to load images
>> into a ceph pool.
>>
>> root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2
>>> rbd:rbd_vms_ssd_01/test_vm
>>> qemu-img: Unknown protocol 'rbd'
>>
>>
>> Kevin
>>
>> Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan <
>> abhis...@suse.com>:
>>
>>> arad...@tma-0.net writes:
>>>
>>> > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain
>>> packages for
>>> > Debian? I'm not seeing any, but maybe I'm missing something...
>>> >
>>> > I'm seeing ceph-deploy install an older version of ceph on the nodes
>>> (from the
>>> > Debian repo) and then failing when I run "ceph-deploy osd ..." because
>>> ceph-
>>> > volume doesn't exist on the nodes.
>>> >
>>> The newer versions of Ceph (from mimic onwards) requires compiler
>>> toolchains supporting c++17 which we unfortunately do not have for
>>> stretch/jessie yet.
>>>
>>> -
>>> Abhishek
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Packages for debian in Ceph repo

2018-10-30 Thread Kevin Olbrich
Is it possible to use qemu-img with rbd support on Debian Stretch?
I am on Luminous and am trying to connect my image build server to load
images into a ceph pool.

root@buildserver:~# qemu-img convert -p -O raw /target/test-vm.qcow2
> rbd:rbd_vms_ssd_01/test_vm
> qemu-img: Unknown protocol 'rbd'


Kevin

Am Mo., 3. Sep. 2018 um 12:07 Uhr schrieb Abhishek Lekshmanan <
abhis...@suse.com>:

> arad...@tma-0.net writes:
>
> > Can anyone confirm if the Ceph repos for Debian/Ubuntu contain packages
> for
> > Debian? I'm not seeing any, but maybe I'm missing something...
> >
> > I'm seeing ceph-deploy install an older version of ceph on the nodes
> (from the
> > Debian repo) and then failing when I run "ceph-deploy osd ..." because
> ceph-
> > volume doesn't exist on the nodes.
> >
> The newer versions of Ceph (from mimic onwards) requires compiler
> toolchains supporting c++17 which we unfortunately do not have for
> stretch/jessie yet.
>
> -
> Abhishek
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Command to check last change to rbd image?

2018-10-28 Thread Kevin Olbrich
Hi!

Is there an easy way to check when an image was last modified?
I want to make sure that the images I want to clean up have not been used
for a long time.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] nfs-ganesha version in Ceph repos

2018-10-09 Thread Kevin Olbrich
I had a similar problem:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029698.html

But even the recent 2.6.x releases were not working well for me (many, many
segfaults). I am on the master branch (2.7.x) and that works well with far
fewer crashes.
Cluster is 13.2.1/.2 with nfs-ganesha as standalone VM.

Kevin


Am Di., 9. Okt. 2018 um 19:39 Uhr schrieb Erik McCormick <
emccorm...@cirrusseven.com>:

> On Tue, Oct 9, 2018 at 1:27 PM Erik McCormick
>  wrote:
> >
> > Hello,
> >
> > I'm trying to set up an nfs-ganesha server with the Ceph FSAL, and
> > running into difficulties getting the current stable release running.
> > The versions in the Luminous repo is stuck at 2.6.1, whereas the
> > current stable version is 2.6.3. I've seen a couple of HA issues in
> > pre 2.6.3 versions that I'd like to avoid.
> >
>
> I should have been more specific that the ones I am looking for are for
> Centos 7
>
> > I've also been attempting to build my own from source, but banging my
> > head against a wall as far as dependencies and config options are
> > concerned.
> >
> > If anyone reading this has the ability to kick off a fresh build of
> > the V2.6-stable branch with all the knobs turned properly for Ceph, or
> > can point me to a set of cmake configs and scripts that might help me
> > do it myself, I would be eternally grateful.
> >
> > Thanks,
> > Erik
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)

2018-10-08 Thread Kevin Olbrich
Hi Jakub,

"ceph osd metadata X" this is perfect! This also lists multipath devices
which I was looking for!
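
For the record, this is what I grep for (field names differ between
filestore and bluestore and between releases):

ceph osd metadata <id> | egrep 'dev|objectstore'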

Kevin


Am Mo., 8. Okt. 2018 um 21:16 Uhr schrieb Jakub Jaszewski <
jaszewski.ja...@gmail.com>:

> Hi Kevin,
> Have you tried ceph osd metadata OSDid ?
>
> Jakub
>
> pon., 8 paź 2018, 19:32 użytkownik Alfredo Deza 
> napisał:
>
>> On Mon, Oct 8, 2018 at 6:09 AM Kevin Olbrich  wrote:
>> >
>> > Hi!
>> >
>> > Yes, thank you. At least on one node this works, the other node just
>> freezes but this might by caused by a bad disk that I try to find.
>>
>> If it is freezing, you could maybe try running the command where it
>> freezes? (ceph-volume will log it to the terminal)
>>
>>
>> >
>> > Kevin
>> >
>> > Am Mo., 8. Okt. 2018 um 12:07 Uhr schrieb Wido den Hollander <
>> w...@42on.com>:
>> >>
>> >> Hi,
>> >>
>> >> $ ceph-volume lvm list
>> >>
>> >> Does that work for you?
>> >>
>> >> Wido
>> >>
>> >> On 10/08/2018 12:01 PM, Kevin Olbrich wrote:
>> >> > Hi!
>> >> >
>> >> > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id?
>> >> > Before I migrated from filestore with simple-mode to bluestore with
>> lvm,
>> >> > I was able to find the raw disk with "df".
>> >> > Now, I need to go from LVM LV to PV to disk every time I need to
>> >> > check/smartctl a disk.
>> >> >
>> >> > Kevin
>> >> >
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)

2018-10-08 Thread Kevin Olbrich
Hi!

Yes, thank you. On at least one node this works; the other node just
freezes, but this might be caused by a bad disk that I am trying to find.

Kevin

Am Mo., 8. Okt. 2018 um 12:07 Uhr schrieb Wido den Hollander :

> Hi,
>
> $ ceph-volume lvm list
>
> Does that work for you?
>
> Wido
>
> On 10/08/2018 12:01 PM, Kevin Olbrich wrote:
> > Hi!
> >
> > Is there an easy way to find raw disks (eg. sdd/sdd1) by OSD id?
> > Before I migrated from filestore with simple-mode to bluestore with lvm,
> > I was able to find the raw disk with "df".
> > Now, I need to go from LVM LV to PV to disk every time I need to
> > check/smartctl a disk.
> >
> > Kevin
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fastest way to find raw device from OSD-ID? (osd -> lvm lv -> lvm pv -> disk)

2018-10-08 Thread Kevin Olbrich
Hi!

Is there an easy way to find raw disks (e.g. sdd/sdd1) by OSD id?
Before I migrated from filestore with simple-mode to bluestore with lvm, I
was able to find the raw disk with "df".
Now, I need to go from LVM LV to PV to disk every time I need to
check/smartctl a disk.
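
Right now the chase looks like this (OSD id and device are just examples):

readlink -f /var/lib/ceph/osd/ceph-<id>/block   # resolves to the LV / dm device
lvs -o lv_name,vg_name,devices                  # shows which PV/partition backs that LV
smartctl -a /dev/sdd                            # finally the raw disk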

Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-08 Thread Kevin Olbrich
nt: (5)
Input/output error
2018-10-08 10:32:17.434 7f6af518e1c0 20 bdev aio_wait 0x55a3a1edb8c0 done
2018-10-08 10:32:17.434 7f6af518e1c0  1 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) close
2018-10-08 10:32:17.434 7f6af518e1c0 10 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _aio_stop
2018-10-08 10:32:17.568 7f6add7d3700 10 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _aio_thread end
2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _discard_stop
2018-10-08 10:32:17.573 7f6adcfd2700 20 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _discard_thread wake
2018-10-08 10:32:17.573 7f6adcfd2700 10 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _discard_thread finish
2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62a80
/var/lib/ceph/osd/ceph-46/block) _discard_stop stopped
2018-10-08 10:32:17.573 7f6af518e1c0  1 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) close
2018-10-08 10:32:17.573 7f6af518e1c0 10 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _aio_stop
2018-10-08 10:32:17.817 7f6ade7d5700 10 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _aio_thread end
2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _discard_stop
2018-10-08 10:32:17.822 7f6addfd4700 20 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _discard_thread wake
2018-10-08 10:32:17.822 7f6addfd4700 10 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _discard_thread finish
2018-10-08 10:32:17.822 7f6af518e1c0 10 bdev(0x55a3a1d62000
/var/lib/ceph/osd/ceph-46/block) _discard_stop stopped
2018-10-08 10:32:17.823 7f6af518e1c0 -1 osd.46 0 OSD:init: unable to mount
object store
2018-10-08 10:32:17.823 7f6af518e1c0 -1  ** ERROR: osd init failed: (5)
Input/output error


Anything interesting here?

I will try to export the down PGs from the disks. I got a bunch of new
disks to replace them all; most of the current disks are the same age.

Kevin
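
For the archive, exporting a down PG with ceph-objectstore-tool looks roughly
like this (the PG id and paths are illustrative, and the OSD must be stopped):

$ systemctl stop ceph-osd@46
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --pgid 2.1as0 --op export --file /root/pg-2.1as0.export
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-50 \
    --op import --file /root/pg-2.1as0.export    # import on another stopped OSD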

Am Mi., 3. Okt. 2018 um 13:52 Uhr schrieb Paul Emmerich <
paul.emmer...@croit.io>:

> There's "ceph-bluestore-tool repair/fsck"
>
> In your scenario, a few more log files would be interesting: try
> setting debug bluefs to 20/20. And if that's not enough log try also
> setting debug osd, debug bluestore, and debug bdev to 20/20.
>
>
>
> Paul
> Am Mi., 3. Okt. 2018 um 13:48 Uhr schrieb Kevin Olbrich :
> >
> > The disks were deployed with ceph-deploy / ceph-volume using the default
> style (lvm) and not simple-mode.
> >
> > The disks were provisioned as a whole, no resizing. I never touched the
> disks after deployment.
> >
> > It is very strange that this first happened after the update, never met
> such an error before.
> >
> > I found a BUG in the tracker, that also shows such an error with count
> 0. That was closed with „can’t reproduce“ (don’t have the link ready). For
> me this seems like the data itself is fine and I just hit a bad transaction
> in the replay (which maybe caused the crash in the first place).
> >
> > I need one of three disks back. Object corruption would not be a problem
> (regarding drop of a journal), as this cluster hosts backups which will
> fail validation and regenerate. Just marking the OSD lost does not seem to
> be an option.
> >
> > Is there some sort of fsck for BlueFS?
> >
> > Kevin
> >
> >
> > Igor Fedotov  schrieb am Mi. 3. Okt. 2018 um 13:01:
> >>
> >> I've seen somewhat similar behavior in a log from Sergey Malinin in
> another thread ("mimic: 3/4 OSDs crashed...")
> >>
> >> He claimed it happened after LVM volume expansion. Isn't this the case
> for you?
> >>
> >> Am I right that you use LVM volumes?
> >>
> >>
> >> On 10/3/2018 11:22 AM, Kevin Olbrich wrote:
> >>
> >> Small addition: the failing disks are in the same host.
> >> This is a two-host, failure-domain OSD cluster.
> >>
> >>
> >> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :
> >>>
> >>> Hi!
> >>>
> >>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went
> down (EC 8+2) together.
> >>> This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
> hours before.
> >>> They failed exactly at the same moment, rendering the cluster unusable
> (CephFS).
> >>> We are using CentOS 7 with latest updates and ceph repo. No cache
> SSDs, no external journal / wal / db.
> >>>
> >>> OSD 29 (no disk failure in dmesg):
> >>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167
> (ceph:ceph)
> >>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
> (02899bfda8141

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
The disks were deployed with ceph-deploy / ceph-volume using the default
style (lvm) and not simple-mode.

The disks were provisioned as a whole, no resizing. I never touched the
disks after deployment.

It is very strange that this first happened right after the update; I have
never seen such an error before.

I found a bug in the tracker that also shows such an error with link count 0.
It was closed as "can't reproduce" (I don't have the link ready). To me this
looks like the data itself is fine and I just hit a bad transaction in the
replay (which maybe caused the crash in the first place).

I need one of the three disks back. Object corruption would not be a problem
(e.g. from dropping a journal), as this cluster hosts backups, which would
fail validation and be regenerated. Just marking the OSD lost does not seem
to be an option.

Is there some sort of fsck for BlueFS?

Kevin
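
For the archive: yes, mimic ships ceph-bluestore-tool with fsck and repair ops.
A rough sketch, run with the OSD stopped and the path adjusted to the OSD in
question:

$ systemctl stop ceph-osd@46
$ ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-46
$ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-46
# for more verbose replay logs, as suggested in this thread, add to ceph.conf [osd]:
#   debug bluefs = 20/20, debug bluestore = 20/20, debug bdev = 20/20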


Igor Fedotov  schrieb am Mi. 3. Okt. 2018 um 13:01:

> I've seen somewhat similar behavior in a log from Sergey Malinin in
> another thread ("mimic: 3/4 OSDs crashed...")
>
> He claimed it happened after LVM volume expansion. Isn't this the case for
> you?
>
> Am I right that you use LVM volumes?
>
> On 10/3/2018 11:22 AM, Kevin Olbrich wrote:
>
> Small addition: the failing disks are in the same host.
> This is a two-host, failure-domain OSD cluster.
>
>
> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :
>
>> Hi!
>>
>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down
>> (EC 8+2) together.
>> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
>> hours before.*
>> They failed exactly at the same moment, rendering the cluster unusable
>> (CephFS).
>> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs,
>> no external journal / wal / db.
>>
>> *OSD 29 (no disk failure in dmesg):*
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
>> ceph-osd, pid 20899
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty
>> --pid-file
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load:
>> isa
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
>> kv_ratio 0.5
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
>> meta 0 kv 1 data 0
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) close
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
>> kv_ratio 0.5
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
>> meta 0 kv 1 data 0
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1
>> path /var/lib/ceph/osd/ceph-29/block size 932 GiB
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
>> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file wi

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
Small addition: the failing disks are in the same host.
This is a two-host, failure-domain OSD cluster.


Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :

> Hi!
>
> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down
> (EC 8+2) together.
> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
> hours before.*
> They failed exactly at the same moment, rendering the cluster unusable
> (CephFS).
> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no
> external journal / wal / db.
>
> *OSD 29 (no disk failure in dmesg):*
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
> ceph-osd, pid 20899
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load: isa
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
> kv_ratio 0.5
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
> meta 0 kv 1 data 0
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) close
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
> kv_ratio 0.5
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
> meta 0 kv 1 data 0
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1
> path /var/lib/ceph/osd/ceph-29/block size 932 GiB
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link
> count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev
> 1 allocated 320 extents
> [1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log:
> (5) Input/output error
> 2018-10-03 09:47:15.538 7fb8835ce1c0  1 stupidalloc 0x0x561250b8d030
> shutdown
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1
> bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (

[ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
 might be failed.

*OSD 47 (same as above, seems not to have died, no dmesg trace):*
2018-10-03 10:02:25.221 7f4d54b611c0  0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 10:02:25.221 7f4d54b611c0  0 ceph version 13.2.2
(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
ceph-osd, pid 8993
2018-10-03 10:02:25.221 7f4d54b611c0  0 pidfile_write: ignore empty
--pid-file
2018-10-03 10:02:25.247 7f4d54b611c0  0 load: jerasure load: lrc load: isa
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev create path
/var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.248 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.249 7f4d54b611c0  1
bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 >
kv_ratio 0.5
2018-10-03 10:02:25.249 7f4d54b611c0  1
bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912
meta 0 kv 1 data 0
2018-10-03 10:02:25.249 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.503 7f4d54b611c0  1
bluestore(/var/lib/ceph/osd/ceph-46) _mount path /var/lib/ceph/osd/ceph-46
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev create path
/var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.504 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0  1
bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes kv_min_ratio 1 >
kv_ratio 0.5
2018-10-03 10:02:25.505 7f4d54b611c0  1
bluestore(/var/lib/ceph/osd/ceph-46) _set_cache_sizes cache_size 536870912
meta 0 kv 1 data 0
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev create path
/var/lib/ceph/osd/ceph-46/block type kernel
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev(0x564072f96a80
/var/lib/ceph/osd/ceph-46/block) open path /var/lib/ceph/osd/ceph-46/block
2018-10-03 10:02:25.505 7f4d54b611c0  1 bdev(0x564072f96a80
/var/lib/ceph/osd/ceph-46/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluefs add_block_device bdev 1 path
/var/lib/ceph/osd/ceph-46/block size 932 GiB
2018-10-03 10:02:25.505 7f4d54b611c0  1 bluefs mount
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs _replay file with link count
0: file(ino 450 size 0x169964c mtime 2018-10-02 12:24:22.602432 bdev 1
allocated 170 extents
[1:0x6fd950+10,1:0x6fd960+10,1:0x6fd970+10,1:0x6fd980+10,1:0x6fd990+10,1:0x6fd9a0+10,1:0x6fd9b0+10,1:0x6fd9c0+10,1:0x6fd9d0+10,1:0x6fd9e0+10,1:0x6fd9f0+10,1:0x6fda00+10,1:0x6fda10+10,1:0x6fda20+10,1:0x6fda30+10,1:0x6fda40+10,1:0x6fda50+10,1:0x6fda60+10,1:0x6fda70+10,1:0x6fda80+10,1:0x6fda90+10,1:0x6fdaa0+10,1:0x6fdab0+10])
2018-10-03 10:02:25.620 7f4d54b611c0 -1 bluefs mount failed to replay log:
(5) Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0  1 stupidalloc 0x0x564073102fc0
shutdown
2018-10-03 10:02:25.620 7f4d54b611c0 -1
bluestore(/var/lib/ceph/osd/ceph-46) _open_db failed bluefs mount: (5)
Input/output error
2018-10-03 10:02:25.620 7f4d54b611c0  1 bdev(0x564072f96a80
/var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:25.763 7f4d54b611c0  1 bdev(0x564072f96000
/var/lib/ceph/osd/ceph-46/block) close
2018-10-03 10:02:26.010 7f4d54b611c0 -1 osd.46 0 OSD:init: unable to mount
object store
2018-10-03 10:02:26.010 7f4d54b611c0 -1  ** ERROR: osd init failed: (5)
Input/output error

We had failing disks in this cluster before but that was easily recovered
by out + rebalance.
For me, it seems like one disk died (there was large I/O on the cluster
when this happened) and took two additional disks with it.
It is very strange that this happened about two hours after the upgrade +
reboot.

*Any recommendations?*
*I have 8 PGs down; the remaining are active and recovering / rebalancing.*

Kind regards
Kevin


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread KEVIN MICHAEL HRPCEK
Hey, don't lose hope. I just went through two 3-5 day outages after a mimic 
upgrade with no data loss. I'd recommend looking through the thread about it to 
see how close it is to your issue. From my point of view there seem to be some 
similarities: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029649.html.

At a similar point of desperation with my cluster I would shut all ceph 
processes down and bring them up in order. Doing this had my cluster almost 
healthy a few times until it fell over again due to mon issues. So solving any 
mon issues is the first priority. It seems like you may also benefit from 
setting mon_osd_cache_size to a very large number if you have enough memory on 
your mon servers.

I'll hop on the irc today.

Kevin
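
For the archive, the ordered cold start described above looks roughly like
this (the flags and per-host ordering are illustrative):

$ ceph osd set noout                # avoid rebalancing while daemons are down
# stop everything, then bring services back one at a time:
$ systemctl start ceph-mon.target   # on each mon host, one at a time, waiting for quorum
$ systemctl start ceph-mgr.target
$ systemctl start ceph-osd.target   # per OSD host, letting PGs peer before the next host
$ ceph osd unset noout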

On 09/25/2018 05:53 PM, by morphin wrote:

I have tried many things with a lot of help on IRC. My pool
health is still in ERROR and I think I can't recover from this.
https://paste.ubuntu.com/p/HbsFnfkYDT/
In the end 2 of 3 mons crashed and restarted at the same time and the pool
went offline. Recovery takes more than 12 hours and is way too slow.
Somehow recovery does not seem to be working.

If I can reach my data I can re-create the pool easily.
If I run the ceph-objectstore-tool procedure to regenerate the mon store.db,
can I access the RBD pool again?
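
For the archive, the mon store rebuild referred to here is the documented
recovery of the mon store from the OSDs; a rough sketch, with paths
illustrative and every OSD visited in turn:

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --op update-mon-db --mon-store-path /tmp/mon-store
$ ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
# then replace the mon's store.db with the rebuilt one and restart that mon
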
On Tue, 25 Sep 2018 at 20:03, by morphin <mailto:morphinwith...@gmail.com>
wrote:



Hi,

Cluster is still down :(

Up to now we have managed to stabilize the OSDs. 118 of 160 OSDs are
stable and the cluster is still in the process of settling. Thanks to
Be-El in the ceph IRC channel, who helped a lot to get the flapping
OSDs stable.

What we have learned so far is that this was caused by the unexpected death
of 2 of the 3 monitor servers. When they come back, if they do not start
one by one (each after the previous one has joined the cluster), this can
happen: the cluster can become unhealthy and take countless hours to recover.

Right now here is our status:
ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
health detail: https://paste.ubuntu.com/p/w4gccnqZjR/

Since the OSD disks are NL-SAS it can take up to 24 hours to get the
cluster back online. What is more, we have been told that we would be
extremely lucky if all the data is rescued.

Unhappily, our strategy is just to sit and wait :(. As soon as the
peering and activating count drops to 300-500 PGs we will restart the
stopped OSDs one by one, waiting for the cluster to settle down after
each one. The amount of data stored in the OSDs is 33TB. Our main
concern is to export our RBD pool data to an external backup space. Then
we will start again with a clean pool.

I hope to validate our analysis with an expert. Any help or advice
would be greatly appreciated.
On Tue, 25 Sep 2018 at 15:08, by morphin <mailto:morphinwith...@gmail.com>
wrote:



Reducing the recovery parameter values did not change much.
There are a lot of OSDs still marked down.

I don't know what to do after this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1


ceph -s
  cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
42 osds down
1 host (6 osds) down
61/8948582 objects unfound (0.001%)
Reduced data availability: 3837 pgs inactive, 1822 pgs
down, 1900 pgs peering, 6 pgs stale
Possible data damage: 18 pgs recovery_unfound
Degraded data redundancy: 457246/17897164 objects degraded
(2.555%), 213 pgs degraded, 209 pgs undersized
2554 slow requests are blocked > 32 sec
3273 slow ops, oldest one blocked for 1453 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
SRV-SEKUARK4
osd: 168 osds: 118 up, 160 in

  data:
pools:   1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage:   33 TiB used, 553 TiB / 586 TiB avail
pgs: 93.677% pgs not active
 457246/17897164 objects degraded (2.555%)
 61/8948582 objects unfound (0.001%)
 1676 down
 1372 peering
 528  stale+peering
 164  active+undersized+degraded
 145  stale+down
 73   activating
 40   active+clean
 29   stale+activating
 17   active+recovery_unfound+undersized+degraded
 16   stale+active+clean
 16   stale+active+undersized+degraded
 9activating+undersized+degraded
 3active+recovery_wait+degraded
 2activating+undersized
 2activating+degraded
 1creating+down
 1stale+active+recovery_unfound+undersized+degraded
 1stale+active+clean+scrubbing+deep
 1 

Re: [ceph-users] Mimic upgrade failure

2018-09-24 Thread KEVIN MICHAEL HRPCEK
The cluster is healthy and stable. I'll leave a summary for the archive in case 
anyone else has a similar problem.

centos 7.5
ceph mimic 13.2.1
3 mon/mgr/mds hosts, 862 osd (41 hosts)

This was all triggered by an unexpected ~1 min network blip on our 10Gbit 
switch. The ceph cluster lost connectivity to each other and obviously tried to 
remap everything once connectivity returned and tons of OSDs were being marked 
down. This was made worse by the OSDs trying to use large amounts of memory 
while recovering and ending up swapping, hanging, and me ipmi resetting hosts.  
All of this caused a lot of osd map changes and the mons will have stored all 
of them without trimming due to the unhealthy PGs. I was able to get almost all 
PGs active and clean on a few occasions but the cluster would fall over again 
after about 2 hours with cephx auth errors or OSDs trying to mark each other 
down (the mons seemed to not be rotating cephx auth keys). Setting 
'osd_heartbeat_interval = 30' helped a bit, but I eventually disabled process 
cephx auth with 'auth_cluster_required = none'. Setting that stopped the OSDs 
from falling over after 2 hours. From the beginning of this the MONs were 
running 100% on the ms_dispatch thread and constantly reelecting a leader every 
minute and not holding a consistent quorum with paxos lease_timeouts in the 
logs. The ms_dispatch was reading through the 
/var/lib/ceph/mon/mon-$hostname/store.db/*.sst constantly and strace showed 
this taking anywhere from 60 seconds to a couple minutes. This was almost all 
cpu user time and not much iowait. I think what was happening is that the mons 
failed health checks due to spending so much time constantly reading through 
the db and that held up other mon tasks which caused constant reelections.

We eventually reduced the MON reelections by finding the average ms_dispatch 
sst read time on the rank 0 mon took 65 seconds and setting 'mon_lease = 75' so 
that the paxos lease would last longer than ms_dispatch running 100%.  I also 
greatly increased the rocksdb_cache_size and leveldb_cache_size on the mons to 
be big enough to cache the entire db, but that didn't seem to make much 
difference initially. After working with Sage, he set the mon_osd_cache_size = 
20 (default 10). The huge mon_osd_cache_size let the mons cache all osd 
maps on the first read and the ms_dispatch thread was able to use this cache 
instead of spinning 100% on rereading them every minute. This stopped the 
constant elections because the mon stopped failing health checks and was able 
to complete other tasks. Lastly there were some self inflicted osd corruptions 
from the ipmi resets that needed to be dealt with to get all PGs active+clean, 
and the cephx change was rolled back to operate normally.

Sage, thanks again for your assistance with this.

Kevin

tl;dr Cache as much as you can.
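
For the archive, the mon-side settings described above, collected as a
ceph.conf sketch (the cache sizes are illustrative, not the exact values used):

[mon]
mon lease = 75                    # paxos lease longer than the worst ms_dispatch stall
mon osd cache size = 200000       # default 10; large enough to cache all osdmaps
rocksdb cache size = 4294967296   # illustrative: big enough to hold the whole store.db
leveldb cache size = 4294967296   # illustrative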



On 09/24/2018 09:24 AM, Sage Weil wrote:

Hi Kevin,

Do you have an update on the state of the cluster?

I've opened a ticket http://tracker.ceph.com/issues/36163 to track the
likely root cause we identified, and have a PR open at
https://github.com/ceph/ceph/pull/24247

Thanks!
sage


On Thu, 20 Sep 2018, Sage Weil wrote:


On Thu, 20 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Top results; both were taken with ms_dispatch at 100%. The mon one
changes a lot, so I've included 3 snapshots of those. I'll update
mon_osd_cache_size.

After disabling auth_cluster_required and a cluster reboot I am having
fewer problems keeping OSDs in the cluster since they seem to not be
having auth problems around the 2 hour uptime mark. The mons still have
their problems but 859/861 OSDs are up with 2 crashing. I found a brief
mention on a forum or somewhere that the mons will only trim their
store db when the cluster is healthy. If that's true, do you think it is
likely that once all OSDs are healthy and I unset some no* cluster flags
the mons will be able to trim their db, and the result will be that
ms_dispatch no longer takes too long to churn through the db? Our primary
theory here is that ms_dispatch is taking too long and the mons reach a
timeout and then reelect in a nonstop cycle.



It's the PGs that need to all get healthy (active+clean) before the
osdmaps get trimmed.  Other health warnigns (e.g. about noout being set)
aren't related.



ceph-mon
34.24%34.24%  libpthread-2.17.so[.] pthread_rwlock_rdlock
+   34.00%34.00%  libceph-common.so.0   [.] crush_hash32_3



If this is the -g output you need to hit enter on lines like this to see
the call graph...  Or you can do 'perf record -g -p ' and then 'perf
report --stdio' (or similar) to dump it all to a file, fully expanded.

Thanks!
sage



+5.01% 5.01%  libceph-common.so.0   [.] ceph::decode >, 
std::less, mempool::pool_allocator<(mempool::pool_index_t)15, 
std::pair > >, 
std::_Select1st > > >, std::less > >, 
std::_Select1st > > >, std::less::copy
+0.79% 0.79%  libceph-c

Re: [ceph-users] data-pool option for qemu-img / ec pool

2018-09-23 Thread Kevin Olbrich
Hi Paul,

thanks for the hint, I just checked and it works perfectly.

I found this guide:
https://www.reddit.com/r/ceph/comments/72yc9m/ceph_openstack_with_ec/

That works well with one meta/data setup but not with multiple (like
device-class based pools).

The link above uses client-auth; is there a better way?

Kevin
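
For the archive, the workaround sketched out (pool and image names are
illustrative):

# ceph.conf on the client doing the conversion
[client]
rbd default data pool = rbd_ec_data

$ qemu-img convert -p -f vmdk -O raw guest.vmdk rbd:rbd_meta/guest-disk

New image data objects then land in rbd_ec_data while the image metadata stays
in rbd_meta.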

Am So., 23. Sep. 2018 um 18:08 Uhr schrieb Paul Emmerich
:
>
> The usual trick for clients not supporting this natively is the option
> "rbd_default_data_pool" in ceph.conf which should also work here.
>
>
>   Paul
> Am So., 23. Sep. 2018 um 18:03 Uhr schrieb Kevin Olbrich :
> >
> > Hi!
> >
> > Is it possible to set data-pool for ec-pools on qemu-img?
> > For repl-pools I used "qemu-img convert" to convert from e.g. vmdk to raw 
> > and write to rbd/ceph directly.
> >
> > The rbd utility is able to do this for raw or empty images but without 
> > convert (converting 800G and writing it again would now take at least twice 
> > the time).
> >
> > Do I miss a parameter for qemu-kvm?
> >
> > Kind regards
> > Kevin
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90


[ceph-users] data-pool option for qemu-img / ec pool

2018-09-23 Thread Kevin Olbrich
Hi!

Is it possible to set the data-pool for EC pools with qemu-img?
For replicated pools I used "qemu-img convert" to convert from e.g. vmdk to raw
and write to rbd/ceph directly.

The rbd utility is able to do this for raw or empty images, but without
convert (converting 800G and writing it again would take at least twice
the time).

Am I missing a parameter for qemu-kvm?

Kind regards
Kevin


Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread KEVIN MICHAEL HRPCEK
 
denc_traits >, void> >
+2.02% 2.02%  libceph-common.so.0   [.] 
ceph::buffer::ptr::unused_tail_length
+1.99% 1.99%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy
+1.67% 0.00%  [unknown] [k] 
 1.64% 1.64%  libstdc++.so.6.0.19   [.] 
std::_Rb_tree_insert_and_rebalance
+1.57% 1.57%  libtcmalloc.so.4.4.5  [.] operator new[]
+1.56% 1.56%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
+1.55% 1.55%  libceph-common.so.0   [.] ceph::buffer::ptr::append
 1.53% 1.53%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy@plt
+1.51% 1.51%  [kernel]  [k] rb_insert_color
+1.36% 1.36%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance
+1.27% 1.27%  libceph-common.so.0   [.] ceph::encode >, 
std::less, mempool::pool_allocator<(mempool::pool_index_t)15, 
std::pair > >, 
std::_Select1st > > >, std::less' to see where all of the encoding
activity is coming from?  I see two possibilities (the mon attempts to
cache encoded maps, and the MOSDMap message itself will also reencode
if/when that fails).

Also: mon_osd_cache_size = 10 by default... try making that 500 or
something.

sage



On Wed, 19 Sep 2018, Kevin Hrpcek wrote:



Majority of the clients are luminous with a few kraken stragglers. I looked at
ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as
having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting
luminous features, and majority of the luminous clients are reporting jewel
features. I shut down my compute cluster to get rid of majority of the clients
that are reporting jewel features, and there is still a lot of time spent by
ms_dispatch in ceph::decode >,
std::less, mempool::pool_allocator<(mempool::pool_index_t)15,
std::pair > >
6.67%  libceph-common.so.0   [.] ceph::buffer::ptr::release
5.35%  libceph-common.so.0   [.] std::_Rb_tree > >, std::_Select1st > > >,
std::less, mempoo
5.20%  libceph-common.so.0   [.] ceph::buffer::ptr::append
5.12%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::copy
4.66%  libceph-common.so.0   [.] ceph::buffer::list::append
4.33%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
4.27%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >,
std::less, mempoo
4.18%  libceph-common.so.0   [.] ceph::buffer::list::append
3.10%  libceph-common.so.0   [.] ceph::decode >,
denc_traits >, void> >
2.90%  libceph-common.so.0   [.] ceph::encode >,
std::less, mempool::pool_allocator<(mempool::pool_index_t)15,
std::pair > >
2.56%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
2.50%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
2.39%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
2.33%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::advance
2.21%  libtcmalloc.so.4.4.5  [.]
tcmalloc::CentralFreeList::FetchFromOneSpans
1.97%  libtcmalloc.so.4.4.5  [.]
tcmalloc::CentralFreeList::ReleaseToSpans
1.60%  libceph-common.so.0   [.] crc32_iscsi_00
1.42%  libtcmalloc.so.4.4.5  [.] operator new[]
1.29%  libceph-common.so.0   [.] ceph::buffer::ptr::unused_tail_length
1.28%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::copy_shallow
1.25%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length@plt
1.06%  libceph-common.so.0   [.] ceph::buffer::ptr::end_c_str
1.06%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
0.99%  libc-2.17.so  [.] __memcpy_ssse3_back
0.94%  libc-2.17.so  [.] _IO_default_xsputn
0.89%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::advance@plt
0.87%  libtcmalloc.so.4.4.5  [.]
tcmalloc::ThreadCache::ReleaseToCentralCache
0.76%  libleveldb.so.1.0.7   [.] leveldb::FindFile
0.72%  [vdso][.] __vdso_clock_gettime
0.67%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >,
std::less, mempoo
0.63%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
0.59%  libceph-common.so.0   [.] ceph::buffer::list::iterator::advance
0.52%  libceph-common.so.0   [.]
ceph::buffer::list::iterator::get_current_ptr


perf top ms_dispatch
   11.88%  libceph-common.so.0   [.] ceph::decode >,
std::less, mempool::pool_allocator<(mempool::pool_index_t)15,
std::pair > >
   11.23%  [kernel]  [k] system_call_after_swapgs
9.36%  libceph-common.so.0   [.] crush_hash32_3
6.55%  libceph-common.so.0   [.] crush_choose_indep
4.39%  [kernel]  [k] smp_call_function_many
3.17%  libceph-common.so.0   [.] ceph::buffer::list::append
3.03%  libceph-common.so.0   [.] ceph::buffer::list::append
3.02%  libceph-common.so.0   [.] std::_Rb_tree > >, std::_

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread KEVIN MICHAEL HRPCEK
The mons have a 300GB RAID 1 on 10k SAS drives. The /var lv is 44% full with the 
/var/lib/ceph/mon directory at 6.7GB. When ms_dispatch is running 100% it is 
all user time, with iostat showing 0-2% utilization of the drive. I'm 
considering taking one of a mon's RAID 1 drives and dropping it into a server 
with a better cpu to see if that makes a difference in the time it takes for 
ms_dispatch to do its thing.

OSDs seem to be struggling to update their cephx auth key/ticket ~2hr after a 
cluster reboot. This morning I'm setting auth_cluster_required = none to see if 
that removes this issue until the cluster is stable again.

Kevin

On 09/20/2018 08:13 AM, David Turner wrote:
Out of curiosity, what disks do you have your mons on and how does the disk 
usage, both utilization% and full%, look while this is going on?

On Wed, Sep 19, 2018, 1:57 PM Kevin Hrpcek 
mailto:kevin.hrp...@ssec.wisc.edu>> wrote:
Majority of the clients are luminous with a few kraken stragglers. I looked at 
ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing is reporting as 
having mimic features, all mon,mgr,osd are running 13.2.1 but are reporting 
luminous features, and majority of the luminous clients are reporting jewel 
features. I shut down my compute cluster to get rid of majority of the clients 
that are reporting jewel features, and there is still a lot of time spent by 
ms_dispatch in ceph::decode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
   6.67%  libceph-common.so.0   [.] ceph::buffer::ptr::release
   5.35%  libceph-common.so.0   [.] std::_Rb_tree > >, 
std::_Select1st > > >, std::less, 
mempoo
   5.20%  libceph-common.so.0   [.] ceph::buffer::ptr::append
   5.12%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy
   4.66%  libceph-common.so.0   [.] ceph::buffer::list::append
   4.33%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
   4.27%  ceph-mon  [.] std::_Rb_tree > >, 
std::_Select1st > > >, std::less, 
mempoo
   4.18%  libceph-common.so.0   [.] ceph::buffer::list::append
   3.10%  libceph-common.so.0   [.] ceph::decode >, 
denc_traits >, void> >
   2.90%  libceph-common.so.0   [.] ceph::encode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
   2.56%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
   2.50%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
   2.39%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
   2.33%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance
   2.21%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::FetchFromOneSpans
   1.97%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::ReleaseToSpans
   1.60%  libceph-common.so.0   [.] crc32_iscsi_00
   1.42%  libtcmalloc.so.4.4.5  [.] operator new[]
   1.29%  libceph-common.so.0   [.] ceph::buffer::ptr::unused_tail_length
   1.28%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy_shallow
   1.25%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length@plt
   1.06%  libceph-common.so.0   [.] ceph::buffer::ptr::end_c_str
   1.06%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
   0.99%  libc-2.17.so<http://libc-2.17.so>  [.] __memcpy_ssse3_back
   0.94%  libc-2.17.so<http://libc-2.17.so>  [.] _IO_default_xsputn
   0.89%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance@plt
   0.87%  libtcmalloc.so.4.4.5  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   0.76%  libleveldb.so.1.0.7   [.] leveldb::FindFile
   0.72%  [vdso][.] __vdso_clock_gettime
   0.67%  ceph-mon  [.] std::_Rb_tree > >, 
std::_Select1st > > >, std::less, 
mempoo
   0.63%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
   0.59%  libceph-common.so.0   [.] ceph::buffer::list::iterator::advance
   0.52%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator::get_current_ptr


perf top ms_dispatch
  11.88%  libceph-common.so.0   [.] ceph::decode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
  11.23%  [kernel]  [k] system_call_after_swapgs
   9.36%  libceph-common.so.0   [.] crush_hash32_3
   6.55%  libceph-common.so.0   [.] crush_choose_indep
   4.39%  [kernel]  [k] smp_call_function_many
   3.17%  libceph-common.so.0   [.] ceph::buffer::list::append
   3.03%  libceph-common.so.0   [.] ceph::buffer::list::append
   3.02%  libceph-common.so.0   [.] std::_Rb_tree > >, 
std::_Select1st > > >, std::less, 
mempoo
   2.92%  libceph-common.so.0   [.] ceph::buffer::ptr::release
   2.65%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance
   2.57%  ceph-mon  [.] std::_Rb_tree > >, 
std::_Select1st > > >, std::less, 
mempoo
   2.27%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
   1.99%  libstdc++.so.6.

Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts

2018-09-20 Thread Kevin Olbrich
Thank you very much Paul.

Kevin


Am Do., 20. Sep. 2018 um 15:19 Uhr schrieb Paul Emmerich <
paul.emmer...@croit.io>:

> Hi,
>
> device classes are internally represented as completely independent
> trees/roots; showing them in one tree is just syntactic sugar.
>
> For example, if you have a hierarchy like root --> host1, host2, host3
> --> nvme/ssd/sata OSDs, then you'll actually have 3 trees:
>
> root~ssd -> host1~ssd, host2~ssd ...
> root~sata -> host~sata, ...
>
>
> Paul
>
> 2018-09-20 14:54 GMT+02:00 Kevin Olbrich :
> > Hi!
> >
> > Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host.
> > I also have replication rules to distinguish between HDD and SSD (and
> > failure-domain set to rack) which are mapped to pools.
> >
> > What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where
> > NVMe will be a new device-class based rule)?
> >
> > Will the crush weight be calculated from the OSDs up to the
> failure-domain
> > based on the crush rule?
> > The only crush-weights I know and see are those shown by "ceph osd tree".
> >
> > Kind regards
> > Kevin
> >
> >
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>


Re: [ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts

2018-09-20 Thread Kevin Olbrich
To answer my own question:

ceph osd crush tree --show-shadow

Sorry for the noise...
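
For the archive, the shadow roots and a device-class rule can be inspected and
created roughly like this (rule, pool and class names are illustrative):

$ ceph osd crush tree --show-shadow
$ ceph osd crush rule create-replicated replicated-nvme default rack nvme
$ ceph osd pool set mypool crush_rule replicated-nvme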

Am Do., 20. Sep. 2018 um 14:54 Uhr schrieb Kevin Olbrich :

> Hi!
>
> Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host.
> I also have replication rules to distinguish between HDD and SSD (and
> failure-domain set to rack) which are mapped to pools.
>
> What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where
> NVMe will be a new device-class based rule)?
>
> Will the crush weight be calculated from the OSDs up to the failure-domain
> based on the crush rule?
> The only crush-weights I know and see are those shown by "ceph osd tree".
>
> Kind regards
> Kevin
>


[ceph-users] Crush distribution with heterogeneous device classes and failure domain hosts

2018-09-20 Thread Kevin Olbrich
Hi!

Currently I have a cluster with four hosts and 4x HDDs + 4 SSDs per host.
I also have replication rules to distinguish between HDD and SSD (and
failure-domain set to rack) which are mapped to pools.

What happens if I add a heterogeneous host with 1x SSD and 1x NVMe (where
NVMe will be a new device-class based rule)?

Will the crush weight be calculated from the OSDs up to the failure-domain
based on the crush rule?
The only crush-weights I know and see are those shown by "ceph osd tree".

Kind regards
Kevin


Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread Kevin Hrpcek
Majority of the clients are luminous with a few kraken stragglers. I 
looked at ceph features and 'ceph daemon mon.sephmon1 sessions'. Nothing 
is reporting as having mimic features, all mon,mgr,osd are running 
13.2.1 but are reporting luminous features, and majority of the luminous 
clients are reporting jewel features. I shut down my compute cluster to 
get rid of majority of the clients that are reporting jewel features, 
and there is still a lot of time spent by ms_dispatch in 
ceph::decode

My small mimic test cluster actually shows something similar in its features: 
mon, mgr, mds, and osd all report luminous features yet have 13.2.1 installed, 
so maybe that is normal.


Kevin

On 09/19/2018 09:35 AM, Sage Weil wrote:

It's hard to tell exactly from the below, but it looks to me like there is
still a lot of OSDMap reencoding going on.  Take a look at 'ceph features'
output and see who in the cluster is using pre-luminous features.. I'm
guessing all of the clients?  For any of those sessions, fetching OSDMaps
from the cluster will require reencoding.

If it's all clients (well, non-OSDs), I think we could work around it by
avoiding the reencode entirely (it is only really there for OSDs, which
want a perfect OSDMap copy that will match the monitor's CRC).

sage

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


I set mon lease = 30 yesterday and it had no effect on the quorum elections. To 
give you an idea of how much cpu ms_dispatch is using: since the last mon 
restart about 7.5 hours ago, the ms_dispatch thread has 5h 40m of cpu time. 
Below are 2 snippets from perf top, taken while ms_dispatch was at 100% of a 
core; the first uses the pid of the ceph-mon, the second the pid of the 
ms_dispatch thread. The last thing is a snippet from stracing the ms_dispatch 
pid. It is running through all of the sst files.

perf top ceph-mon
Overhead  Shared Object Symbol
   17.71%  libceph-common.so.0   [.] ceph::decode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
6.67%  libceph-common.so.0   [.] ceph::buffer::ptr::release
5.35%  libceph-common.so.0   [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo
5.20%  libceph-common.so.0   [.] ceph::buffer::ptr::append
5.12%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy
4.66%  libceph-common.so.0   [.] ceph::buffer::list::append
4.33%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
4.27%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo
4.18%  libceph-common.so.0   [.] ceph::buffer::list::append
3.10%  libceph-common.so.0   [.] ceph::decode >, denc_traits >, void> >
2.90%  libceph-common.so.0   [.] ceph::encode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
2.56%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
2.50%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
2.39%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
2.33%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance
2.21%  libtcmalloc.so.4.4.5  [.] 
tcmalloc::CentralFreeList::FetchFromOneSpans
1.97%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::ReleaseToSpans
1.60%  libceph-common.so.0   [.] crc32_iscsi_00
1.42%  libtcmalloc.so.4.4.5  [.] operator new[]
1.29%  libceph-common.so.0   [.] ceph::buffer::ptr::unused_tail_length
1.28%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy_shallow
1.25%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length@plt
1.06%  libceph-common.so.0   [.] ceph::buffer::ptr::end_c_str
1.06%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
0.99%  libc-2.17.so  [.] __memcpy_ssse3_back
0.94%  libc-2.17.so  [.] _IO_default_xsputn
0.89%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::advance@plt
0.87%  libtcmalloc.so.4.4.5  [.] 
tcmalloc::ThreadCache::ReleaseToCentralCache
0.76%  libleveldb.so.1.0.7   [.] leveldb::FindFile
0.72%  [vdso][.] __vdso_clock_gettime
0.67%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >, std::less, mempoo
0.63%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
0.59%  libceph-common.so.0   [.] ceph::buffer::list::iterator::advance
0.52%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator::get_current_ptr


perf top ms_dispatch
   11.88%  libceph-common.so.0   [.] ceph::decode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
   11.23%  [kernel]  [k] system_call_after_swapgs
9.36%  libceph-common.so.0   [.] crush_hash32_3
6.55%  libceph-common.so.0   [.] crush_choose_indep
4.39%  [kernel]  [k] smp_call_function_many
3.17%  libceph-common.so.0   [.] ceph::buffer::list::append
 

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread KEVIN MICHAEL HRPCEK
t;, void> >
   1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
   1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
   1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
   0.85%  ceph-mon  [.] ceph::buffer::ptr::release@plt
   0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
   0.74%  libceph-common.so.0   [.] crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", 
{st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429)  = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", 
{st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429)  = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin


On 09/19/2018 06:50 AM, Sage Weil wrote:

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Sage,

Unfortunately the mon election problem came back yesterday and it makes
it really hard to get a cluster to stay healthy. A brief unexpected
network outage occurred and sent the cluster into a frenzy and when I
had it 95% healthy the mons started their nonstop reelections. In the
previous logs I sent were you able to identify why the mons are
constantly electing? The elections seem to be triggered by the below
paxos message but do you know which lease timeout is being reached or
why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only
the mon and mgr. The mons weren't able to hold their quorum with no osds
running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
time.



This is odd... with no other dameons running I'm not sure what would be
eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
similar) on the machine to see what the process is doing?  You might need
to install ceph-mon-dbg or ceph-debuginfo to get better symbols.



2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
active c 133382665..133383355) lease_timeout -- calling new election



A workaround is probably to increase the lease timeout.  Try setting
mon_lease = 15 (default is 5... could also go higher than 15) in the
ceph.conf for all of the mons.  This is a bit of a band-aid but should
help you keep the mons in quorum until we sort out what is going on.

sage





Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in a mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it coming up on the
failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:


Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought back
closer to defaults. I was setting nodown before but it seems to mask the
problem. While its good to stop the osdmap changes, OSDs would come up, get
marked up, but at some point go down again (but the process is still
running) and still stay up in the map. Then when I'd unset nodown the
cluster would immediately mark 250+ osd down again and i'd be back where I
started.

This morning I went ahead and finished the osd upgrades to mimic to remove
that variabl

Re: [ceph-users] Mimic upgrade failure

2018-09-18 Thread KEVIN MICHAEL HRPCEK
Sage,

Unfortunately the mon election problem came back yesterday and it makes it 
really hard to get a cluster to stay healthy. A brief unexpected network outage 
occurred and sent the cluster into a frenzy and when I had it 95% healthy the 
mons started their nonstop reelections. In the previous logs I sent were you 
able to identify why the mons are constantly electing? The elections seem to be 
triggered by the below paxos message but do you know which lease timeout is 
being reached or why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only the mon 
and mgr. The mons weren't able to hold their quorum with no osds running and 
the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time.

2018-09-19 03:56:21.729 7f4344ec1700  1 mon.sephmon2@1(peon).paxos(paxos active 
c 133382665..133383355) lease_timeout -- calling new election

Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in a mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it coming up on the
failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osd, or maybe I've missed some some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:


Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought back
closer to defaults. I was setting nodown before but it seems to mask the
problem. While its good to stop the osdmap changes, OSDs would come up, get
marked up, but at some point go down again (but the process is still
running) and still stay up in the map. Then when I'd unset nodown the
cluster would immediately mark 250+ osd down again and i'd be back where I
started.

This morning I went ahead and finished the osd upgrades to mimic to remove
that variable. I've looked for networking problems but haven't found any. 2
of the mons are on the same switch. I've also tried combinations of shutting
down a mon to see if a single one was the problem, but they keep electing no
matter the mix of them that are up. Part of it feels like a networking
problem but I haven't been able to find a culprit yet as everything was
working normally before starting the upgrade. Other than the constant mon
elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it
doesn't last long since at some point the OSDs start trying to fail each
other through their heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908
is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] :
osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372
10.1.9.13:6801/275806 is reporting failure:1

I'm working on getting things mostly good again with everything on mimic and
will see if it behaves better.

Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon hea

[ceph-users] (no subject)

2018-09-18 Thread Kevin Olbrich
Hi!

is the compressible hint / incompressible hint supported on qemu+kvm?

http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/

If not, only aggressive would work in this case for rbd, right?

Kind regards
Kevin
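
For the archive: if the client cannot pass hints, compression can also be
forced at the pool level; a rough sketch with an illustrative pool name:

$ ceph osd pool set rbd_data compression_mode aggressive
$ ceph osd pool set rbd_data compression_algorithm snappy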


Re: [ceph-users] Mimic upgrade failure

2018-09-12 Thread Kevin Hrpcek
I couldn't find any sign of a networking issue at the OS or switches. No 
changes have been made in those to get the cluster stable again. I 
looked through a couple of OSD logs and here is a selection of some of the most 
frequent errors they were getting. Maybe something below is more obvious 
to you.


2018-09-09 18:17:33.245 7feb92079700  2 osd.84 991324 ms_handle_refused 
con 0x560e428b9800 session 0x560eb26b0060
2018-09-09 18:17:33.245 7feb9307b700  2 osd.84 991324 ms_handle_refused 
con 0x560ea639f000 session 0x560eb26b0060


2018-09-09 18:18:55.919 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424a3600 for osd.20, reopening
2018-09-09 18:18:55.919 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e447df600 session 0x560e9ec37680
2018-09-09 18:18:55.919 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e427a5600 session 0x560e9ec37680
2018-09-09 18:18:55.935 7feb92079700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e40afcc00 for osd.18, reopening
2018-09-09 18:18:55.935 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e44398c00 session 0x560e6a3a0620
2018-09-09 18:18:55.935 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560e42f4ea00 session 0x560e6a3a0620
2018-09-09 18:18:55.939 7feb9307b700 10 osd.84 991337 heartbeat_reset 
failed hb con 0x560e424c1e00 for osd.9, reopening
2018-09-09 18:18:55.940 7feb9307b700  2 osd.84 991337 ms_handle_refused 
con 0x560ea4d09600 session 0x560e115e8120
2018-09-09 18:18:55.940 7feb92079700  2 osd.84 991337 ms_handle_refused 
con 0x560e424a3600 session 0x560e115e8120
2018-09-09 18:18:55.956 7febadf54700 20 osd.84 991337 share_map_peer 
0x560e411ca600 already has epoch 991337


2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  new session 
0x560e40b5ce00 con=0x560e42471800 addr=10.1.9.13:6836/2276068
2018-09-09 18:24:59.595 7febae755700 10 osd.84 991362  session 
0x560e40b5ce00 osd.376 has caps osdcap[grant(*)] 'allow *'
2018-09-09 18:24:59.596 7feb9407d700  2 osd.84 991362 ms_handle_reset 
con 0x560e42471800 session 0x560e40b5ce00
2018-09-09 18:24:59.606 7feb9407d700  2 osd.84 991362 ms_handle_refused 
con 0x560e42d04600 session 0x560e10dfd000
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 
OSD::ms_get_authorizer type=osd
2018-09-09 18:24:59.633 7febad753700 10 osd.84 991362 ms_get_authorizer 
bailing, we are shutting down
2018-09-09 18:24:59.633 7febad753700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e42326a00 :-1 
s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=18630 cs=1 
l=0).handle_connect_reply connect got BADAUTHORIZER


2018-09-09 18:22:56.434 7febadf54700  0 cephx: verify_authorizer could 
not decrypt ticket info: error: bad magic in decode_decrypt, 
3995972256093848467 != 18374858748799134293


2018-09-09 18:22:56.434 7febadf54700  0 -- 10.1.9.9:6848/4287624 >> 
10.1.9.12:6801/2269104 conn(0x560e41fad600 :6848 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 
l=0).handle_connect_msg: got bad authorizer


2018-09-10 03:30:17.324 7ff0ab678700 -1 osd.84 992286 heartbeat_check: 
no reply from 10.1.9.28:6843 osd.578 since back 2018-09-10 
03:15:35.358240 front 2018-09-10 03:15:47.879015 (cutoff 2018-09-10 
03:29:17.326329)


Kevin


On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with
on the failed ping connection.  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).
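
A quick way to do that check on one of the reporting OSDs would be something
like this (a sketch; the OSD id and log path are only examples):

grep -E 'heartbeat_check|ms_handle_refused|BADAUTHORIZER|Connection refused' \
    /var/log/ceph/ceph-osd.49.log | tail -50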

I'd also confirm that with nodown set the mon quorum stabilizes...

sage
  




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:


Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 OSDs, or maybe I've missed some tuning as my cluster has
scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:

Nothing too crazy for non-default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought

[ceph-users] nfs-ganesha FSAL CephFS: nfs_health :DBUS :WARN :Health status is unhealthy

2018-09-10 Thread Kevin Olbrich
Hi!

Today one of our nfs-ganesha gateways experienced an outage and now crashes
every time the client behind it tries to access the data.
This is a Ceph Mimic cluster with nfs-ganesha from the Ceph repos:

nfs-ganesha-2.6.2-0.1.el7.x86_64
nfs-ganesha-ceph-2.6.2-0.1.el7.x86_64

There were fixes for this problem in 2.6.3:
https://github.com/nfs-ganesha/nfs-ganesha/issues/339

Can the build in the repos be compiled against this bugfix release?

Thank you very much.

Kind regards
Kevin


Re: [ceph-users] Mimic upgrade failure

2018-09-10 Thread Kevin Hrpcek

Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a 
fluctuating state of up and down. The cluster did start to normalize a 
lot easier after everything was on mimic since the random mass OSD 
heartbeat failures stopped and the constant mon election problem went 
away. I'm still battling with the cluster reacting poorly to host 
reboots or small map changes, but I feel like my current pg:osd ratio 
may be playing a factor in that since we are 2x normal pg count while 
migrating data to new EC pools.


I'm not sure of the root cause but it seems like the mix of luminous and 
mimic did not play well together for some reason. Maybe it has to do 
with the scale of my cluster, 871 OSDs, or maybe I've missed some 
tuning as my cluster has scaled to this size.


Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:
Nothing too crazy for non-default settings. Some of those osd settings 
were in place while I was testing recovery speeds and need to be 
brought back closer to defaults. I was setting nodown before but it 
seems to mask the problem. While it's good to stop the osdmap changes, 
OSDs would come up, get marked up, but at some point go down again 
(even though the process is still running) and yet stay up in the map. Then 
when I'd unset nodown the cluster would immediately mark 250+ OSDs down 
again and I'd be back where I started.


This morning I went ahead and finished the osd upgrades to mimic to 
remove that variable. I've looked for networking problems but haven't 
found any. 2 of the mons are on the same switch. I've also tried 
combinations of shutting down a mon to see if a single one was the 
problem, but they keep calling elections no matter which mix of them is up. 
Part of it feels like a networking problem but I haven't been able to 
find a culprit yet, as everything was working normally before starting 
the upgrade. Other than the constant mon elections, yesterday I had 
the cluster 95% healthy 3 or 4 times, but it didn't last long since 
at some point the OSDs started trying to fail each other through their 
heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 
10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] 
: osd.39 10.1.9.2:6802/168438 reported failed by osd.49 
10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 
10.1.9.13:6801/275806 is reporting failure:1


I'm working on getting things mostly good again with everything on 
mimic and will see if it behaves better.


Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 20
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60


On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It 
might be worth utilizing nodown to stop osds from marking each other 
down and finish the upgrade to be able to set the minimum osd version 
to mimic. Stop the osds in a node, manually mark them down, start 
them back up in mimic. Depending on how bad things are, setting pause 
on the cluster to just finish the upgrade faster might not be a bad 
idea either.
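
Spelled out, that per-node sequence is roughly the following (a sketch; the 
OSD ids are examples and the pause step is optional):

ceph osd set nodown
ceph osd set pause                    # optional: quiesce client IO
systemctl stop ceph-osd.target        # on the node being upgraded
for id in 10 11 12; do ceph osd down $id; done    # that node's OSD ids
# upgrade the ceph packages on the node, then:
systemctl start ceph-osd.target
# once every OSD in the cluster is on mimic:
ceph osd require-osd-release mimic
ceph osd unset pause
ceph osd unset nodown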


This should be a simple question, have you confirmed that there are 
no networking problems between the MONs while the elections are 
happening?


On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <mailto:kevin.hrp...@ssec.wisc.edu>> wrote:


Hey Sage,

I've posted the file with my email address for the user. It is
with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
mons are calling for elections about every minute so I let this
run for a few elections and saw this node become the leader a
couple times. Debug logs start around 23:27:30. I had managed to
get about 850/857 osds up, but it seems that within the last 30
min it has all gone bad again due to the OSDs repor

Re: [ceph-users] Mimic upgrade failure

2018-09-09 Thread Kevin Hrpcek
Nothing too crazy for non-default settings. Some of those osd settings 
were in place while I was testing recovery speeds and need to be brought 
back closer to defaults. I was setting nodown before but it seems to 
mask the problem. While it's good to stop the osdmap changes, OSDs would 
come up, get marked up, but at some point go down again (even though the 
process is still running) and yet stay up in the map. Then when I'd unset 
nodown the cluster would immediately mark 250+ OSDs down again and I'd be 
back where I started.


This morning I went ahead and finished the osd upgrades to mimic to 
remove that variable. I've looked for networking problems but haven't 
found any. 2 of the mons are on the same switch. I've also tried 
combinations of shutting down a mon to see if a single one was the 
problem, but they keep calling elections no matter which mix of them is up. 
Part of it feels like a networking problem but I haven't been able to 
find a culprit yet, as everything was working normally before starting 
the upgrade. Other than the constant mon elections, yesterday I had the 
cluster 95% healthy 3 or 4 times, but it didn't last long since at some 
point the OSDs started trying to fail each other through their heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 
10.1.9.3:6884/317908 is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] : 
osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd 
e991282 prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372 
10.1.9.13:6801/275806 is reporting failure:1


I'm working on getting things mostly good again with everything on mimic 
and will see if it behaves better.


Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health preluminous compat warning = false
osd heartbeat grace = 60
rocksdb cache size = 1342177280

[mds]
mds log max segments = 100
mds log max expiring = 40
mds bal fragment size max = 20
mds cache memory limit = 4294967296

[osd]
osd mkfs options xfs = -i size=2048 -d su=512k,sw=1
osd recovery delay start = 30
osd recovery max active = 5
osd max backfills = 3
osd recovery threads = 2
osd crush initial weight = 0
osd heartbeat interval = 30
osd heartbeat grace = 60


On 09/08/2018 11:24 PM, David Turner wrote:
What osd/mon/etc config settings do you have that are not default? It 
might be worth utilizing nodown to stop osds from marking each other 
down and finish the upgrade to be able to set the minimum osd version 
to mimic. Stop the osds in a node, manually mark them down, start them 
back up in mimic. Depending on how bad things are, setting pause on 
the cluster to just finish the upgrade faster might not be a bad idea 
either.


This should be a simple question, have you confirmed that there are no 
networking problems between the MONs while the elections are happening?


On Sat, Sep 8, 2018, 7:52 PM Kevin Hrpcek <mailto:kevin.hrp...@ssec.wisc.edu>> wrote:


Hey Sage,

I've posted the file with my email address for the user. It is
with debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The
mons are calling for elections about every minute so I let this
run for a few elections and saw this node become the leader a
couple times. Debug logs start around 23:27:30. I had managed to
get about 850/857 osds up, but it seems that within the last 30
min it has all gone bad again due to the OSDs reporting each other
as failed. We relaxed the osd_heartbeat_interval to 30 and
osd_heartbeat_grace to 60 in an attempt to slow down how quickly
OSDs are trying to fail each other. I'll put in the
rocksdb_cache_size setting.

Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:

    Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle.  Can you ceph-post-file the full log from one of your mons
(preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

  rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:


Hello,

I've had a Lumin

Re: [ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek

Hey Sage,

I've posted the file with my email address for the user. It is with 
debug_mon 20/20, debug_paxos 20/20, and debug ms 1/5. The mons are 
calling for elections about every minute so I let this run for a few 
elections and saw this node become the leader a couple times. Debug logs 
start around 23:27:30. I had managed to get about 850/857 osds up, but 
it seems that within the last 30 min it has all gone bad again due to 
the OSDs reporting each other as failed. We relaxed the 
osd_heartbeat_interval to 30 and osd_heartbeat_grace to 60 in an attempt 
to slow down how quickly OSDs are trying to fail each other. I'll put in 
the rocksdb_cache_size setting.
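
For the archive, those settings can be applied on the mon host via the admin 
socket, which still works while the cluster is flapping (a sketch; the mon 
name and log path are examples):

ceph daemon mon.sephmon2 config set debug_mon 20/20
ceph daemon mon.sephmon2 config set debug_paxos 20/20
ceph daemon mon.sephmon2 config set debug_ms 1/5
# then upload the resulting log for the devs
ceph-post-file /var/log/ceph/ceph-mon.sephmon2.log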


Thanks for taking a look.

Kevin

On 09/08/2018 06:04 PM, Sage Weil wrote:

Hi Kevin,

I can't think of any major luminous->mimic changes off the top of my head
that would impact CPU usage, but it's always possible there is something
subtle.  Can you ceph-post-file the full log from one of your mons
(preferably the leader)?

You might try adjusting the rocksdb cache size.. try setting

  rocksdb_cache_size = 1342177280   # 10x the default, ~1.3 GB

on the mons and restarting?

Thanks!
sage

On Sat, 8 Sep 2018, Kevin Hrpcek wrote:


Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is stuck
with almost all pgs down. One problem is that the mons have started to
re-elect a new quorum leader almost every minute. This is making it difficult
to monitor the cluster and even run any commands on it since at least half the
time a ceph command times out or takes over a minute to return results. I've
looked at the debug logs and it appears there is some timeout occurring with
paxos of about a minute. The msg_dispatch thread of the mons is often running
a core at 100% for about a minute (user time, no iowait). Running strace on it
shows the process is going through all of the mon db files (about 6gb in
store.db/*.sst). Does anyone have an idea of what this timeout is or why my
mons are always reelecting? One theory I have is that the msg_dispatch can't
process the SSTs fast enough and hits some timeout for a health check and the
mon drops itself from the quorum since it thinks it isn't healthy. I've been
thinking of introducing a new mon to the cluster on hardware with a better cpu
to see if that can process the SSTs within this timeout.

My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, and 11/41 osd
servers on luminous. The original problem started when I restarted the osds on
one of the hosts. The cluster reacted poorly to them going down and went into
a frenzy of taking down other osds and remapping. I eventually got that stable
and the PGs were 90% good with the finish line in sight and then the mons
started their issue of reelecting every minute. Now I can't keep any decent
amount of PGs up for more than a few hours. This started on Wednesday.

Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds e14242
maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds e14242
tick: resetting beacon timeouts due to mon delay (slow election?) of 59.8106s
seconds
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim trim_to
13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth v120657
encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700  5 mon.sephmon2@1(leader).paxos(paxos
updating c 132917556..132918214) queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).paxos(paxos
updating c 132917556..132918214) trigger_propose not active, will propose
later
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr e2234 tick:
resetting beacon timeouts due to mon delay (slow election?) of 59.8844s
seconds
2018-09-07 20:08:08.655 7f57b92cd700 10
mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 1734
would only trim 221 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health
check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> -
conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0
l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17
ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health
check_m

[ceph-users] Mimic upgrade failure

2018-09-08 Thread Kevin Hrpcek

Hello,

I've had a Luminous -> Mimic upgrade go very poorly and my cluster is 
stuck with almost all pgs down. One problem is that the mons have 
started to re-elect a new quorum leader almost every minute. This is 
making it difficult to monitor the cluster and even run any commands on 
it since at least half the time a ceph command times out or takes over a 
minute to return results. I've looked at the debug logs and it appears 
there is some timeout occurring with paxos of about a minute. The 
msg_dispatch thread of the mons is often running a core at 100% for 
about a minute (user time, no iowait). Running strace on it shows the 
process is going through all of the mon db files (about 6gb in 
store.db/*.sst). Does anyone have an idea of what this timeout is or why 
my mons are always reelecting? One theory I have is that the 
msg_dispatch can't process the SSTs fast enough and hits some timeout 
for a health check and the mon drops itself from the quorum since it 
thinks it isn't healthy. I've been thinking of introducing a new mon to 
the cluster on hardware with a better cpu to see if that can process the 
SSTs within this timeout.
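
The per-thread check described above boils down to something like this (a 
sketch; TID is a placeholder for the busy thread id shown by top):

top -H -p $(pidof ceph-mon)           # find the msg_dispatch thread
strace -c -p "$TID"                   # summarize what that thread is doing
du -sh /var/lib/ceph/mon/*/store.db   # how much store.db it has to walk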


My cluster has the mons,mds,mgr and 30/41 osd servers on mimic, and 
11/41 osd servers on luminous. The original problem started when I 
restarted the osds on one of the hosts. The cluster reacted poorly to 
them going down and went into a frenzy of taking down other osds and 
remapping. I eventually got that stable and the PGs were 90% good with 
the finish line in sight and then the mons started their issue of 
reelecting every minute. Now I can't keep any decent amount of PGs up 
for more than a few hours. This started on Wednesday.


Any help would be greatly appreciated.

Thanks,
Kevin

--Debug snippet from a mon at reelection time
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).mds 
e14242 maybe_resize_cluster in 1 max 1
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mds 
e14242 tick: resetting beacon timeouts due to mon delay (slow election?) 
of 59.8106s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(mdsmap 13504..14242) maybe_trim 
trim_to 13742 would only trim 238 < paxos_service_trim_min 250
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 auth
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 check_rotate updated rotating
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(auth 120594..120657) propose_pending
2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).auth 
v120657 encode_pending v 120658
2018-09-07 20:08:08.655 7f57b92cd700  5 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
queue_pending_finisher 0x55dce8e5b370
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
trigger_propose not active, will propose later
2018-09-07 20:08:08.655 7f57b92cd700  4 mon.sephmon2@1(leader).mgr e2234 
tick: resetting beacon timeouts due to mon delay (slow election?) of 
59.8844s seconds
2018-09-07 20:08:08.655 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(mgr 1513..2234) maybe_trim trim_to 
1734 would only trim 221 < paxos_service_trim_min 250

2018-09-07 20:08:08.655 7f57b92cd700 10 mon.sephmon2@1(leader).health tick
2018-09-07 20:08:08.655 7f57b92cd700 20 mon.sephmon2@1(leader).health 
check_member_health
2018-09-07 20:08:08.657 7f57bcdd0700  1 -- 10.1.9.202:6789/0 >> - 
conn(0x55dcee55be00 :6789 s=STATE_ACCEPTING pgs=0 cs=0 
l=0)._process_connection sd=447 -
2018-09-07 20:08:08.657 7f57bcdd0700 10 mon.sephmon2@1(leader) e17 
ms_verify_authorizer 10.1.9.32:6823/4007 osd protocol 0
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).health 
check_member_health avail 79% total 40 GiB, used 8.4 GiB, avail 32 GiB
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader).health 
check_leader_health
2018-09-07 20:08:08.662 7f57b92cd700 10 
mon.sephmon2@1(leader).paxosservice(health 1534..1720) maybe_trim 
trim_to 1715 would only trim 181 < paxos_service_trim_min 250

2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader).config tick
2018-09-07 20:08:08.662 7f57b92cd700 20 mon.sephmon2@1(leader) e17 
sync_trim_providers
2018-09-07 20:08:08.662 7f57b92cd700 -1 mon.sephmon2@1(leader) e17 
get_health_metrics reporting 1940 slow ops, oldest is osd_failure(failed 
timeout osd.72 10.1.9.9:6800/68904 for 317sec e987498 v987498)
2018-09-07 20:08:08.662 7f57b92cd700  1 
mon.sephmon2@1(leader).paxos(paxos updating c 132917556..132918214) 
accept timeout, calling fresh election

2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 bootstrap
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 
sync_reset_requester
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.sephmon2@1(leader) e17 
unregister_cluster_logger
2018-09-07 20:08:08.662 7f57b92cd700 10 mon.seph

[ceph-users] SPDK/DPDK with Intel P3700 NVMe pool

2018-08-30 Thread Kevin Olbrich
Hi!

During our move from filestore to bluestore, we removed several Intel P3700
NVMe from the nodes.

Is someone running a SPDK/DPDK NVMe-only EC pool? Is it working well?
The docs are very short about the setup:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#spdk-usage

I would like to re-use these cards for high-end (max IO) for database VMs.

Some notes or feedback about the setup (ceph-volume etc.) would be
appreciated.
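
For anyone searching the archive later: per the bluestore doc linked above,
an SPDK-backed OSD is pointed at the NVMe by serial number in ceph.conf,
roughly like this (a sketch; the serial is a placeholder taken from
/sys/block/nvme0n1/device/serial, and the SPDK setup script has to have
claimed the device first):

[osd]
bluestore_block_path = spdk:55cd2e404bd73932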

Thank you.

Kind regards
Kevin


[ceph-users] HDD-only CephFS cluster with EC and without SSD/NVMe

2018-08-22 Thread Kevin Olbrich
Hi!

I am in the progress of moving a local ("large", 24x1TB) ZFS RAIDZ2 to
CephFS.
This storage is used for backup images (large sequential reads and writes).

To save space and have a RAIDZ2 (RAID6) like setup, I am planning the
following profile:

ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   ruleset-failure-domain=rack
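
(Side note: on luminous/mimic I believe that profile key is spelled
crush-failure-domain rather than ruleset-failure-domain.) The rest of the
setup would look roughly like this, a sketch only, with pool names, PG counts
and the mount path as examples; the usual advice is a replicated default data
pool with the EC pool attached to the backup directory via a file layout:

ceph osd pool create cephfs_meta 64 64 replicated
ceph osd pool create cephfs_data 64 64 replicated
ceph osd pool create cephfs_data_ec 512 512 erasure myprofile
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs new backupfs cephfs_meta cephfs_data
ceph fs add_data_pool backupfs cephfs_data_ec
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/backups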

Performance is not the first priority; this is why I do not plan to
outsource WAL/DB (a broken NVMe = broken OSDs is more administrative overhead
than single OSDs).
Disks are attached by SAS multipath, throughput in general is no problem
but I did not test with ceph yet.

Is anyone using CephFS + bluestore + ec 3/2 + without WAL/DB-dev and is it
working well?

Thank you.

Kevin


Re: [ceph-users] Running 12.2.5 without problems, should I upgrade to 12.2.7 or wait for 12.2.8?

2018-08-10 Thread Kevin Olbrich
Am Fr., 10. Aug. 2018 um 19:29 Uhr schrieb :

>
>
> Am 30. Juli 2018 09:51:23 MESZ schrieb Micha Krause :
> >Hi,
>
> Hi Micha,
>
> >
> >I'm Running 12.2.5 and I have no Problems at the moment.
> >
> >However my servers reporting daily that they want to upgrade to 12.2.7,
> >is this save or should I wait for 12.2.8?
> >
> I guess you should upgrade to 12.2.7 as soon as you can, especially when
>

Why? As far as I understood, replicated pools for rbd are out of danger -
.6 and .7 were mostly fixes for the known cases.
We are not planning any upgrade from 12.2.5 at the moment. Please correct me
if I am wrong.

Kevin


> Quote:
> The v12.2.5 release has a potential data corruption issue with erasure
> coded pools. If you ran v12.2.5 with erasure coding, please see below.
>
> See: https://ceph.com/releases/12-2-7-luminous-released/
>
> Hth
> - Mehmet
> >Are there any predictions when the 12.2.8 release will be available?
> >
> >
> >Micha Krause
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] v12.2.7 Luminous released

2018-07-19 Thread Kevin Olbrich
Hi,

on upgrade from 12.2.4 to 12.2.5 the balancer module broke (the mgr crashed
minutes after the service started).
The only solution was to disable the balancer (the service has been running
fine since).

Is this fixed in 12.2.7?
I was unable to locate the bug in bugtracker.

Kevin

2018-07-17 18:28 GMT+02:00 Abhishek Lekshmanan :

>
> This is the seventh bugfix release of Luminous v12.2.x long term
> stable release series. This release contains several fixes for
> regressions in the v12.2.6 and v12.2.5 releases.  We recommend that
> all users upgrade.
>
> *NOTE* The v12.2.6 release has serious known regressions, while 12.2.6
> wasn't formally announced in the mailing lists or blog, the packages
> were built and available on download.ceph.com since last week. If you
> installed this release, please see the upgrade procedure below.
>
> *NOTE* The v12.2.5 release has a potential data corruption issue with
> erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> below.
>
> The full blog post alongwith the complete changelog is published at the
> official ceph blog at https://ceph.com/releases/12-2-7-luminous-released/
>
> Upgrading from v12.2.6
> --
>
> v12.2.6 included an incomplete backport of an optimization for
> BlueStore OSDs that avoids maintaining both the per-object checksum
> and the internal BlueStore checksum.  Due to the accidental omission
> of a critical follow-on patch, v12.2.6 corrupts (fails to update) the
> stored per-object checksum value for some objects.  This can result in
> an EIO error when trying to read those objects.
>
> #. If your cluster uses FileStore only, no special action is required.
>This problem only affects clusters with BlueStore.
>
> #. If your cluster has only BlueStore OSDs (no FileStore), then you
>should enable the following OSD option::
>
>  osd skip data digest = true
>
>This will avoid setting and start ignoring the full-object digests
>whenever the primary for a PG is BlueStore.
>
> #. If you have a mix of BlueStore and FileStore OSDs, then you should
>enable the following OSD option::
>
>  osd distrust data digest = true
>
>This will avoid setting and start ignoring the full-object digests
>in all cases.  This weakens the data integrity checks for
>FileStore (although those checks were always only opportunistic).
>
> If your cluster includes BlueStore OSDs and was affected, deep scrubs
> will generate errors about mismatched CRCs for affected objects.
> Currently the repair operation does not know how to correct them
> (since all replicas do not match the expected checksum it does not
> know how to proceed).  These warnings are harmless in the sense that
> IO is not affected and the replicas are all still in sync.  The number
> of affected objects is likely to drop (possibly to zero) on their own
> over time as those objects are modified.  We expect to include a scrub
> improvement in v12.2.8 to clean up any remaining objects.
>
> Additionally, see the notes below, which apply to both v12.2.5 and v12.2.6.
>
> Upgrading from v12.2.5 or v12.2.6
> -
>
> If you used v12.2.5 or v12.2.6 in combination with erasure coded
> pools, there is a small risk of corruption under certain workloads.
> Specifically, when:
>
> * An erasure coded pool is in use
> * The pool is busy with successful writes
> * The pool is also busy with updates that result in an error result to
>   the librados user.  RGW garbage collection is the most common
>   example of this (it sends delete operations on objects that don't
>   always exist.)
> * Some OSDs are reasonably busy.  One known example of such load is
>   FileStore splitting, although in principle any load on the cluster
>   could also trigger the behavior.
> * One or more OSDs restarts.
>
> This combination can trigger an OSD crash and possibly leave PGs in a state
> where they fail to peer.
>
> Notably, upgrading a cluster involves OSD restarts and as such may
> increase the risk of encountering this bug.  For this reason, for
> clusters with erasure coded pools, we recommend the following upgrade
> procedure to minimize risk:
>
> 1. Install the v12.2.7 packages.
> 2. Temporarily quiesce IO to cluster::
>
>  ceph osd pause
>
> 3. Restart all OSDs and wait for all PGs to become active.
> 4. Resume IO::
>
>  ceph osd unpause
>
> This will cause an availability outage for the duration of the OSD
> restarts.  If this in unacceptable, an *more risky* alternative is to
> disable RGW garbage collection (the primary known cause of these rados
> operations) for the duration of the upgrade::
>
> 1. Set ``rgw_enable_gc_threads = false`` in ceph

Re: [ceph-users] Periodically activating / peering on OSD add

2018-07-14 Thread Kevin Olbrich
PS: It's luminous 12.2.5!


Mit freundlichen Grüßen / best regards,
Kevin Olbrich.

2018-07-14 15:19 GMT+02:00 Kevin Olbrich :

> Hi,
>
> why do I see activating followed by peering during OSD add (refill)?
> I did not change pg(p)_num.
>
> Is this normal? From my other clusters, I don't think that happened...
>
> Kevin
>


[ceph-users] Periodically activating / peering on OSD add

2018-07-14 Thread Kevin Olbrich
Hi,

why do I see activating followed by peering during OSD add (refill)?
I did not change pg(p)_num.

Is this normal? From my other clusters, I don't think that happened...

Kevin


Re: [ceph-users] Bluestore and number of devices

2018-07-13 Thread Kevin Olbrich
You can keep the same layout as before. Most place DB/WAL combined in one
partition (similar to the journal on filestore).
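
With ceph-volume that colocated layout is created roughly like this (a
sketch; the device paths are examples, and when only --block.db is given the
WAL lives inside the DB partition):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1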

Kevin

2018-07-13 12:37 GMT+02:00 Robert Stanford :

>
>  I'm using filestore now, with 4 data devices per journal device.
>
>  I'm confused by this: "BlueStore manages either one, two, or (in certain
> cases) three storage devices."
> (http://docs.ceph.com/docs/luminous/rados/configuration/
> bluestore-config-ref/)
>
>  When I convert my journals to bluestore, will they still be four data
> devices (osds) per journal, or will they each require a dedicated journal
> drive now?
>
>  Regards
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds daemon damaged

2018-07-12 Thread Kevin

Sorry for the long posting but trying to cover everything

I woke up to find my cephfs filesystem down. This was in the logs

2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head


I had one standby MDS, but as far as I can tell it did not fail over. 
This was in the logs


(insufficient standby MDS daemons available)

Currently my ceph looks like this
  cluster:
id: ..
health: HEALTH_ERR
1 filesystem is degraded
1 mds daemon damaged

  services:
mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
mgr: ids27(active)
mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
osd: 5 osds: 5 up, 5 in

  data:
pools:   3 pools, 202 pgs
objects: 1013k objects, 4018 GB
usage:   12085 GB used, 6544 GB / 18630 GB avail
pgs: 201 active+clean
 1   active+clean+scrubbing+deep

  io:
client:   0 B/s rd, 0 op/s rd, 0 op/s wr

I started trying to get the damaged MDS back online

Based on this page 
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts


# cephfs-journal-tool journal export backup.bin
2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is 
unreadable
2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
readable, attempt object-by-object dump with `rados`

Error ((5) Input/output error)

# cephfs-journal-tool event recover_dentries summary
Events by type:
2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
unreadableErrors: 0


cephfs-journal-tool journal reset - (I think this command might have 
worked)


Next up, tried to reset the filesystem

ceph fs reset test-cephfs-1 --yes-i-really-mean-it

Each time same errors

2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: 
MDS_DAMAGE (was: 1 mds daemon damaged)
2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 
assigned to filesystem test-cephfs-1 as rank 0
2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 
0x200: (5) Input/output error
2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds 
daemon damaged (MDS_DAMAGE)
2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 
filesystem is degraded; 1 mds daemon damaged


Tried to 'fail' mds.ds27
# ceph mds fail ds27
# failed mds gid 1929168

Command worked, but each time I run the reset command the same errors 
above appear
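
For completeness, the journal-truncation sequence in the disaster-recovery 
page linked above continues roughly like this (a sketch taken from that doc, 
not something verified here; the filesystem name is as above):

cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph mds repaired test-cephfs-1:0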


Online searches say the object read error has to be removed. But there's 
no object listed. This web page is the closest to the issue

http://tracker.ceph.com/issues/20863

Recommends fixing the error by hand. Tried running a deep scrub on pg 2.4; it 
completes but I still have the same issue above.
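
For reference, the commands for that check (a sketch; repair should only be 
run once the inconsistent copy is understood):

ceph pg deep-scrub 2.4
rados list-inconsistent-obj 2.4 --format=json-pretty
ceph pg repair 2.4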


Final option is to attempt removing mds.ds27. If mds.ds29 was a standby 
and has data it should become live. If it was not

I assume we will lose the filesystem at this point

Why didn't the standby MDS failover?

Just looking for any way to recover the cephfs, thanks!



Re: [ceph-users] PGs stuck peering (looping?) after upgrade to Luminous.

2018-07-11 Thread Kevin Olbrich
Sounds a little bit like the problem I had on OSDs:

[ceph-users] Blocked requests activating+remapped after extending pg(p)_num
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026680.html>
  Burkhard Linke
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026681.html>
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026682.html>
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026683.html>
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026685.html>
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026689.html>
  Paul Emmerich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026692.html>
  Kevin Olbrich
  <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026695.html>

I ended up restarting the OSDs which were stuck in that state and they
immediately fixed themselves.
It should also work to just "out" the problem OSDs and immediately up them
again to fix it.
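
Concretely, for a stuck OSD that is something like (a sketch; osd.12 is an
example id):

systemctl restart ceph-osd@12
# or, without restarting the daemon:
ceph osd out 12 && ceph osd in 12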

- Kevin

2018-07-11 20:30 GMT+02:00 Magnus Grönlund :

> Hi,
>
> Started to upgrade a ceph-cluster from Jewel (10.2.10) to Luminous (12.2.6)
>
> After upgrading and restarting the mons everything looked OK, the mons had
> quorum, all OSDs where up and in and all the PGs where active+clean.
> But before I had time to start upgrading the OSDs it became obvious that
> something had gone terribly wrong.
> All of a sudden 1600 out of 4100 PGs where inactive and 40% of the data
> was misplaced!
>
> The mons appears OK and all OSDs are still up and in, but a few hours
> later there was still 1483 pgs stuck inactive, essentially all of them in
> peering!
> Investigating one of the stuck PGs it appears to be looping between
> “inactive”, “remapped+peering” and “peering” and the epoch number is rising
> fast, see the attached pg query outputs.
>
> We really can’t afford to lose the cluster or the data so any help or
> suggestions on how to debug or fix this issue would be very, very
> appreciated!
>
>
> health: HEALTH_ERR
> 1483 pgs are stuck inactive for more than 60 seconds
> 542 pgs backfill_wait
> 14 pgs backfilling
> 11 pgs degraded
> 1402 pgs peering
> 3 pgs recovery_wait
> 11 pgs stuck degraded
> 1483 pgs stuck inactive
> 2042 pgs stuck unclean
> 7 pgs stuck undersized
> 7 pgs undersized
> 111 requests are blocked > 32 sec
> 10586 requests are blocked > 4096 sec
> recovery 9472/11120724 objects degraded (0.085%)
> recovery 1181567/11120724 objects misplaced (10.625%)
> noout flag(s) set
> mon.eselde02u32 low disk space
>
>   services:
> mon: 3 daemons, quorum eselde02u32,eselde02u33,eselde02u34
> mgr: eselde02u32(active), standbys: eselde02u33, eselde02u34
> osd: 111 osds: 111 up, 111 in; 800 remapped pgs
>  flags noout
>
>   data:
> pools:   18 pools, 4104 pgs
> objects: 3620k objects, 13875 GB
> usage:   42254 GB used, 160 TB / 201 TB avail
> pgs: 1.876% pgs unknown
>  34.259% pgs not active
>  9472/11120724 objects degraded (0.085%)
>  1181567/11120724 objects misplaced (10.625%)
>  2062 active+clean
> 1221 peering
>  535  active+remapped+backfill_wait
>  181  remapped+peering
>  77   unknown
>  13   active+remapped+backfilling
>  7active+undersized+degraded+remapped+backfill_wait
>  4remapped
>  3active+recovery_wait+degraded+remapped
>  1active+degraded+remapped+backfilling
>
>   io:
> recovery: 298 MB/s, 77 objects/s
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] rbd lock remove unable to parse address

2018-07-10 Thread Kevin Olbrich
2018-07-10 14:37 GMT+02:00 Jason Dillaman :

> On Tue, Jul 10, 2018 at 2:37 AM Kevin Olbrich  wrote:
>
>> 2018-07-10 0:35 GMT+02:00 Jason Dillaman :
>>
>>> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
>>> present on the client computer you used? I would have expected the OSD to
>>> determine the client address, so it's odd that it was able to get a
>>> link-local address.
>>>
>>
>> Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is
>> attached to brX which has an ULA-prefix for the ceph cluster.
>> Eth0 has no address itself. In this case this must mean the address has
>> been carried down to the hardware interface.
>>
>> I am wondering why it uses link local when there is an ULA-prefix
>> available.
>>
>> The address is available on brX on this client node.
>>
>
> I'll open a tracker ticker to get that issue fixed, but in the meantime,
> you can run "rados -p  rmxattr rbd_header.
> lock.rbd_lock" to remove the lock.
>

Worked perfectly, thank you very much!


>
>> - Kevin
>>
>>
>>> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich  wrote:
>>>
>>>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>>>
>>>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>>>> addresses starting w/ fe80 were link-local addresses which would probably
>>>>> explain why an interface scope id was appended. The current IPv6 address
>>>>> parser stops reading after it encounters a non hex, colon character [1].
>>>>>
>>>>
>>>> No, this is a compute machine attached to the storage vlan where I
>>>> previously had also local disks.
>>>>
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman 
>>>>> wrote:
>>>>>
>>>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>>>> since it is failing to parse the address as valid. Perhaps it's barfing 
>>>>>> on
>>>>>> the "%eth0" scope id suffix within the address.
>>>>>>
>>>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich  wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>>>> Immediately I stopped the transfer but the image is stuck locked:
>>>>>>>
>>>>>>> Previusly when that happened, I was able to remove the image after
>>>>>>> 30 secs.
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>>>> There is 1 exclusive lock on this image.
>>>>>>> Locker ID  Address
>>>>>>>
>>>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%
>>>>>>> eth0]:0/1200385089
>>>>>>>
>>>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02
>>>>>>> "auto 93921602220416" client.1195723
>>>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to
>>>>>>> blacklist client: (22) Invalid argument
>>>>>>>
>>>>>>> The image is not in use anywhere!
>>>>>>>
>>>>>>> How can I force removal of all locks for this image?
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Kevin
>>>>>>> ___
>>>>>>> ceph-users mailing list
>>>>>>> ceph-users@lists.ceph.com
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jason
>>>>>>
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>>>
>>>>> --
>>>>> Jason
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Jason
>>>
>>
>>
>
> --
> Jason
>


Re: [ceph-users] rbd lock remove unable to parse address

2018-07-10 Thread Kevin Olbrich
2018-07-10 0:35 GMT+02:00 Jason Dillaman :

> Is the link-local address of "fe80::219:99ff:fe9e:3a86%eth0" at least
> present on the client computer you used? I would have expected the OSD to
> determine the client address, so it's odd that it was able to get a
> link-local address.
>

Yes, it is. eth0 is part of bond0 which is a vlan trunk. Bond0.X is
attached to brX which has an ULA-prefix for the ceph cluster.
Eth0 has no address itself. In this case this must mean the address has
been carried down to the hardware interface.

I am wondering why it uses link local when there is an ULA-prefix available.

The address is available on brX on this client node.

- Kevin


> On Mon, Jul 9, 2018 at 3:43 PM Kevin Olbrich  wrote:
>
>> 2018-07-09 21:25 GMT+02:00 Jason Dillaman :
>>
>>> BTW -- are you running Ceph on a one-node computer? I thought IPv6
>>> addresses starting w/ fe80 were link-local addresses which would probably
>>> explain why an interface scope id was appended. The current IPv6 address
>>> parser stops reading after it encounters a non hex, colon character [1].
>>>
>>
>> No, this is a compute machine attached to the storage vlan where I
>> previously had also local disks.
>>
>>
>>>
>>>
>>> On Mon, Jul 9, 2018 at 3:14 PM Jason Dillaman 
>>> wrote:
>>>
>>>> Hmm ... it looks like there is a bug w/ RBD locks and IPv6 addresses
>>>> since it is failing to parse the address as valid. Perhaps it's barfing on
>>>> the "%eth0" scope id suffix within the address.
>>>>
>>>> On Mon, Jul 9, 2018 at 2:47 PM Kevin Olbrich  wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I tried to convert a qcow2 file to rbd and set the wrong pool.
>>>>> Immediately I stopped the transfer but the image is stuck locked:
>>>>>
>>>>> Previously when that happened, I was able to remove the image after 30
>>>>> secs.
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock list fpi_server02
>>>>> There is 1 exclusive lock on this image.
>>>>> Locker ID  Address
>>>>>
>>>>> client.1195723 auto 93921602220416 [fe80::219:99ff:fe9e:3a86%
>>>>> eth0]:0/1200385089
>>>>>
>>>>> [root@vm2003 images1]# rbd -p rbd_vms_hdd lock rm fpi_server02 "auto
>>>>> 93921602220416" client.1195723
>>>>> rbd: releasing lock failed: (22) Invalid argument
>>>>> 2018-07-09 20:45:19.080543 7f6c2c267d40 -1 librados: unable to parse
>>>>> address [fe80::219:99ff:fe9e:3a86%eth0]:0/1200385089
>>>>> 2018-07-09 20:45:19.080555 7f6c2c267d40 -1 librbd: unable to blacklist
>>>>> client: (22) Invalid argument
>>>>>
>>>>> The image is not in use anywhere!
>>>>>
>>>>> How can I force removal of all locks for this image?
>>>>>
>>>>> Kind regards,
>>>>> Kevin
>>>>> ___
>>>>> ceph-users mailing list
>>>>> ceph-users@lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
>>>>
>>>> --
>>>> Jason
>>>>
>>>
>>> [1] https://github.com/ceph/ceph/blob/master/src/msg/msg_types.cc#L108
>>>
>>> --
>>> Jason
>>>
>>
>>
>
> --
> Jason
>

