Re: [ceph-users] data cleaup/disposal process

2018-01-11 Thread M Ranga Swami Reddy
Hi - The "rbd rm" or "rados rm -p " will not clean the data in
side the OSDs. for ex: I wrote 1 MB data on my image/volume, then
removed that image using "rbd rm" command, is this "rbd rm" will
remove the data in side the OSD's object or just mark it as removed.

Thanks
Swami
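
For reference: as far as I understand it, "rbd rm" deletes the RADOS objects
backing the image but does not overwrite the bits on the underlying disks. If
the data has to be overwritten first, one possible approach is to zero the
image before removing it - a rough sketch only, assuming krbd access and
placeholder pool/image names, and note that on some OSD backends new writes
may be allocated to fresh space rather than overwriting the old blocks:

rbd map mypool/myimage              # returns a block device, e.g. /dev/rbd0
dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
rbd unmap /dev/rbd0
rbd rm mypool/myimage

This writes the full provisioned size of the image, so it can take a while.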

On Thu, Jan 4, 2018 at 6:49 PM, Sergey Malinin  wrote:
> http://cephnotes.ksperis.com/blog/2014/07/04/remove-big-rbd-image
>
> 
> From: ceph-users  on behalf of M Ranga
> Swami Reddy 
> Sent: Thursday, January 4, 2018 3:55:27 PM
> To: ceph-users; ceph-devel
> Subject: [ceph-users] data cleaup/disposal process
>
> Hello,
> In Ceph, is there a way to clean up data before deleting an image?
>
> That is, wipe the data with zeros ('0') before deleting the image.
>
> Please let me know if you have any suggestions here.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread Christian Balzer

Hello,

On Thu, 11 Jan 2018 11:42:53 -0600 Adam Tygart wrote:

> Some people are doing hyperconverged ceph, colocating qemu
> virtualization with ceph-osds. It is relevant for a decent subset of
> people here. Therefore knowledge of the degree of performance
> degradation is useful.
> 
It was my understanding that meltdown can not reach the host kernel space
from inside VMs, only other VMs would be at risk at the most.
Spectre is a different beast, but again AFAIK there aren't any kernel
patches for that yet.

See for example:
https://security.stackexchange.com/questions/176709/meltdown-and-virtual-machines

The chuckles you're hearing are me with nearly all of our compute nodes
still being AMD ones. ^o^

Christian
> --
> Adam
> 
> On Thu, Jan 11, 2018 at 11:38 AM,   wrote:
> > I don't understand how all of this is related to Ceph
> >
> > Ceph runs on a dedicated hardware, there is nothing there except Ceph, and
> > the ceph daemons have already all power on ceph's data.
> > And there is no random-code execution allowed on this node.
> >
> > Thus, spectre & meltdown are meaning-less for Ceph's node, and mitigations
> > should be disabled
> >
> > Is this wrong ?
> >
> >
> > On 01/11/2018 06:26 PM, Dan van der Ster wrote:  
> >>
> >> Hi all,
> >>
> >> Is anyone getting useful results with your benchmarking? I've prepared
> >> two test machines/pools and don't see any definitive slowdown with
> >> patched kernels from CentOS [1].
> >>
> >> I wonder if Ceph will be somewhat tolerant of these patches, similarly
> >> to what's described here:
> >> http://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> >>
> >> Cheers, Dan
> >>
> >> [1] Ceph v12.2.2, FileStore OSDs, kernels 3.10.0-693.11.6.el7.x86_64
> >> vs the ancient 3.10.0-327.18.2.el7.x86_64
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>  
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] Trying to increase number of PGs throws "Error E2BIG" though PGs/OSD < mon_max_pg_per_osd

2018-01-11 Thread Brad Hubbard
On Fri, Jan 12, 2018 at 11:27 AM, Subhachandra Chandra
 wrote:
> Hello,
>
>  We are running experiments on a Ceph cluster before we move data on it.
> While trying to increase the number of PGs on one of the pools it threw the
> following error
>
> root@ctrl1:/# ceph osd pool set data pg_num 65536
> Error E2BIG: specified pg_num 65536 is too large (creating 32768 new PGs on
> ~540 OSDs exceeds per-OSD max of 32)

That comes from here:

https://github.com/ceph/ceph/blob/5d7813f612aea59239c8375aaa00919ae32f952f/src/mon/OSDMonitor.cc#L6027

So the warning is triggered because new_pgs (32768, the number of new PGs being created) >
g_conf->mon_osd_max_split_count (32) * expected_osds (540)
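
In practice that means either stepping pg_num up in chunks of at most
mon_osd_max_split_count * OSDs (32 * 540 = 17280 new PGs per step here), or
raising that setting. A rough sketch, not a recommendation:

# split in steps instead of doubling in one go; remember pgp_num has to follow
ceph osd pool set data pg_num 49152
# ...wait for the new PGs to become active+clean, then continue to 65536

# or raise the per-step limit on the mons; putting mon_osd_max_split_count in
# ceph.conf under [mon] and restarting is the conservative route, injectargs
# may also work at runtime
ceph tell mon.\* injectargs '--mon_osd_max_split_count 64'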

>
> There are 2 pools named "data" and "metadata". "data" is an erasure coded
> pool (6,3) and "metadata" is a replicated pool with a replication factor of
> 3.
>
> root@ctrl1:/# ceph osd lspools
> 1 metadata,2 data,
> root@ctrl1:/# ceph osd pool get metadata pg_num
> pg_num: 512
> root@ctrl1:/# ceph osd pool get data pg_num
> pg_num: 32768
>
> osd: 540 osds: 540 up, 540 in
>  flags noout,noscrub,nodeep-scrub
>
>   data:
> pools:   2 pools, 33280 pgs
> objects: 7090k objects, 1662 TB
> usage:   2501 TB used, 1428 TB / 3929 TB avail
> pgs: 33280 active+clean
>
> The current PG/OSD ratio according to my calculation should be 549
 (32768 * 9 + 512 * 3 ) / 540.0
> 548.97778
>
> Increasing the number of PGs in the "data" pool should increase the PG/OSD
> ratio to about 1095
 (65536 * 9 + 512 * 3 ) / 540.0
> 1095.
>
> In the config, settings related to PG/OSD ratio look like
> mon_max_pg_per_osd = 1500
> osd_max_pg_per_osd_hard_ratio = 1.0
>
> Trying to increase the number of PGs to 65536 throws the previously
> mentioned error. The new PG/OSD ratio is still under the configured limit.
> Why do we see the error? Further, there seems to be a bug in the error
> message where it says "exceeds per-OSD max of 32" - where does the
> "32" come from?

Maybe the wording could be better. Perhaps "exceeds per-OSD max with
mon_osd_max_split_count of 32". I'll submit this and see how it goes.

>
> P.S. I understand that the PG/OSD ratio configured on this cluster far
> exceeds the recommended values. The experiment is to find scaling limits and
> try out expansion scenarios.
>
> Thanks
> Subhachandra
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad


[ceph-users] Trying to increase number of PGs throws "Error E2BIG" though PGs/OSD < mon_max_pg_per_osd

2018-01-11 Thread Subhachandra Chandra
Hello,

 We are running experiments on a Ceph cluster before we move data onto
it. While trying to increase the number of PGs on one of the pools, it threw
the following error:

root@ctrl1:/# ceph osd pool set data pg_num 65536
Error E2BIG: specified pg_num 65536 is too large (creating 32768 new PGs on
~540 OSDs exceeds per-OSD max of 32)

There are 2 pools named "data" and "metadata". "data" is an erasure coded
pool (6,3) and "metadata" is a replicated pool with a replication factor of
3.

root@ctrl1:/# ceph osd lspools
1 metadata,2 data,
root@ctrl1:/# ceph osd pool get metadata pg_num
pg_num: 512
root@ctrl1:/# ceph osd pool get data pg_num
pg_num: 32768

osd: 540 osds: 540 up, 540 in
 flags noout,noscrub,nodeep-scrub

  data:
pools:   2 pools, 33280 pgs
objects: 7090k objects, 1662 TB
usage:   2501 TB used, 1428 TB / 3929 TB avail
pgs: 33280 active+clean

The current PG/OSD ratio according to my calculation should be 549
>>> (32768 * 9 + 512 * 3 ) / 540.0
548.97778

Increasing the number of PGs in the "data" pool should increase the PG/OSD
ratio to about 1095
>>> (65536 * 9 + 512 * 3 ) / 540.0
1095.1111111111111

In the config, settings related to PG/OSD ratio look like
mon_max_pg_per_osd = 1500
osd_max_pg_per_osd_hard_ratio = 1.0

Trying to increase the number of PGs to 65536 throws the previously
mentioned error. The new PG/OSD ratio is still under the configured limit.
Why do we see the error? Further, there seems to be a bug in the error
message where it says "exceeds per-OSD max of 32" - where does the
"32" come from?

P.S. I understand that the PG/OSD ratio configured on this cluster far
exceeds the recommended values. The experiment is to find scaling limits
and try out expansion scenarios.

Thanks
Subhachandra


Re: [ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-11 Thread David Turner
Which pools are the incomplete PGs a part of? I would say it's very likely
that if some of the RGW metadata was incomplete, the daemons wouldn't
be happy.
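
One quick way to answer that (the pool id is the numeric prefix of each PG id):

ceph health detail | grep incomplete
ceph pg ls incomplete              # PG ids look like <pool-id>.<hex>, e.g. 11.3fa
ceph osd lspools                   # map the pool ids back to names

If the incomplete PGs sit in the RGW index/metadata pools rather than the data
pool, that would line up with the daemons misbehaving.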

On Thu, Jan 11, 2018, 6:17 PM Brent Kennedy  wrote:

> We have 3 RadosGW servers running behind HAProxy to enable clients to
> connect to the ceph cluster like an amazon bucket.  After all the failures
> and upgrade issues were resolved, I cannot get the RadosGW servers to stay
> online.  They were upgraded to luminous, I even upgraded the OS to Ubuntu
> 16 on them ( before upgrading to Luminous ).  They used to have apache on
> them as they ran Hammer and before that firefly.  I removed apache before
> upgrading to Luminous.  They start up and run for about 4-6 hours before all
> three start to go offline.  Client traffic is light right now as we are
> just testing file read/write before we reactivate them ( they switched back
> to amazon while we fix them ).
>
>
>
> Could the 4 incomplete PGs be causing them to go offline?  The last time I
> saw an issue like this was when recovery wasn’t working 100%, so it seems
> related since they haven’t been stable since we upgraded (but that was also
> after the failures we had, which is why I am not trying to specifically
> blame the upgrade ).
>
>
>
> When I look at the radosgw log, this is what I see ( the first 2 lines
> show up plenty before this, they are health checks by the haproxy server,
> the next two are file requests that 404 fail I am guessing, then the last
> one is me restarting the service ):
>
>
>
> 2018-01-11 20:14:36.640577 7f5826aa3700  1 == req done
> req=0x7f5826a9d1f0 op status=0 http_status=200 ==
>
> 2018-01-11 20:14:36.640602 7f5826aa3700  1 civetweb: 0x56202c567000:
> 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -
>
> 2018-01-11 20:14:36.640835 7f5816282700  1 == req done
> req=0x7f581627c1f0 op status=0 http_status=200 ==
>
> 2018-01-11 20:14:36.640859 7f5816282700  1 civetweb: 0x56202c61:
> 192.168.120.22 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -
>
> 2018-01-11 20:14:36.761917 7f5835ac1700  1 == starting new request
> req=0x7f5835abb1f0 =
>
> 2018-01-11 20:14:36.763936 7f5835ac1700  1 == req done
> req=0x7f5835abb1f0 op status=0 http_status=404 ==
>
> 2018-01-11 20:14:36.763983 7f5835ac1700  1 civetweb: 0x56202c4ce000:
> 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
> /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
> aws-sdk-dotnet-35/2
>
> .0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO
>
> 2018-01-11 20:14:36.772611 7f5808266700  1 == starting new request
> req=0x7f58082601f0 =
>
> 2018-01-11 20:14:36.773733 7f5808266700  1 == req done
> req=0x7f58082601f0 op status=0 http_status=404 ==
>
> 2018-01-11 20:14:36.773769 7f5808266700  1 civetweb: 0x56202c6aa000:
> 192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
> /Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
> aws-sdk-dotnet-35/2
>
> .0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO
>
> 2018-01-11 20:14:38.163617 7f5836ac3700  1 == starting new request
> req=0x7f5836abd1f0 =
>
> 2018-01-11 20:14:38.165352 7f5836ac3700  1 == req done
> req=0x7f5836abd1f0 op status=0 http_status=404 ==
>
> 2018-01-11 20:14:38.165401 7f5836ac3700  1 civetweb: 0x56202c4e2000:
> 192.168.120.21 - - [11/Jan/2018:20:14:38 +] "HEAD
> /Jobimages/vendor05/10/3445645/3445645_cover.pdf HTTP/1.1" 1 0 -
> aws-sdk-dotnet-35/2
>
> .0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO
>
> 2018-01-11 20:14:38.170551 7f5807a65700  1 == starting new request
> req=0x7f5807a5f1f0 =
>
> 2018-01-11 20:14:40.322236 7f58352c0700  1 == starting new request
> req=0x7f58352ba1f0 =
>
> 2018-01-11 20:14:40.323468 7f5834abf700  1 == starting new request
> req=0x7f5834ab91f0 =
>
> 2018-01-11 20:14:41.643365 7f58342be700  1 == starting new request
> req=0x7f58342b81f0 =
>
> 2018-01-11 20:14:41.643358 7f58312b8700  1 == starting new request
> req=0x7f58312b21f0 =
>
> 2018-01-11 20:14:50.324196 7f5829aa9700  1 == starting new request
> req=0x7f5829aa31f0 =
>
> 2018-01-11 20:14:50.325622 7f58332bc700  1 == starting new request
> req=0x7f58332b61f0 =
>
> 2018-01-11 20:14:51.645678 7f58362c2700  1 == starting new request
> req=0x7f58362bc1f0 =
>
> 2018-01-11 20:14:51.645671 7f582e2b2700  1 == starting new request
> req=0x7f582e2ac1f0 =
>
> 2018-01-11 20:15:00.326452 7f5815a81700  1 == starting new request
> req=0x7f5815a7b1f0 =
>
> 2018-01-11 20:15:00.328787 7f5828aa7700  1 == starting new request
> req=0x7f5828aa11f0 =
>
> 2018-01-11 20:15:01.648196 7f580ea73700  1 == starting new request
> req=0x7f580ea6d1f0 =
>
> 2018-01-11 20:15:01.648698 7f5830ab7700  1 == starting new request
> req=0x7f5830ab11f0 =
>
> 2018-01-11 20:15:10.328810 7f5832abb700  1 == 

[ceph-users] 4 incomplete PGs causing RGW to go offline?

2018-01-11 Thread Brent Kennedy
We have 3 RadosGW servers running behind HAProxy to enable clients to
connect to the ceph cluster like an amazon bucket.  After all the failures
and upgrade issues were resolved, I cannot get the RadosGW servers to stay
online.  They were upgraded to luminous, I even upgraded the OS to Ubuntu 16
on them ( before upgrading to Luminous ).  They used to have apache on them
as they ran Hammer and before that firefly.  I removed apache before
upgrading to Luminous.  They start up and run for about 4-6 hours before all
three start to go offline.  Client traffic is light right now as we are just
testing file read/write before we reactivate them ( they switched back to
amazon while we fix them ).  

 

Could the 4 incomplete PGs be causing them to go offline?  The last time I
saw an issue like this was when recovery wasn't working 100%, so it seems
related since they haven't been stable since we upgraded (but that was also
after the failures we had, which is why I am not trying to specifically
blame the upgrade ).

 

When I look at the radosgw log, this is what I see ( the first 2 lines show
up plenty before this, they are health checks by the haproxy server, the
next two are file requests that 404 fail I am guessing, then the last one is
me restarting the service ):

 

2018-01-11 20:14:36.640577 7f5826aa3700  1 == req done
req=0x7f5826a9d1f0 op status=0 http_status=200 ==

2018-01-11 20:14:36.640602 7f5826aa3700  1 civetweb: 0x56202c567000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.640835 7f5816282700  1 == req done
req=0x7f581627c1f0 op status=0 http_status=200 ==

2018-01-11 20:14:36.640859 7f5816282700  1 civetweb: 0x56202c61:
192.168.120.22 - - [11/Jan/2018:20:14:36 +] "HEAD / HTTP/1.0" 1 0 - -

2018-01-11 20:14:36.761917 7f5835ac1700  1 == starting new request
req=0x7f5835abb1f0 =

2018-01-11 20:14:36.763936 7f5835ac1700  1 == req done
req=0x7f5835abb1f0 op status=0 http_status=404 ==

2018-01-11 20:14:36.763983 7f5835ac1700  1 civetweb: 0x56202c4ce000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
/Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:36.772611 7f5808266700  1 == starting new request
req=0x7f58082601f0 =

2018-01-11 20:14:36.773733 7f5808266700  1 == req done
req=0x7f58082601f0 op status=0 http_status=404 ==

2018-01-11 20:14:36.773769 7f5808266700  1 civetweb: 0x56202c6aa000:
192.168.120.21 - - [11/Jan/2018:20:14:36 +] "HEAD
/Jobimages/vendor05/10/3962896/3962896_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.163617 7f5836ac3700  1 == starting new request
req=0x7f5836abd1f0 =

2018-01-11 20:14:38.165352 7f5836ac3700  1 == req done
req=0x7f5836abd1f0 op status=0 http_status=404 ==

2018-01-11 20:14:38.165401 7f5836ac3700  1 civetweb: 0x56202c4e2000:
192.168.120.21 - - [11/Jan/2018:20:14:38 +] "HEAD
/Jobimages/vendor05/10/3445645/3445645_cover.pdf HTTP/1.1" 1 0 -
aws-sdk-dotnet-35/2

.0.2.2 .NET Runtime/4.0 .NET Framework/4.0 OS/6.2.9200.0 FileIO

2018-01-11 20:14:38.170551 7f5807a65700  1 == starting new request
req=0x7f5807a5f1f0 =

2018-01-11 20:14:40.322236 7f58352c0700  1 == starting new request
req=0x7f58352ba1f0 =

2018-01-11 20:14:40.323468 7f5834abf700  1 == starting new request
req=0x7f5834ab91f0 =

2018-01-11 20:14:41.643365 7f58342be700  1 == starting new request
req=0x7f58342b81f0 =

2018-01-11 20:14:41.643358 7f58312b8700  1 == starting new request
req=0x7f58312b21f0 =

2018-01-11 20:14:50.324196 7f5829aa9700  1 == starting new request
req=0x7f5829aa31f0 =

2018-01-11 20:14:50.325622 7f58332bc700  1 == starting new request
req=0x7f58332b61f0 =

2018-01-11 20:14:51.645678 7f58362c2700  1 == starting new request
req=0x7f58362bc1f0 =

2018-01-11 20:14:51.645671 7f582e2b2700  1 == starting new request
req=0x7f582e2ac1f0 =

2018-01-11 20:15:00.326452 7f5815a81700  1 == starting new request
req=0x7f5815a7b1f0 =

2018-01-11 20:15:00.328787 7f5828aa7700  1 == starting new request
req=0x7f5828aa11f0 =

2018-01-11 20:15:01.648196 7f580ea73700  1 == starting new request
req=0x7f580ea6d1f0 =

2018-01-11 20:15:01.648698 7f5830ab7700  1 == starting new request
req=0x7f5830ab11f0 =

2018-01-11 20:15:10.328810 7f5832abb700  1 == starting new request
req=0x7f5832ab51f0 =

2018-01-11 20:15:10.329541 7f582f2b4700  1 == starting new request
req=0x7f582f2ae1f0 =

2018-01-11 20:15:11.650655 7f582d2b0700  1 == starting new request
req=0x7f582d2aa1f0 =

2018-01-11 20:15:11.651401 7f582aaab700  1 == starting new request
req=0x7f582aaa51f0 =

2018-01-11 20:15:20.332032 7f582c2ae700  1 == starting new request
req=0x7f582c2a81f0 

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
Thank you for documenting your progress and peril on the ML.

Luckily I only have 24x 8TB HDD and 50x 1.92TB SSDs to migrate over to 
bluestore.

8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m 
able to do about 3 at a time (1 node) for rip/replace.

Definitely taking it slow and steady, and the SSDs will move quickly for 
backfills as well.
Seeing about 1TB/6hr on backfills, without much performance hit on the rest of 
everything. With about 5TB average utilization on each 8TB disk, that is roughly 30 hours 
per host; times 8 hosts that will be about 10 days, so a couple of weeks is a safe amount of 
headway.
This write performance certainly seems better on bluestore than filestore, so 
that likely helps as well.

Expect I can probably refill an SSD osd in about an hour or two, and will 
likely stagger those out.
But with such a small number of osd’s currently, I’m taking the by-hand 
approach rather than scripting it so as to avoid similar pitfalls.

Reed 

> On Jan 11, 2018, at 12:38 PM, Brady Deetz  wrote:
> 
> I hear you on time. I have 350 x 6TB drives to convert. I recently posted 
> about a disaster I created automating my migration. Good luck
> 
> On Jan 11, 2018 12:22 PM, "Reed Dier"  > wrote:
> I am in the process of migrating my OSDs to bluestore finally and thought I 
> would give you some input on how I am approaching it.
> Some of saga you can find in another ML thread here: 
> https://www.spinics.net/lists/ceph-users/msg41802.html 
> 
> 
> My first OSD I was cautious, and I outed the OSD without downing it, allowing 
> it to move data off.
> Some background on my cluster, for this OSD, it is an 8TB spinner, with an 
> NVMe partition previously used for journaling in filestore, intending to be 
> used for block.db in bluestore.
> 
> Then I downed it, flushed the journal, destroyed it, zapped with ceph-volume, 
> set norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph 
> auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume locally to 
> create the new LVM target. Then unset the norecover and norebalance flags and 
> it backfilled like normal.
> 
> I initially ran into issues with specifying --osd.id  causing 
> my osd’s to fail to start, but removing that I was able to get it to fill in 
> the gap of the OSD I just removed.
> 
> I’m now doing quicker, more destructive migrations in an attempt to reduce 
> data movement.
> This way I don’t read from OSD I’m replacing, write to other OSD temporarily, 
> read back from temp OSD, write back to ‘new’ OSD.
> I’m just reading from replica and writing to ‘new’ OSD.
> 
> So I’m setting the norecover and norebalance flags, down the OSD (but not 
> out, it stays in, also have the noout flag set), destroy/zap, recreate using 
> ceph-volume, unset the flags, and it starts backfilling.
> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time 
> to offload it and then backfill back from them. I trust my disks enough to 
> backfill from the other disks, and its going well. Also seeing very good 
> write performance backfilling compared to previous drive replacements in 
> filestore, so thats very promising.
> 
> Reed
> 
>> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen > > wrote:
>> 
>> Hi Alfredo,
>> 
>> thank you for your comments:
>> 
>> Zitat von Alfredo Deza >:
>>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen >> > wrote:
 Dear *,
 
 has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
 keeping the OSD number? There have been a number of messages on the list,
 reporting problems, and my experience is the same. (Removing the existing
 OSD and creating a new one does work for me.)
 
 I'm working on an Ceph 12.2.2 cluster and tried following
 http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
  
 
 - this basically says
 
 1. destroy old OSD
 2. zap the disk
 3. prepare the new OSD
 4. activate the new OSD
 
 I never got step 4 to complete. The closest I got was by doing the 
 following
 steps (assuming OSD ID "999" on /dev/sdzz):
 
 1. Stop the old OSD via systemd (osd-node # systemctl stop
 ceph-osd@999.service)
 
 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
 
 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
 volume group
 
 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
 
 4. destroy the old OSD (osd-node # ceph osd destroy 999
 --yes-i-really-mean-it)
 
 5. create a new OSD entry 

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Brady Deetz
I hear you on time. I have 350 x 6TB drives to convert. I recently posted
about a disaster I created automating my migration. Good luck

On Jan 11, 2018 12:22 PM, "Reed Dier"  wrote:

> I am in the process of migrating my OSDs to bluestore finally and thought
> I would give you some input on how I am approaching it.
> Some of saga you can find in another ML thread here:
> https://www.spinics.net/lists/ceph-users/msg41802.html
>
> My first OSD I was cautious, and I outed the OSD without downing it,
> allowing it to move data off.
> Some background on my cluster, for this OSD, it is an 8TB spinner, with an
> NVMe partition previously used for journaling in filestore, intending to be
> used for block.db in bluestore.
>
> Then I downed it, flushed the journal, destroyed it, zapped with
> ceph-volume, set norecover and norebalance flags, did ceph osd crush remove
> osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID and used
> ceph-volume locally to create the new LVM target. Then unset the norecover
> and norebalance flags and it backfilled like normal.
>
> I initially ran into issues with specifying --osd.id causing my osd’s to
> fail to start, but removing that I was able to get it to fill in the gap of
> the OSD I just removed.
>
> I’m now doing quicker, more destructive migrations in an attempt to reduce
> data movement.
> This way I don’t read from OSD I’m replacing, write to other OSD
> temporarily, read back from temp OSD, write back to ‘new’ OSD.
> I’m just reading from replica and writing to ‘new’ OSD.
>
> So I’m setting the norecover and norebalance flags, down the OSD (but not
> out, it stays in, also have the noout flag set), destroy/zap, recreate
> using ceph-volume, unset the flags, and it starts backfilling.
> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a *long* time
> to offload it and then backfill back from them. I trust my disks enough to
> backfill from the other disks, and its going well. Also seeing very good
> write performance backfilling compared to previous drive replacements in
> filestore, so thats very promising.
>
> Reed
>
> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen  wrote:
>
> Hi Alfredo,
>
> thank you for your comments:
>
> Zitat von Alfredo Deza :
>
> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:
>
> Dear *,
>
> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
> keeping the OSD number? There have been a number of messages on the list,
> reporting problems, and my experience is the same. (Removing the existing
> OSD and creating a new one does work for me.)
>
> I'm working on an Ceph 12.2.2 cluster and tried following
> http://docs.ceph.com/docs/master/rados/operations/add-
> or-rm-osds/#replacing-an-osd
> - this basically says
>
> 1. destroy old OSD
> 2. zap the disk
> 3. prepare the new OSD
> 4. activate the new OSD
>
> I never got step 4 to complete. The closest I got was by doing the
> following
> steps (assuming OSD ID "999" on /dev/sdzz):
>
> 1. Stop the old OSD via systemd (osd-node # systemctl stop
> ceph-osd@999.service)
>
> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>
> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
> volume group
>
> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>
> 4. destroy the old OSD (osd-node # ceph osd destroy 999
> --yes-i-really-mean-it)
>
> 5. create a new OSD entry (osd-node # ceph osd new $(cat
> /var/lib/ceph/osd/ceph-999/fsid) 999)
>
>
> Step 5 and 6 are problematic if you are going to be trying ceph-volume
> later on, which takes care of doing this for you.
>
>
> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
> /var/lib/ceph/osd/ceph-999/keyring)
>
>
> I at first tried to follow the documented steps (without my steps 5 and
> 6), which did not work for me. The documented approach failed with "init
> authentication >> failed: (1) Operation not permitted", because actually
> ceph-volume did not add the auth entry for me.
>
> But even after manually adding the authentication, the "ceph-volume"
> approach failed, as the OSD was still marked "destroyed" in the osdmap
> epoch as used by ceph-osd (see the commented messages from ceph-osd.999.log
> below).
>
>
> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
> --osd-id 999 --data /dev/sdzz)
>
>
> You are going to hit a bug in ceph-volume that is preventing you from
> specifying the osd id directly if the ID has been destroyed.
>
> See http://tracker.ceph.com/issues/22642
>
>
> If I read that bug description correctly, you're confirming why I needed
> step #6 above (manually adding the OSD auth entry. But even if ceph-volume
> had added it, the ceph-osd.log entries suggest that starting the OSD would
> still have failed, because of accessing the wrong osdmap epoch.

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
I am in the process of migrating my OSDs to bluestore finally and thought I 
would give you some input on how I am approaching it.
Some of the saga you can find in another ML thread here: 
https://www.spinics.net/lists/ceph-users/msg41802.html 


My first OSD I was cautious, and I outed the OSD without downing it, allowing 
it to move data off.
Some background on my cluster, for this OSD, it is an 8TB spinner, with an NVMe 
partition previously used for journaling in filestore, intending to be used for 
block.db in bluestore.

Then I downed it, flushed the journal, destroyed it, zapped with ceph-volume, 
set norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph 
auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume locally to 
create the new LVM target. Then unset the norecover and norebalance flags and 
it backfilled like normal.

I initially ran into issues with specifying --osd-id causing my OSDs to fail 
to start, but after removing that I was able to get it to fill in the gap of the OSD 
I just removed.

I’m now doing quicker, more destructive migrations in an attempt to reduce data 
movement.
This way I avoid reading from the OSD I’m replacing, writing to another OSD temporarily, 
reading back from the temp OSD, and writing back to the ‘new’ OSD.
I’m just reading from a replica and writing to the ‘new’ OSD.

So I’m setting the norecover and norebalance flags, down the OSD (but not out, 
it stays in, also have the noout flag set), destroy/zap, recreate using 
ceph-volume, unset the flags, and it starts backfilling.
For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time to 
offload it and then backfill back from them. I trust my disks enough to 
backfill from the other disks, and it’s going well. Also seeing very good write 
performance backfilling compared to previous drive replacements in filestore, 
so that’s very promising.
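
Condensed into commands, the flag-set / destroy / zap / recreate sequence
described above looks roughly like this (a sketch only - the osd id, data
device and NVMe db partition are placeholders, and --osd-id is omitted because
of the ceph-volume bug mentioned earlier; the freed id is normally handed back
to the new OSD anyway):

ceph osd set noout && ceph osd set norecover && ceph osd set norebalance
systemctl stop ceph-osd@$ID
ceph osd destroy $ID --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY
ceph osd unset norecover && ceph osd unset norebalance && ceph osd unset noout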

Reed

> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen  wrote:
> 
> Hi Alfredo,
> 
> thank you for your comments:
> 
> Zitat von Alfredo Deza >:
>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:
>>> Dear *,
>>> 
>>> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
>>> keeping the OSD number? There have been a number of messages on the list,
>>> reporting problems, and my experience is the same. (Removing the existing
>>> OSD and creating a new one does work for me.)
>>> 
>>> I'm working on an Ceph 12.2.2 cluster and tried following
>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>> - this basically says
>>> 
>>> 1. destroy old OSD
>>> 2. zap the disk
>>> 3. prepare the new OSD
>>> 4. activate the new OSD
>>> 
>>> I never got step 4 to complete. The closest I got was by doing the following
>>> steps (assuming OSD ID "999" on /dev/sdzz):
>>> 
>>> 1. Stop the old OSD via systemd (osd-node # systemctl stop
>>> ceph-osd@999.service)
>>> 
>>> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>>> 
>>> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
>>> volume group
>>> 
>>> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>>> 
>>> 4. destroy the old OSD (osd-node # ceph osd destroy 999
>>> --yes-i-really-mean-it)
>>> 
>>> 5. create a new OSD entry (osd-node # ceph osd new $(cat
>>> /var/lib/ceph/osd/ceph-999/fsid) 999)
>> 
>> Step 5 and 6 are problematic if you are going to be trying ceph-volume
>> later on, which takes care of doing this for you.
>> 
>>> 
>>> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
>>> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
>>> /var/lib/ceph/osd/ceph-999/keyring)
> 
> I at first tried to follow the documented steps (without my steps 5 and 6), 
> which did not work for me. The documented approach failed with "init 
> authentication >> failed: (1) Operation not permitted", because actually 
> ceph-volume did not add the auth entry for me.
> 
> But even after manually adding the authentication, the "ceph-volume" approach 
> failed, as the OSD was still marked "destroyed" in the osdmap epoch as used 
> by ceph-osd (see the commented messages from ceph-osd.999.log below).
> 
>>> 
>>> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
>>> --osd-id 999 --data /dev/sdzz)
>> 
>> You are going to hit a bug in ceph-volume that is preventing you from
>> specifying the osd id directly if the ID has been destroyed.
>> 
>> See http://tracker.ceph.com/issues/22642 
>> 
> 
> If I read that bug description correctly, you're confirming why I needed step 
> #6 above (manually adding the OSD auth entry. But even if ceph-volume had 
> added it, the ceph-osd.log entries suggest that starting the OSD would still 
> have failed, because of accessing the wrong osdmap epoch.
> 
> To me it seems like I'm 

Re: [ceph-users] Ceph MGR Influx plugin 12.2.2

2018-01-11 Thread Reed Dier
This morning I went through and enabled the influx plugin in ceph-mgr on 12.2.2; 
so far so good.

The only non-obvious step was installing the python-influxdb package that it 
depends on. That probably needs to be baked into the documentation somewhere.
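
For the record, getting it going amounted to roughly the following (a sketch
from memory - the mgr/influx/* config-key names are my assumption of what the
Luminous module reads and should be checked against the module's source/README;
hostname, database and credentials are placeholders):

apt install python-influxdb        # or: yum install python-influxdb / pip install influxdb
ceph mgr module enable influx
ceph config-key set mgr/influx/hostname influxdb.example.com
ceph config-key set mgr/influx/database ceph
ceph config-key set mgr/influx/username ceph
ceph config-key set mgr/influx/password secret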

Other than that, 90% of the stats I use are in this, and a few breakdowns of my 
existing statistics are available now.

If I had to make a wishlist of stats I wish were part of this:
- PG state stats - number of PGs active, clean, scrubbing, scrubbing-deep, backfilling, recovering, etc.
- Pool ops - we have pool-level rd/wr_bytes, would love to see pool-level rd/wr_ops as well.
- Cluster-level object state stats - total objects, degraded, misplaced, unfound, etc.
- Daemon (osd/mon/mds/mgr) state stats - total, up, in, active, degraded/failed, quorum, etc.
- OSD recovery_bytes - recovery bytes to complement ops (like ceph -s provides)

Otherwise, this seems to be a much better approach than CollectD for data 
collection and shipping as it eliminates the middleman and puts the mgr daemons 
to work.

Love to see the ceph-mgr daemons grow in capability like this, take load off 
the mons, and provide more useful functionality.

Thanks,
Reed

> On Jan 11, 2018, at 10:02 AM, Benjeman Meekhof  wrote:
> 
> Hi Reed,
> 
> Someone in our group originally wrote the plugin and put in PR.  Since
> our commit the plugin was 'forward-ported' to master and made
> incompatible with Luminous so we've been using our own version of the
> plugin while waiting for the necessary pieces to be back-ported to
> Luminous to use the modified upstream version.  Now we are in the
> process of trying out the back-ported version that is in 12.2.2 as
> well as adding some additional code from our version that collects pg
> summary information (count of active, etc) and supports sending to
> multiple influx destinations.  We'll attempt to PR any changes we
> make.
> 
> So to answer your question:  Yes, we use it but not exactly the
> version from upstream in production yet.  However in our testing the
> module included with 12.2.2 appears to work as expected and we're
> planning to move over to it and do any future work based from the
> version in the upstream Ceph tree.
> 
> There is one issue/bug that may still exist:  because of how the
> data point timestamps are written inside a loop through OSD stats, the
> spread is sometimes wide enough that Grafana doesn't group properly
> and you get the appearance of extreme spikes in derivative calculation
> of rates.  We ended up modifying our code to calculate timestamps just
> outside the loops that create data points and apply it to every point
> created in loops through stats.  Of course we'll feed that back
> upstream when we get to it and assuming it is still an issue in the
> current code.
> 
> thanks,
> Ben
> 
> On Thu, Jan 11, 2018 at 2:04 AM, Reed Dier  wrote:
>> Hi all,
>> 
>> Does anyone have any idea if the influx plugin for ceph-mgr is stable in
>> 12.2.2?
>> 
>> Would love to ditch collectd and report directly from ceph if that is the
>> case.
>> 
>> Documentation says that it is added in Mimic/13.x, however it looks like
>> from an earlier ML post that it would be coming to Luminous.
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021302.html
>> 
>> I also see it as a disabled module currently:
>> 
>> $ ceph mgr module ls
>> {
>>"enabled_modules": [
>>"dashboard",
>>"restful",
>>"status"
>>],
>>"disabled_modules": [
>>"balancer",
>>"influx",
>>"localpool",
>>"prometheus",
>>"selftest",
>>"zabbix"
>>]
>> }
>> 
>> 
>> Curious if anyone has been using it in place of CollectD/Telegraf for
>> feeding InfluxDB with statistics.
>> 
>> Thanks,
>> 
>> Reed
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 



Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread Adam Tygart
Some people are doing hyperconverged ceph, colocating qemu
virtualization with ceph-osds. It is relevant for a decent subset of
people here. Therefore knowledge of the degree of performance
degradation is useful.

--
Adam

On Thu, Jan 11, 2018 at 11:38 AM,   wrote:
> I don't understand how all of this is related to Ceph
>
> Ceph runs on a dedicated hardware, there is nothing there except Ceph, and
> the ceph daemons have already all power on ceph's data.
> And there is no random-code execution allowed on this node.
>
> Thus, spectre & meltdown are meaning-less for Ceph's node, and mitigations
> should be disabled
>
> Is this wrong ?
>
>
> On 01/11/2018 06:26 PM, Dan van der Ster wrote:
>>
>> Hi all,
>>
>> Is anyone getting useful results with your benchmarking? I've prepared
>> two test machines/pools and don't see any definitive slowdown with
>> patched kernels from CentOS [1].
>>
>> I wonder if Ceph will be somewhat tolerant of these patches, similarly
>> to what's described here:
>> http://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
>>
>> Cheers, Dan
>>
>> [1] Ceph v12.2.2, FileStore OSDs, kernels 3.10.0-693.11.6.el7.x86_64
>> vs the ancient 3.10.0-327.18.2.el7.x86_64
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread ceph

I don't understand how all of this is related to Ceph.

Ceph runs on dedicated hardware; there is nothing there except Ceph,
and the ceph daemons already have full power over Ceph's data.

And no random code execution is allowed on these nodes.

Thus, Spectre & Meltdown are meaningless for Ceph nodes, and the
mitigations should be disabled.


Is this wrong?
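
For what it's worth, on the patched RHEL/CentOS kernels discussed below you can
check whether KPTI actually ended up enabled, and turn it off at boot if you
accept the reasoning above - a sketch; verify the exact parameter names and
debugfs paths against your kernel before relying on them:

dmesg | grep -i 'page table isolation'      # patched kernels log the KPTI state at boot
cat /sys/kernel/debug/x86/pti_enabled       # reportedly available on the RHEL/CentOS backports
# to boot without KPTI, add "nopti" (newer kernels also accept "pti=off") to the
# kernel command line in /etc/default/grub, regenerate grub.cfg and reboot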

On 01/11/2018 06:26 PM, Dan van der Ster wrote:

Hi all,

Is anyone getting useful results with your benchmarking? I've prepared
two test machines/pools and don't see any definitive slowdown with
patched kernels from CentOS [1].

I wonder if Ceph will be somewhat tolerant of these patches, similarly
to what's described here:
http://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/

Cheers, Dan

[1] Ceph v12.2.2, FileStore OSDs, kernels 3.10.0-693.11.6.el7.x86_64
vs the ancient 3.10.0-327.18.2.el7.x86_64




Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread Dan van der Ster
Hi all,

Is anyone getting useful results with your benchmarking? I've prepared
two test machines/pools and don't see any definitive slowdown with
patched kernels from CentOS [1].

I wonder if Ceph will be somewhat tolerant of these patches, similarly
to what's described here:
http://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/

Cheers, Dan

[1] Ceph v12.2.2, FileStore OSDs, kernels 3.10.0-693.11.6.el7.x86_64
vs the ancient 3.10.0-327.18.2.el7.x86_64


Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hi Konstantin,

thanks for your answer, see my answer to Alfredo which includes your
suggestions.

~Dietmar

On 01/11/2018 12:57 PM, Konstantin Shalygin wrote:
>> Now wonder what is the correct way to replace a failed OSD block disk?
> 
> The generic way for maintenance (e.g. a disk replacement) is to rebalance by changing the osd 
> weight:
> 
> ceph osd crush reweight osdid 0
> 
> The cluster migrates the data "from this osd".
> When HEALTH_OK, you can safely remove this OSD:
> 
> ceph osd out osd_id
> systemctl stop ceph-osd@osd_id
> 
> ceph osd crush remove osd_id
> ceph auth del osd_id
> ceph osd rm osd_id
> 
> 
>> I'm not sure if there is something to do with the still existing bluefs db 
>> and wal partitions on the nvme device for the failed OSD. Do they have to be 
>> zapped ? If yes, what is the best way?
> 
> 
> 1. Find the nvme partition for this OSD. You can do it in several ways: 
> ceph-volume, by hand, or with "ceph-disk list" (which is more human 
> readable):
> 
> /dev/sda :
>  /dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db 
> /dev/nvme2n1p1, block.wal /dev/nvme2n1p2
>  /dev/sda2 ceph block, for /dev/sda1
> 
> 2. Delete partition via parted or fdisk.
> 
> fdisk -u /dev/nvme2n1
> d (delete partitions)
> enter partition number of block.db: 1
> d
> enter partition number of block.wal: 2
> w (write partition table)
> 
> 3. Deploy your new OSD.
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hi Alfredo,

thanks for your coments, see my answers inline.

On 01/11/2018 01:47 PM, Alfredo Deza wrote:
> On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
>  wrote:
>> Hello,
>>
>> we have failed OSD disk in our Luminous v12.2.2 cluster that needs to
>> get replaced.
>>
>> The cluster was initially deployed using ceph-deploy on Luminous
>> v12.2.0. The OSDs were created using
>>
>> ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
>> --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
>>
>> Note we separated the bluestore data, wal and db.
>>
>> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
>>
>> With the last update we also let ceph-volume take over the OSDs using
>> "ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
>> simple activate ${osd} ${id}". All of this went smoothly.
> 
> That is good to hear!
> 
>>
>> Now wonder what is the correct way to replace a failed OSD block disk?
>>
>> The docs for luminous [1] say:
>>
>> REPLACING AN OSD
>>
>> 1. Destroy the OSD first:
>>
>> ceph osd destroy {id} --yes-i-really-mean-it
>>
>> 2. Zap a disk for the new OSD, if the disk was used before for other
>> purposes. It’s not necessary for a new disk:
>>
>> ceph-disk zap /dev/sdX
>>
>>
>> 3. Prepare the disk for replacement by using the previously destroyed
>> OSD id:
>>
>> ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid `uuidgen`
>>
>>
>> 4. And activate the OSD:
>>
>> ceph-disk activate /dev/sdX1
>>
>>
>> Initially this seems to be straight forward, but
>>
>> 1. I'm not sure if there is something to do with the still existing
>> bluefs db and wal partitions on the nvme device for the failed OSD. Do
>> they have to be zapped ? If yes, what is the best way? There is nothing
>> mentioned in the docs.
> 
> What is your concern here if the activation seems to work?

I guess on the nvme partitions for the bluefs db and bluefs wal there is
still data related to the failed OSD block device. I was thinking that
this data might "interfere" with the new replacement OSD block device,
which is empty.

So you are saying that this is no concern, right?
Are they automatically reused and assigned to the replacement OSD block
device, or do I have to specify them when running ceph-disk prepare?
If I need to specify the wal and db partition, how is this done?

I'm asking this since from the logs of the initial cluster deployment I
got the following warning:

[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.db is not the same device as the osd data
[...]
[cephosd-02][WARNING] prepare_device: OSD will not be hot-swappable if
block.wal is not the same device as the osd data
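
For what it's worth, ceph-disk can be pointed at the existing db/wal partitions
when preparing the replacement; a hedged sketch (the partition numbers are
placeholders and should be the ones that belonged to the failed OSD - verify
the exact flags with "ceph-disk prepare --help" on 12.2.2 first):

ceph-disk prepare --bluestore /dev/sdX \
    --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2 \
    --osd-id {id} --osd-uuid $(uuidgen)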


>>
>> 2. Since we already let "ceph-volume simple" take over our OSDs I'm not
>> sure if we should now use ceph-volume or again ceph-disk (followed by
>> "ceph-vloume simple" takeover) to prepare and activate the OSD?
> 
> The `simple` sub-command is meant to help with the activation of OSDs
> at boot time, supporting ceph-disk (or manual) created OSDs.

OK, got this...

> 
> There is no requirement to use `ceph-volume lvm` which is intended for
> new OSDs using LVM as devices.

Fine...

>>
>> 3. If we should use ceph-volume, then by looking at the luminous
>> ceph-volume docs [2] I find for both,
>>
>> ceph-volume lvm prepare
>> ceph-volume lvm activate
>>
>> that the bluestore option is either NOT implemented or NOT supported
>>
>> activate:  [--bluestore] filestore (IS THIS A TYPO???) objectstore (not
>> yet implemented)
>> prepare: [--bluestore] Use the bluestore objectstore (not currently
>> supported)
> 
> These might be a typo on the man page, will get that addressed. Ticket
> opened at http://tracker.ceph.com/issues/22663

Thanks

> bluestore as of 12.2.2 is fully supported and it is the default. The
> --help output in ceph-volume does have the flags updated and correctly
> showing this.

OK

>>
>>
>> So, now I'm completely lost. How is all of this fitting together in
>> order to replace a failed OSD?
> 
> You would need to keep using ceph-disk. Unless you want ceph-volume to
> take over, in which case you would need to follow the steps to deploy
> a new OSD
> with ceph-volume.

OK

> Note that although --osd-id is supported, there is an issue with that
> on 12.2.2 that would prevent you from correctly deploying it
> http://tracker.ceph.com/issues/22642
> 
> The recommendation, if you want to use ceph-volume, would be to omit
> --osd-id and let the cluster give you the ID.
> 
>>
>> 4. More after reading some a recent threads on this list additional
>> questions are coming up:
>>
>> According to the OSD replacement doc [1] :
>>
>> "When disks fail, [...], OSDs need to be replaced. Unlike Removing the
>> OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
>> keep -> kept] intact after the OSD is destroyed for replacement."
>>
>> but
>> http://tracker.ceph.com/issues/22642 seems to say that it is not
>> possible to 

Re: [ceph-users] issue adding OSDs

2018-01-11 Thread Luis Periquito
this was a bit weird, but it is now working... Writing it up for future
reference in case someone faces the same issue.

this cluster was upgraded from jewel to luminous following the
recommended process. When it was finished I just set require_osd_release
to luminous. However, I hadn't restarted the daemons since, so just
restarting all the OSDs made the problem go away.

How to check if that was the case? The OSDs now have a "class" associated.
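
Concretely, the device class shows up like this (a quick sketch):

ceph osd tree | head       # the CLASS column (hdd/ssd) is populated once the OSDs run luminous
ceph osd crush class ls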



On Wed, Jan 10, 2018 at 7:16 PM, Luis Periquito  wrote:
> Hi,
>
> I'm running a cluster with 12.2.1 and adding more OSDs to it.
> Everything is running version 12.2.1 and require_osd is set to
> luminous.
>
> one of the pools is replicated with size 2 min_size 1, and is
> seemingly blocking IO while recovering. I have no slow requests,
> looking at the output of "ceph osd perf" it seems brilliant (all
> numbers are lower than 10).
>
> clients are RBD (OpenStack VM in KVM) and using (mostly) 10.2.7. I've
> tagged those OSDs as out and the RBD just came back to life. I did
> have some objects degraded:
>
> 2018-01-10 18:23:52.081957 mon.mon0 mon.0 x.x.x.x:6789/0 410414 :
> cluster [WRN] Health check update: 9926354/49526500 objects misplaced
> (20.043%) (OBJECT_MISPLACED)
> 2018-01-10 18:23:52.081969 mon.mon0 mon.0 x.x.x.x:6789/0 410415 :
> cluster [WRN] Health check update: Degraded data redundancy:
> 5027/49526500 objects degraded (0.010%), 1761 pgs unclean, 27 pgs
> degraded (PG_DEGRADED)
>
> any thoughts as to what might be happening? I've run such operations
> many a times...
>
> thanks for all help, as I'm grasping as to figure out what's happening...


Re: [ceph-users] Does anyone use rcceph script in CentOS/SUSE?

2018-01-11 Thread Ken Dreyer
Please drop it, it has been untested for a long time.

- Ken
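
For anyone still relying on the script, the systemd targets mentioned below
cover the same ground (a quick sketch):

systemctl restart ceph-osd.target     # all OSDs on this host
systemctl restart ceph-mon.target     # all mons on this host
systemctl stop ceph.target            # every Ceph daemon on this host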

On Thu, Jan 11, 2018 at 4:49 AM, Nathan Cutler  wrote:
> To all who are running Ceph on CentOS or SUSE: do you use the "rcceph"
> script? The ceph RPMs ship it in /usr/sbin/rcceph
>
> (Why I ask: more-or-less the same functionality is provided by the
> ceph-osd.target and ceph-mon.target systemd units, and the script is no
> longer maintained, so we'd like to drop it from the RPM packaging unless
> someone is using it.)
>
> Thanks,
> Nathan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MGR Influx plugin 12.2.2

2018-01-11 Thread Benjeman Meekhof
Hi Reed,

Someone in our group originally wrote the plugin and put in PR.  Since
our commit the plugin was 'forward-ported' to master and made
incompatible with Luminous so we've been using our own version of the
plugin while waiting for the necessary pieces to be back-ported to
Luminous to use the modified upstream version.  Now we are in the
process of trying out the back-ported version that is in 12.2.2 as
well as adding some additional code from our version that collects pg
summary information (count of active, etc) and supports sending to
multiple influx destinations.  We'll attempt to PR any changes we
make.

So to answer your question:  Yes, we use it but not exactly the
version from upstream in production yet.  However in our testing the
module included with 12.2.2 appears to work as expected and we're
planning to move over to it and do any future work based from the
version in the upstream Ceph tree.

There is one issue/bug that may still exist:  because of how the
data point timestamps are written inside a loop through OSD stats, the
spread is sometimes wide enough that Grafana doesn't group properly
and you get the appearance of extreme spikes in derivative calculation
of rates.  We ended up modifying our code to calculate timestamps just
outside the loops that create data points and apply it to every point
created in loops through stats.  Of course we'll feed that back
upstream when we get to it and assuming it is still an issue in the
current code.

thanks,
Ben

On Thu, Jan 11, 2018 at 2:04 AM, Reed Dier  wrote:
> Hi all,
>
> Does anyone have any idea if the influx plugin for ceph-mgr is stable in
> 12.2.2?
>
> Would love to ditch collectd and report directly from ceph if that is the
> case.
>
> Documentation says that it is added in Mimic/13.x, however it looks like
> from an earlier ML post that it would be coming to Luminous.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021302.html
>
> I also see it as a disabled module currently:
>
> $ ceph mgr module ls
> {
> "enabled_modules": [
> "dashboard",
> "restful",
> "status"
> ],
> "disabled_modules": [
> "balancer",
> "influx",
> "localpool",
> "prometheus",
> "selftest",
> "zabbix"
> ]
> }
>
>
> Curious if anyone has been using it in place of CollectD/Telegraf for
> feeding InfluxDB with statistics.
>
> Thanks,
>
> Reed
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Nick Fisk
I take my hat off to you, well done for solving that!!!

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Zdenek Janda
> Sent: 11 January 2018 13:01
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)
> 
> Hi,
> we have restored the damaged OSDs that were not starting after the bug caused by this issue;
> detailed steps are for reference at
> http://tracker.ceph.com/issues/21142#note-9 . Should anybody hit this,
> it should fix it for you.
> Thanks
> Zdenek Janda
> 
> 
> 
> 
> On 11.1.2018 11:40, Zdenek Janda wrote:
> > Hi,
> > I have succeeded in identifying faulty PG:
> >
> >  -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d
> > needs 13939-15333  -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1
> > osd.15 15340 build_past_intervals_parallel over 13939-15333  -3448>
> > 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13939  -3447> 2018-01-11
> > 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map
> > 13939 - loading and decoding 0x55d39deefb80  -3446> 2018-01-11
> > 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 13939 27475 bytes
> >  -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting
> > [21,9] up [21,9], same_interval_since = 13939  -3444> 2018-01-11
> > 11:32:20.250505 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 13940  -3443> 2018-01-11
> > 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map
> > 13940 - loading and decoding 0x55d39deef800  -3442> 2018-01-11
> > 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 13940 27475 bytes
> > 
> > -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340
> > build_past_intervals_parallel epoch 15087
> > -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map
> > 15087 - loading and decoding 0x55d3f9e7e700
> > -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl
> > 15087 11409 bytes
> >  0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1
> > /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> > pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> > thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> > /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> > assert(interval.last > last)
> >
> > Lets see what can be done about this PG.
> >
> > Thanks
> > Zdenek Janda
> >
> >
> > On 11.1.2018 11:20, Zdenek Janda wrote:
> >> Hi,
> >>
> >> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
> >> last 1 lines of strace before ABRT. Crash ends with:
> >>
> >>  0.002429 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\354:\0\0"...,
> >> 12288, 908492996608) = 12288
> >>  0.007869 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\355:\0\0"...,
> >> 12288, 908493324288) = 12288
> >>  0.004220 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\356:\0\0"...,
> >> 12288, 908499615744) = 12288
> >>  0.009143 pread64(22,
> >>
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215
> >> {\357:\0\0"...,
> >> 12288, 908500926464) = 12288
> >>  0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
> >> 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> >> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> >> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
> >> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> >> assert(interval.last > last)
> >>
> >> Any suggestions are welcome, need to understand mechanism why this
> >> happened
> >>
> >> Thanks
> >> Zdenek Janda
> >>
> >>
> >> On 11.1.2018 10:48, Josef Zelenka wrote:
> >>> I have posted logs/strace from our osds with details to a ticket in
> >>> the ceph bug tracker - see here
> >>> http://tracker.ceph.com/issues/21142. You can see where exactly the
> >>> OSDs crash etc, this can be of help if someone decides to debug it.
> >>>
> >>> JZ
> >>>
> >>>
> >>> On 10/01/18 22:05, Josef Zelenka wrote:
> 
>  Hi, today we had a disasterous crash - we are running a 3 node, 24
>  osd in total cluster (8 each) with SSDs for blockdb, HDD for
>  bluestore data. This cluster is used as a radosgw backend, for
>  storing a big number of thumbnails for a file hosting site - around
>  110m files in total. We were adding an interface to the nodes which
>  required a restart, but after restarting one of the nodes, a lot of
>  the OSDs were kicked out of the cluster and rgw stopped working. We
>  have a lot of pgs down and unfound atm. OSDs can't be started(aside
>  from some, that's a mystery) with this error -  FAILED assert (
>  interval.last >
>  last) - they just periodically 

[ceph-users] Unable to join additional mon servers (luminous)

2018-01-11 Thread Thomas Gebhardt
Hello,

I'm running a ceph-12.2.2 cluster on debian/stretch with three mon
servers, unsuccessfully trying to add another (or two additional) mon
servers. While the new mon server stays in the "synchronizing" state, the
old mon servers drop out of quorum, endlessly changing state from "peon"
to "electing" or "probing", and eventually back to "peon" or "leader".

On a small test cluster everything works as expected: the new mons
painlessly join the cluster. But on my production cluster I always run
into trouble, both with ceph-deploy and with manual intervention. I'm
probably missing some fundamental factor. Can anyone give me a hint?

These are the existing mons:

my-ceph-mon-3: IP AAA.BBB.CCC.23
my-ceph-mon-4: IP AAA.BBB.CCC.24
my-ceph-mon-5: IP AAA.BBB.CCC.25

Trying to add

my-ceph-mon-1: IP AAA.BBB.CCC.31
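
In case it matters, the manual procedure I'm attempting boils down to
something like the following (keyring/monmap paths are just examples):

ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i my-ceph-mon-1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
systemctl start ceph-mon@my-ceph-mon-1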

Here is a (hopefully) relevant and representative part of the logs on
my-ceph-mon-5 when my-ceph-mon-1 tries to join:

2018-01-11 15:16:08.340741 7f69ba8db700  0
mon.my-ceph-mon-5@2(peon).data_health(6128) update_stats avail 57% total
19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:16:16.830566 7f69b48cf700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1
existing_state=STATE_STANDBY
2018-01-11 15:16:16.830582 7f69b48cf700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19cac2000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept peer reset, then tried to connect to us,
replacing
2018-01-11 15:16:16.831864 7f69b80d6700  1 mon.my-ceph-mon-5@2(peon) e15
 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:16.833701 7f69b50d0700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 0 vs existing csq=1
existing_state=STATE_STANDBY
2018-01-11 15:16:16.833713 7f69b50d0700  0 -- AAA.BBB.CCC.18:6789/0 >>
AAA.BBB.CCC.31:6789/0 conn(0x55d19c8ca000 :6789
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept peer reset, then tried to connect to us,
replacing
2018-01-11 15:16:16.834843 7f69b80d6700  1 mon.my-ceph-mon-5@2(peon) e15
 adding peer AAA.BBB.CCC.31:6789/0 to list of hints
2018-01-11 15:16:35.907962 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos active c 9653210..9653763)
lease_timeout -- calling new election
2018-01-11 15:16:35.908589 7f69b80d6700  0 mon.my-ceph-mon-5@2(probing)
e15 handle_command mon_command({"prefix": "status"} v 0) v1
2018-01-11 15:16:35.908630 7f69b80d6700  0 log_channel(audit) log [DBG]
: from='client.? 172.25.24.15:0/1078983440' entity='client.admin'
cmd=[{"prefix": "status"}]: dispatch
2018-01-11 15:16:35.909124 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:16:35.909284 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6128) init, last seen epoch 6128
2018-01-11 15:16:50.132414 7f69ba8db700  1
mon.my-ceph-mon-5@2(electing).elector(6129) init, last seen epoch 6129,
mid-election, bumping
2018-01-11 15:16:55.209177 7f69b80d6700 -1
mon.my-ceph-mon-5@2(peon).paxos(paxos recovering c 9653210..9653777)
lease_expire from mon.0 AAA.BBB.CCC.23:6789/0 is 0.032801 seconds in the
past; mons are probably laggy (or possibly clocks are too skewed)
2018-01-11 15:17:09.316472 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653778)
lease_timeout -- calling new election
2018-01-11 15:17:09.316597 7f69ba8db700  0
mon.my-ceph-mon-5@2(probing).data_health(6134) update_stats avail 57%
total 19548 MB, used 8411 MB, avail 11149 MB
2018-01-11 15:17:09.317414 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:09.317517 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6134) init, last seen epoch 6134
2018-01-11 15:17:22.059573 7f69ba8db700  1
mon.my-ceph-mon-5@2(peon).paxos(paxos updating c 9653210..9653779)
lease_timeout -- calling new election
2018-01-11 15:17:22.060021 7f69b80d6700  1
mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not
in quorum -- drop message
2018-01-11 15:17:22.060279 7f69b80d6700  1
mon.my-ceph-mon-5@2(probing).data_health(6138) service_dispatch_op not
in quorum -- drop message
2018-01-11 15:17:22.060499 7f69b80d6700  0 log_channel(cluster) log
[INF] : mon.my-ceph-mon-5 calling new monitor election
2018-01-11 15:17:22.060612 7f69b80d6700  1
mon.my-ceph-mon-5@2(electing).elector(6138) init, last seen epoch 6138
...

As far as I can see clock skew is not a problem (tested with "ntpq -p").

Any idea what might go wrong?

Thanks, Thomas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues on Luminous

2018-01-11 Thread Rafał Wądołowski

These drives are running as OSDs, not as journals.

What I can't understand is why the performance of rados bench
with 1 thread is 3 times slower, while ceph osd bench shows good results.


In my opinion, something like 20% less speed would be expected, because of software overhead.

I read the blog post
(http://ceph.com/geen-categorie/quick-analysis-of-the-ceph-io-layer/)
and it would be good to have an explanation of the difference.


@Mark, could you tell us (the community) whether this is normal behaviour
for these tests? What explains the difference?


BR,

Rafał Wądołowski

On 05.01.2018 19:29, Christian Wuerdig wrote:

You should do your reference test with dd with  oflag=direct,dsync

direct will only bypass the cache while dsync will fsync on every
block which is much closer to reality of what ceph is doing afaik
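
For example, rerunning the reference dd below with both flags would look
something like this (it writes directly to the device and will overwrite
data on it):

dd if=/dev/zero of=/dev/sdc bs=4M count=100 oflag=direct,dsync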

On Thu, Jan 4, 2018 at 9:54 PM, Rafał Wądołowski
 wrote:

Hi folks,

I am currently benchmarking my cluster because of a performance issue and I
have no idea what is going on. I am using these devices in qemu.

Ceph version 12.2.2

Infrastructure:

3 x Ceph-mon

11 x Ceph-osd

Ceph-osd has 22x1TB Samsung SSD 850 EVO 1TB

96GB RAM

2x E5-2650 v4

4x10G Network (2 seperate bounds for cluster and public) with MTU 9000


I had tested it with rados bench:

# rados bench -p rbdbench 30 write -t 1

Total time run: 30.055677
Total writes made:  1199
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 159.571
Stddev Bandwidth:   6.83601
Max bandwidth (MB/sec): 168
Min bandwidth (MB/sec): 140
Average IOPS:   39
Stddev IOPS:1
Max IOPS:   42
Min IOPS:   35
Average Latency(s): 0.0250656
Stddev Latency(s):  0.00321545
Max latency(s): 0.0471699
Min latency(s): 0.0206325

# ceph tell osd.0 bench
{
 "bytes_written": 1073741824,
 "blocksize": 4194304,
 "bytes_per_sec": 414199397
}

Testing osd directly

# dd if=/dev/zero of=/dev/sdc bs=4M oflag=direct count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB, 400 MiB) copied, 1.0066 s, 417 MB/s

When I do dd inside a VM (bs=4M with direct), I get results like those from
rados bench.

I think that the speed should be around ~400MB/s.

Are there any new parameters for rbd in Luminous? Maybe I forgot about some
performance tricks? If more information is needed, feel free to ask.

--
BR,
Rafal Wadolowski
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-11 Thread Alessandro De Salvo
Hi,
it took quite some time to recover the PGs, and indeed the problem with the
MDS instances was due to the activating PGs. Once they were cleared, the
fs went back to its original state.
I had to restart some OSDs a few times, though, in order to get all the
PGs activated. I didn't hit the limit on the max PGs per OSD, but I'm close
to it, so I have set it to 300 just to be safe (AFAIK that was the limit
in prior releases of Ceph; I'm not sure why it was lowered to 200).
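
For reference, this is roughly how I raised it (assuming the option is
picked up from the [global] section by the mon, mgr and osd daemons after
a restart):

[global]
mon_max_pg_per_osd = 300
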
Thanks,

Alessandro

On Tue, 2018-01-09 at 09:01 +0100, Burkhard Linke wrote:
> Hi,
> 
> 
> On 01/08/2018 05:40 PM, Alessandro De Salvo wrote:
> > Thanks Lincoln,
> >
> > indeed, as I said the cluster is recovering, so there are pending ops:
> >
> >
> > pgs: 21.034% pgs not active
> >  1692310/24980804 objects degraded (6.774%)
> >  5612149/24980804 objects misplaced (22.466%)
> >  458 active+clean
> >  329 active+remapped+backfill_wait
> >  159 activating+remapped
> >  100 active+undersized+degraded+remapped+backfill_wait
> >  58  activating+undersized+degraded+remapped
> >  27  activating
> >  22  active+undersized+degraded+remapped+backfilling
> >  6   active+remapped+backfilling
> >  1   active+recovery_wait+degraded
> >
> >
> > If it's just a matter to wait for the system to complete the recovery 
> > it's fine, I'll deal with that, but I was wondendering if there is a 
> > more suble problem here.
> >
> > OK, I'll wait for the recovery to complete and see what happens, thanks.
> 
> The blocked MDS might be caused by the 'activating' PGs. Do you have a 
> warning about too much PGs per OSD? If that is the case, 
> activating/creating/peering/whatever on the affected OSDs is blocked, 
> which leads to blocked requests etc.
> 
> You can resolve this be increasing the number of allowed PGs per OSD 
> ('mon_max_pg_per_osd'). AFAIK it needs to be set for mon, mgr and osd 
> instances. There was also been some discussion about this setting on the 
> mailing list in the last weeks.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi,
we have restored the damaged OSDs that would not start after hitting this
bug; detailed steps are documented for reference at
http://tracker.ceph.com/issues/21142#note-9 . Should anybody else hit
this, those steps should fix it for you.
Thanks
Zdenek Janda




On 11.1.2018 11:40, Zdenek Janda wrote:
> Hi,
> I have succeeded in identifying faulty PG:
> 
>  -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d
> needs 13939-15333
>  -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1 osd.15 15340
> build_past_intervals_parallel over 13939-15333
>  -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340
> build_past_intervals_parallel epoch 13939
>  -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map
> 13939 - loading and decoding 0x55d39deefb80
>  -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl
> 13939 27475 bytes
>  -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340
> build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting
> [21,9] up [21,9], same_interval_since = 13939
>  -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340
> build_past_intervals_parallel epoch 13940
>  -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map
> 13940 - loading and decoding 0x55d39deef800
>  -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl
> 13940 27475 bytes
> 
> -3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340
> build_past_intervals_parallel epoch 15087
> -2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map
> 15087 - loading and decoding 0x55d3f9e7e700
> -1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl
> 15087 11409 bytes
>  0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1
> /build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> assert(interval.last > last)
> 
> Lets see what can be done about this PG.
> 
> Thanks
> Zdenek Janda
> 
> 
> On 11.1.2018 11:20, Zdenek Janda wrote:
>> Hi,
>>
>> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
>> last 1 lines of strace before ABRT. Crash ends with:
>>
>>  0.002429 pread64(22,
>> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
>> 12288, 908492996608) = 12288
>>  0.007869 pread64(22,
>> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
>> 12288, 908493324288) = 12288
>>  0.004220 pread64(22,
>> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
>> 12288, 908499615744) = 12288
>>  0.009143 pread64(22,
>> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
>> 12288, 908500926464) = 12288
>>  0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
>> 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
>> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
>> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
>> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
>> assert(interval.last > last)
>>
>> Any suggestions are welcome, need to understand mechanism why this happened
>>
>> Thanks
>> Zdenek Janda
>>
>>
>> On 11.1.2018 10:48, Josef Zelenka wrote:
>>> I have posted logs/strace from our osds with details to a ticket in the
>>> ceph bug tracker - see here http://tracker.ceph.com/issues/21142. You
>>> can see where exactly the OSDs crash etc, this can be of help if someone
>>> decides to debug it.
>>>
>>> JZ
>>>
>>>
>>> On 10/01/18 22:05, Josef Zelenka wrote:

 Hi, today we had a disasterous crash - we are running a 3 node, 24 osd
 in total cluster (8 each) with SSDs for blockdb, HDD for bluestore
 data. This cluster is used as a radosgw backend, for storing a big
 number of thumbnails for a file hosting site - around 110m files in
 total. We were adding an interface to the nodes which required a
 restart, but after restarting one of the nodes, a lot of the OSDs were
 kicked out of the cluster and rgw stopped working. We have a lot of
 pgs down and unfound atm. OSDs can't be started(aside from some,
 that's a mystery) with this error -  FAILED assert ( interval.last >
 last) - they just periodically restart. So far, the cluster is broken
 and we can't seem to bring it back up. We tried fscking the osds via
 the ceph objectstore tool, but it was no good. The root of all this
 seems to be in the FAILED assert(interval.last > last) error, however
 i can't find any info regarding this or how to fix it. Did someone
 here also encounter it? We're running luminous on ubuntu 16.04.

 Thanks

 Josef Zelenka

 Cloudevelops



 ___
 ceph-users mailing list
 

Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Alfredo Deza
On Thu, Jan 11, 2018 at 4:30 AM, Dietmar Rieder
 wrote:
> Hello,
>
> we have failed OSD disk in our Luminous v12.2.2 cluster that needs to
> get replaced.
>
> The cluster was initially deployed using ceph-deploy on Luminous
> v12.2.0. The OSDs were created using
>
> ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
> --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1
>
> Note we separated the bluestore data, wal and db.
>
> We updated to Luminous v12.2.1 and further to Luminous v12.2.2.
>
> With the last update we also let ceph-volume take over the OSDs using
> "ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
> simple activate ${osd} ${id}". All of this went smoothly.

That is good to hear!

>
> Now wonder what is the correct way to replace a failed OSD block disk?
>
> The docs for luminous [1] say:
>
> REPLACING AN OSD
>
> 1. Destroy the OSD first:
>
> ceph osd destroy {id} --yes-i-really-mean-it
>
> 2. Zap a disk for the new OSD, if the disk was used before for other
> purposes. It’s not necessary for a new disk:
>
> ceph-disk zap /dev/sdX
>
>
> 3. Prepare the disk for replacement by using the previously destroyed
> OSD id:
>
> ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid `uuidgen`
>
>
> 4. And activate the OSD:
>
> ceph-disk activate /dev/sdX1
>
>
> Initially this seems to be straight forward, but
>
> 1. I'm not sure if there is something to do with the still existing
> bluefs db and wal partitions on the nvme device for the failed OSD. Do
> they have to be zapped ? If yes, what is the best way? There is nothing
> mentioned in the docs.

What is your concern here if the activation seems to work?

>
> 2. Since we already let "ceph-volume simple" take over our OSDs I'm not
> sure if we should now use ceph-volume or again ceph-disk (followed by
> "ceph-vloume simple" takeover) to prepare and activate the OSD?

The `simple` sub-command is meant to help with the activation of OSDs
at boot time, supporting ceph-disk (or manual) created OSDs.

There is no requirement to use `ceph-volume lvm` which is intended for
new OSDs using LVM as devices.

>
> 3. If we should use ceph-volume, then by looking at the luminous
> ceph-volume docs [2] I find for both,
>
> ceph-volume lvm prepare
> ceph-volume lvm activate
>
> that the bluestore option is either NOT implemented or NOT supported
>
> activate:  [–bluestore] filestore (IS THIS A TYPO???) objectstore (not
> yet implemented)
> prepare: [–bluestore] Use the bluestore objectstore (not currently
> supported)

These might be a typo on the man page, will get that addressed. Ticket
opened at http://tracker.ceph.com/issues/22663

bluestore as of 12.2.2 is fully supported and it is the default. The
--help output in ceph-volume does have the flags updated and correctly
showing this.

>
>
> So, now I'm completely lost. How is all of this fitting together in
> order to replace a failed OSD?

You would need to keep using ceph-disk. Unless you want ceph-volume to
take over, in which case you would need to follow the steps to deploy
a new OSD
with ceph-volume.

Note that although --osd-id is supported, there is an issue with that
on 12.2.2 that would prevent you from correctly deploying it
http://tracker.ceph.com/issues/22642

The recommendation, if you want to use ceph-volume, would be to omit
--osd-id and let the cluster give you the ID.
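
A minimal sketch of that, assuming 12.2.2 with placeholder device paths
(and assuming ceph-volume is allowed to create the VG/LV on the raw data
device; otherwise create them first and pass the vg/lv names):

ceph-volume lvm create --bluestore \
    --data /dev/sdX \
    --block.db /dev/nvme0n1pY \
    --block.wal /dev/nvme0n1pZ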

>
> 4. More after reading some a recent threads on this list additional
> questions are coming up:
>
> According to the OSD replacement doc [1] :
>
> "When disks fail, [...], OSDs need to be replaced. Unlike Removing the
> OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
> keep -> kept] intact after the OSD is destroyed for replacement."
>
> but
> http://tracker.ceph.com/issues/22642 seems to say that it is not
> possible to reuse am OSD's id

That is a ceph-volume specific issue, unrelated to how replacement in
Ceph works.

>
>
> So I'm quite lost with an essential and very basic seemingly simple task
> of storage management.

You have two choices:

1) keep using ceph-disk as always, even though you have "ported" your
OSDs with `ceph-volume simple`
2) Deploy new OSDs with ceph-volume

For #1 you will want to keep running `simple` on newly deployed OSDs
so that they can come up after a reboot, since `simple` disables the
udev rules
that caused activation with ceph-disk
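
Roughly, for #1 the replacement would then look something like this
(device paths, OSD id and fsid are placeholders):

ceph-disk prepare --bluestore --block.db /dev/nvme0n1 --block.wal /dev/nvme0n1 /dev/sdX
ceph-disk activate /dev/sdX1
ceph-volume simple scan /var/lib/ceph/osd/ceph-<id>
ceph-volume simple activate <id> <fsid>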

>
> Thanks for any help here.
>
> ~Dietmar
>
>
> [1]: http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/
> [2]: http://docs.ceph.com/docs/luminous/man/8/ceph-volume/
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] ceph-volume does not support upstart

2018-01-11 Thread Alfredo Deza
On Wed, Jan 10, 2018 at 8:38 PM, 赵赵贺东  wrote:
> Hello,
> I am sorry for the delay.
> Thank you for your suggestion.
>
> It is better to update system or keep using ceph-disk in fact.
> Thank you Alfredo Deza & Cary.

Both are really OK options for you for now. Unless ceph-disk is
causing you issues, in which case, the recommended thing to do is to
upgrade to a systemd system that ceph-volume supports
>
>
>> On 8 Jan 2018, at 11:41 PM, Alfredo Deza wrote:
>>
>> ceph-volume relies on systemd, it will not work with upstart. Going
>> the fstab way might work, but most of the lvm implementation will want
>> to do systemd-related calls like enabling units and placing files.
>>
>> For upstart you might want to keep using ceph-disk, unless upgrading
>> to a newer OS is an option in which case ceph-volume would work (as
>> long as systemd is available)
>>
>> On Sat, Dec 30, 2017 at 9:11 PM, 赵赵贺东  wrote:
>>> Hello Cary,
>>>
>>> Thank you for your detailed description, it’s really helpful for me!
>>> I will have a try when I get back to my office!
>>>
>>> Thank you for your attention to this matter.
>>>
>>>
>>> On 30 Dec 2017, at 3:51 AM, Cary wrote:
>>>
>>> Hello,
>>>
>>> I mount my Bluestore OSDs in /etc/fstab:
>>>
>>> vi /etc/fstab
>>>
>>> tmpfs   /var/lib/ceph/osd/ceph-12  tmpfs   rw,relatime 0 0
>>> =
>>> Then mount everyting in fstab with:
>>> mount -a
>>> ==
>>> I activate my OSDs this way on startup: You can find the fsid with
>>>
>>> cat /var/lib/ceph/osd/ceph-12/fsid
>>>
>>> Then add file named ceph.start so ceph-volume will be run at startup.
>>>
>>> vi /etc/local.d/ceph.start
>>> ceph-volume lvm activate 12 827f4a2c-8c1b-427b-bd6c-66d31a0468ac
>>> ==
>>> Make it executable:
>>> chmod 700 /etc/local.d/ceph.start
>>> ==
>>> cd /etc/local.d/
>>> ./ceph.start
>>> ==
>>> I am a Gentoo user and use OpenRC, so this may not apply to you.
>>> ==
>>> cd /etc/init.d/
>>> ln -s ceph ceph-osd.12
>>> /etc/init.d/ceph-osd.12 start
>>> rc-update add ceph-osd.12 default
>>>
>>> Cary
>>>
>>> On Fri, Dec 29, 2017 at 8:47 AM, 赵赵贺东  wrote:
>>>
>>> Hello Cary!
>>> It’s really big surprise for me to receive your reply!
>>> Sincere thanks to you!
>>> I know it’s a fake execute file, but it works!
>>>
>>> >
>>> $ cat /usr/sbin/systemctl
>>> #!/bin/bash
>>> exit 0
>>> <
>>>
>>> I can start my osd by following command
>>> /usr/bin/ceph-osd --cluster=ceph -i 12 -f --setuser ceph --setgroup ceph
>>>
>>> But, threre are still problems.
>>> 1.Though ceph-osd can start successfully, prepare log and activate log looks
>>> like errors occurred.
>>>
>>> Prepare log:
>>> ===>
>>> # ceph-volume lvm prepare --bluestore --data vggroup/lv
>>> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-12
>>> Running command: chown -R ceph:ceph /dev/dm-0
>>> Running command: sudo ln -s /dev/vggroup/lv /var/lib/ceph/osd/ceph-12/block
>>> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd
>>> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
>>> /var/lib/ceph/osd/ceph-12/activate.monmap
>>> stderr: got monmap epoch 1
>>> Running command: ceph-authtool /var/lib/ceph/osd/ceph-12/keyring
>>> --create-keyring --name osd.12 --add-key
>>> AQAQ+UVa4z2ANRAAmmuAExQauFinuJuL6A56ww==
>>> stdout: creating /var/lib/ceph/osd/ceph-12/keyring
>>> stdout: added entity osd.12 auth auth(auid = 18446744073709551615
>>> key=AQAQ+UVa4z2ANRAAmmuAExQauFinuJuL6A56ww== with 0 caps)
>>> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-12/keyring
>>> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-12/
>>> Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore
>>> --mkfs -i 12 --monmap /var/lib/ceph/osd/ceph-12/activate.monmap --key
>>>  --osd-data
>>> /var/lib/ceph/osd/ceph-12/ --osd-uuid 827f4a2c-8c1b-427b-bd6c-66d31a0468ac
>>> --setuser ceph --setgroup ceph
>>> stderr: warning: unable to create /var/run/ceph: (13) Permission denied
>>> stderr: 2017-12-29 08:13:08.609127 b66f3000 -1 asok(0x850c62a0)
>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>>> bind the UNIX domain socket to '/var/run/ceph/ceph-osd.12.asok': (2) No such
>>> file or directory
>>> stderr:
>>> stderr: 2017-12-29 08:13:08.643410 b66f3000 -1
>>> bluestore(/var/lib/ceph/osd/ceph-12//block) _read_bdev_label unable to
>>> decode label at offset 66: buffer::malformed_input: void
>>> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past
>>> end of struct encoding
>>> stderr: 

Re: [ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Konstantin Shalygin

Now wonder what is the correct way to replace a failed OSD block disk?


The generic way to prepare for maintenance (e.g. a disk replacement) is to
rebalance by changing the OSD's crush weight:

ceph osd crush reweight osd.<osd_id> 0

The cluster then migrates the data away from this OSD.
When the cluster is HEALTH_OK again, you can safely remove the OSD:

ceph osd out <osd_id>
systemctl stop ceph-osd@<osd_id>

ceph osd crush remove osd.<osd_id>
ceph auth del osd.<osd_id>
ceph osd rm <osd_id>



I'm not sure if there is something to do with the still existing bluefs db and 
wal partitions on the nvme device for the failed OSD. Do they have to be zapped 
? If yes, what is the best way?



1. Find the nvme partitions for this OSD. You can do this in several ways: with
ceph-volume, by hand, or with "ceph-disk list" (which is more human readable):

/dev/sda :
 /dev/sda1 ceph data, active, cluster ceph, osd.0, block /dev/sda2, block.db 
/dev/nvme2n1p1, block.wal /dev/nvme2n1p2
 /dev/sda2 ceph block, for /dev/sda1

2. Delete the partitions via parted or fdisk.

fdisk -u /dev/nvme2n1
d (delete partitions)
enter partition number of block.db: 1
d
enter partition number of block.wal: 2
w (write partition table)
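
Alternatively, if gdisk is installed, the same can be done non-interactively
(partition numbers as in the example above):

sgdisk -d 1 /dev/nvme2n1
sgdisk -d 2 /dev/nvme2n1
partprobe /dev/nvme2n1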

3. Deploy your new OSD.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Does anyone use rcceph script in CentOS/SUSE?

2018-01-11 Thread Nathan Cutler
To all who are running Ceph on CentOS or SUSE: do you use the "rcceph" 
script? The ceph RPMs ship it in /usr/sbin/rcceph


(Why I ask: more-or-less the same functionality is provided by the 
ceph-osd.target and ceph-mon.target systemd units, and the script is no 
longer maintained, so we'd like to drop it from the RPM packaging unless 
someone is using it.)


Thanks,
Nathan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One object degraded cause all ceph requests hang - Jewel 10.2.6 (rbd + radosgw)

2018-01-11 Thread Vincent Godin
As no responses were given, I will explain what I found; maybe it
will help other people.

The .dir.XXX object is an index marker with a data size of 0. The metadata
associated with this object (located in the LevelDB of the OSDs
currently holding the marker) is the index of the bucket
corresponding to this marker.
My problem came from the number of objects stored in this bucket:
more than 50 million. As the size of an entry in the index is
between 200 and 250 bytes, the index should be around 12 GB. That's
why it is recommended to add one index shard for every 100,000
objects.
During a rebalance, some PGs move from some OSDs to others.
While an index is moving, all write requests to the bucket are
blocked until the operation completes. During this move, the user had
launched an upload batch on the bucket, so a lot of requests were
blocked, which in turn blocked all the requests on the primary PGs held
by the OSD.
So the loop I saw was in fact just normal, but moving a 12 GB
index from one SATA disk to another takes several minutes, too long in
fact for a Ceph cluster with a lot of clients to survive.
The lesson of this story is: don't forget to shard your bucket indexes!
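
For reference, two ways to do that (the shard counts here are just
examples): a default for newly created buckets can be set in ceph.conf on
the rgw side, and sufficiently recent radosgw-admin versions can reshard
an existing bucket offline:

# ceph.conf, rgw section: applies to newly created buckets
rgw_override_bucket_index_max_shards = 64

# offline reshard of an existing bucket (quiesce writes to it first)
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=512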


---
Yesterday we just encountered this bug. One OSD was looping on
"2018-01-03 16:20:59.148121 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 30.254269 seconds old, received at 2018-01-03
16:20:28.883837: osd_op(client.48285929.0:14601958 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_prepare_op] snapc 0=[] ondisk+write+known_if_redirected
e359833) currently waiting for degraded object".

The requests on this OSD.150 went quickly in blocked state

2018-01-03 16:25:56.241064 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 20 slow requests, 1 included below; oldest blocked for >
327.357139 secs
2018-01-03 16:30:19.299288 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 45 slow requests, 1 included below; oldest blocked for >
590.415387 secs
...
...
2018-01-03 16:46:04.900204 7f011a6a1700  0 log_channel(cluster) log
[WRN] : 100 slow requests, 2 included below; oldest blocked for >
1204.060056 secs

while still looping

2018-01-03 16:46:04.900220 7f011a6a1700  0 log_channel(cluster) log
[WRN] : slow request 123.294762 seconds old, received at 2018-01-03
16:44:01.605320 : osd_op(client.48285929.0:14605228 35.8abfc02e
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 [call
rgw.bucket_complete_op] snapc 0=[]
ack+ondisk+write+known_if_redirected e359833) currently waiting for
degraded object

All theses resquest were blocked on OSD.150.
A lot of VMs attached to Ceph were hanging.

The degraded object was
.dir.0a3e5369-ff79-4f7d-b0b6-79c5a75b1759.29113876.1 in pg 35.2e.
This PG was located on 4 OSDs. The object has a size of 0 on all 4 OSDs.
It was not possible to get a response from a query on pg 35.2e.
Killing OSD.150 led to the requests being blocked on the new primary.

I found the relatively new bug #22072, which looks like mine, but there
was no response from the Ceph team. I finally tried the same solution:
rados rm -p pool/degraded_object, but with no response from the
command. I stopped the command after 15 min. A few minutes later, the 4
OSDs holding pg 35.2e suddenly rebooted and the problem was
solved. The object was deleted on the 4 OSDs.

Anyway, it led to a production break, and I have no idea what
produced the "degraded object", nor am I sure whether the solution came
from my command or from an internal process. At this time we are still
trying to repair some filesystems of the VMs attached to Ceph, and I
have to explain that this whole production break came from one empty
object ... The real question is why Ceph was unable to handle this
"degraded object" and looped on it, blocking all the requests on
OSD.150?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Future

2018-01-11 Thread Massimiliano Cuttini

Hi everybody,

I'm always looking at Ceph for the future.
But I see several issues that are left unresolved and that block
adoption in the near future.

I would like to know if there are already some answers:

1) Separation between client and server distribution
At this time you always have to update client & server to matching
releases of Ceph.
This is OK in the early releases, but in the future I expect the
ceph client to be ONE, not a different one for every major version.
The client should be able to determine by itself which protocol version
and which features can be enabled, and to connect to at least 3 or 5
older major versions of Ceph.


2) Kernel is old -> feature mismatch
OK, the kernel is old, so what? Just don't use it and fall back to NBD.
And please don't even make me care: just virtualize it under the hood.

3) Management complexity
Ceph is amazing, but it is just too big to keep everything under control
(too many services).
Now there is a management console, but as far as I have read, this
management console just shows basic performance data.

So it doesn't manage at all... it's just a monitor...

In the end you still have to manage everything from the command line.
To manage via the web, the following are mandatory:

 * create, delete, enable, disable services
   If I need to run a redundant iSCSI gateway, do I really need to
   copy and paste commands from your online docs?
   Of course not. You can script it better than any admin could.
   Just take a few arguments from the HTML form and that's all.

 * create, delete, enable, disable users
   I have to create users and keys for 24 servers. Do you really think
   it's possible to do that without mistranscribing or badly pasting
   some of the keys across all the servers?
   Everybody ends up just copying the admin key across all servers, giving
   very insecure full permissions to all clients.

 * create maps (server, datacenter, rack, node, osd)
   This is mandatory to design how the data needs to be replicated.
   It's not good to create this by script or shell; what's needed is a
   graphical editor which can give you the perspective of what will be
   copied where.

 * check the hardware under the hood
   Checking the health of the underlying hardware is missing.
   But Ceph was born as storage software that ensures redundancy and
   protects you from single failures.
   So WHY simply ignore checking the health of disks with SMART?
   FreeNAS does a better job here, giving lots of tools to
   understand which disk is which and whether it will fail in the near
   future.
   Of course Ceph too could forecast issues by itself, but it needs
   to start integrating with basic hardware I/O.
   For example, it should be possible to enable/disable the UID light on
   a disk in order to know which one needs to be replaced.
   I guess this kind of feature is quite standard across all Linux
   distributions.

The management complexity can be completely overcome with a great web
manager.
A web manager, in the end, is just a wrapper for shell commands run from
the Ceph admin node against the others.
If you think about it, such a wrapper is far easier to develop
than what has already been developed.
I really do see that Ceph is the future of storage. But there is some
easily avoidable complexity that needs to be reduced.


If there are already some plans for these issues, I really would like to
know.

Thanks,
Max



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to get the usage of an indexless-bucket

2018-01-11 Thread Vincent Godin
How can we know the usage of an indexless bucket? We need this
information for our billing process.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi,
I have succeeded in identifying faulty PG:

 -3450> 2018-01-11 11:32:20.015658 7f066e2a3e00 10 osd.15 15340 12.62d
needs 13939-15333
 -3449> 2018-01-11 11:32:20.019405 7f066e2a3e00  1 osd.15 15340
build_past_intervals_parallel over 13939-15333
 -3448> 2018-01-11 11:32:20.019436 7f066e2a3e00 10 osd.15 15340
build_past_intervals_parallel epoch 13939
 -3447> 2018-01-11 11:32:20.019447 7f066e2a3e00 20 osd.15 0 get_map
13939 - loading and decoding 0x55d39deefb80
 -3446> 2018-01-11 11:32:20.249771 7f066e2a3e00 10 osd.15 0 add_map_bl
13939 27475 bytes
 -3445> 2018-01-11 11:32:20.250392 7f066e2a3e00 10 osd.15 15340
build_past_intervals_parallel epoch 13939 pg 12.62d first map, acting
[21,9] up [21,9], same_interval_since = 13939
 -3444> 2018-01-11 11:32:20.250505 7f066e2a3e00 10 osd.15 15340
build_past_intervals_parallel epoch 13940
 -3443> 2018-01-11 11:32:20.250529 7f066e2a3e00 20 osd.15 0 get_map
13940 - loading and decoding 0x55d39deef800
 -3442> 2018-01-11 11:32:20.251883 7f066e2a3e00 10 osd.15 0 add_map_bl
13940 27475 bytes

-3> 2018-01-11 11:32:26.973843 7f066e2a3e00 10 osd.15 15340
build_past_intervals_parallel epoch 15087
-2> 2018-01-11 11:32:26.973999 7f066e2a3e00 20 osd.15 0 get_map
15087 - loading and decoding 0x55d3f9e7e700
-1> 2018-01-11 11:32:26.984286 7f066e2a3e00 10 osd.15 0 add_map_bl
15087 11409 bytes
 0> 2018-01-11 11:32:26.990595 7f066e2a3e00 -1
/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
thread 7f066e2a3e00 time 2018-01-11 11:32:26.984716
/build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
assert(interval.last > last)

Lets see what can be done about this PG.

Thanks
Zdenek Janda


On 11.1.2018 11:20, Zdenek Janda wrote:
> Hi,
> 
> updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
> last 1 lines of strace before ABRT. Crash ends with:
> 
>  0.002429 pread64(22,
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
> 12288, 908492996608) = 12288
>  0.007869 pread64(22,
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
> 12288, 908493324288) = 12288
>  0.004220 pread64(22,
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
> 12288, 908499615744) = 12288
>  0.009143 pread64(22,
> "\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
> 12288, 908500926464) = 12288
>  0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
> 275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
> pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
> thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
> /build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
> assert(interval.last > last)
> 
> Any suggestions are welcome, need to understand mechanism why this happened
> 
> Thanks
> Zdenek Janda
> 
> 
> On 11.1.2018 10:48, Josef Zelenka wrote:
>> I have posted logs/strace from our osds with details to a ticket in the
>> ceph bug tracker - see here http://tracker.ceph.com/issues/21142. You
>> can see where exactly the OSDs crash etc, this can be of help if someone
>> decides to debug it.
>>
>> JZ
>>
>>
>> On 10/01/18 22:05, Josef Zelenka wrote:
>>>
>>> Hi, today we had a disasterous crash - we are running a 3 node, 24 osd
>>> in total cluster (8 each) with SSDs for blockdb, HDD for bluestore
>>> data. This cluster is used as a radosgw backend, for storing a big
>>> number of thumbnails for a file hosting site - around 110m files in
>>> total. We were adding an interface to the nodes which required a
>>> restart, but after restarting one of the nodes, a lot of the OSDs were
>>> kicked out of the cluster and rgw stopped working. We have a lot of
>>> pgs down and unfound atm. OSDs can't be started(aside from some,
>>> that's a mystery) with this error -  FAILED assert ( interval.last >
>>> last) - they just periodically restart. So far, the cluster is broken
>>> and we can't seem to bring it back up. We tried fscking the osds via
>>> the ceph objectstore tool, but it was no good. The root of all this
>>> seems to be in the FAILED assert(interval.last > last) error, however
>>> i can't find any info regarding this or how to fix it. Did someone
>>> here also encounter it? We're running luminous on ubuntu 16.04.
>>>
>>> Thanks
>>>
>>> Josef Zelenka
>>>
>>> Cloudevelops
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] How to "reset" rgw?

2018-01-11 Thread Martin Emrich

Ok thanks, I'll try it out...
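
That is, roughly the following, assuming mon_allow_pool_delete is enabled
on the mons and that all (and only) the rgw pools match "rgw" in their
names:

ceph osd pool ls | grep rgw
for p in $(ceph osd pool ls | grep rgw); do
    ceph osd pool delete $p $p --yes-i-really-really-mean-it
done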

Regards,

Martin


On 10.01.18 at 18:48, Casey Bodley wrote:


On 01/10/2018 04:34 AM, Martin Emrich wrote:

Hi!

As I cannot find any solution for my broken rgw pools, the only way 
out is to give up and "reset".


How do I throw away all rgw data from a ceph cluster? Just delete all 
rgw pools? Or are some parts stored elsewhere (monitor, ...)?


Thanks,

Martin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Deleting all of rgw's pools should be sufficient.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi,

updated the issue at http://tracker.ceph.com/issues/21142#note-5 with
last 1 lines of strace before ABRT. Crash ends with:

 0.002429 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\354:\0\0"...,
12288, 908492996608) = 12288
 0.007869 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\355:\0\0"...,
12288, 908493324288) = 12288
 0.004220 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\356:\0\0"...,
12288, 908499615744) = 12288
 0.009143 pread64(22,
"\10\7\213,\0\0\6\1i\33\0\0c\341\353kW\rC\365\2310\34\307\212\270\215{\357:\0\0"...,
12288, 908500926464) = 12288
 0.010802 write(2, "/build/ceph-12.2.1/src/osd/osd_t"...,
275/build/ceph-12.2.1/src/osd/osd_types.cc: In function 'virtual void
pi_compact_rep::add_interval(bool, const PastIntervals::pg_interval_t&)'
thread 7fb85e234e00 time 2018-01-11 11:02:54.783628
/build/ceph-12.2.1/src/osd/osd_types.cc: 3205: FAILED
assert(interval.last > last)

Any suggestions are welcome, need to understand mechanism why this happened

Thanks
Zdenek Janda


On 11.1.2018 10:48, Josef Zelenka wrote:
> I have posted logs/strace from our osds with details to a ticket in the
> ceph bug tracker - see here http://tracker.ceph.com/issues/21142. You
> can see where exactly the OSDs crash etc, this can be of help if someone
> decides to debug it.
> 
> JZ
> 
> 
> On 10/01/18 22:05, Josef Zelenka wrote:
>>
>> Hi, today we had a disasterous crash - we are running a 3 node, 24 osd
>> in total cluster (8 each) with SSDs for blockdb, HDD for bluestore
>> data. This cluster is used as a radosgw backend, for storing a big
>> number of thumbnails for a file hosting site - around 110m files in
>> total. We were adding an interface to the nodes which required a
>> restart, but after restarting one of the nodes, a lot of the OSDs were
>> kicked out of the cluster and rgw stopped working. We have a lot of
>> pgs down and unfound atm. OSDs can't be started(aside from some,
>> that's a mystery) with this error -  FAILED assert ( interval.last >
>> last) - they just periodically restart. So far, the cluster is broken
>> and we can't seem to bring it back up. We tried fscking the osds via
>> the ceph objectstore tool, but it was no good. The root of all this
>> seems to be in the FAILED assert(interval.last > last) error, however
>> i can't find any info regarding this or how to fix it. Did someone
>> here also encounter it? We're running luminous on ubuntu 16.04.
>>
>> Thanks
>>
>> Josef Zelenka
>>
>> Cloudevelops
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Zdenek Janda
Hi,
can anyone suggest what to do with this? I have identified the
underlying crashing code in src/osd/osd_types.cc [assert(interval.last >
last);], committed by Sage Weil, but I have not figured out the exact
mechanism of the function and why it crashes. Also unclear is how this bug
spread and crashed so many healthy OSDs, which are now unable to start.
This seems like a pretty serious issue, as it can take down large numbers
of OSDs without breaking a sweat. What can we do here now?
Thanks
Zdenek Janda


On 11.1.2018 10:48, Josef Zelenka wrote:
> I have posted logs/strace from our osds with details to a ticket in the
> ceph bug tracker - see here http://tracker.ceph.com/issues/21142. You
> can see where exactly the OSDs crash etc, this can be of help if someone
> decides to debug it.
> 
> JZ
> 
> 
> On 10/01/18 22:05, Josef Zelenka wrote:
>>
>> Hi, today we had a disasterous crash - we are running a 3 node, 24 osd
>> in total cluster (8 each) with SSDs for blockdb, HDD for bluestore
>> data. This cluster is used as a radosgw backend, for storing a big
>> number of thumbnails for a file hosting site - around 110m files in
>> total. We were adding an interface to the nodes which required a
>> restart, but after restarting one of the nodes, a lot of the OSDs were
>> kicked out of the cluster and rgw stopped working. We have a lot of
>> pgs down and unfound atm. OSDs can't be started(aside from some,
>> that's a mystery) with this error -  FAILED assert ( interval.last >
>> last) - they just periodically restart. So far, the cluster is broken
>> and we can't seem to bring it back up. We tried fscking the osds via
>> the ceph objectstore tool, but it was no good. The root of all this
>> seems to be in the FAILED assert(interval.last > last) error, however
>> i can't find any info regarding this or how to fix it. Did someone
>> here also encounter it? We're running luminous on ubuntu 16.04.
>>
>> Thanks
>>
>> Josef Zelenka
>>
>> Cloudevelops
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)

2018-01-11 Thread Josef Zelenka
I have posted logs/strace from our osds with details to a ticket in the 
ceph bug tracker - see here http://tracker.ceph.com/issues/21142. You 
can see where exactly the OSDs crash etc, this can be of help if someone 
decides to debug it.


JZ


On 10/01/18 22:05, Josef Zelenka wrote:


Hi, today we had a disastrous crash - we are running a 3 node, 24 OSD
in total cluster (8 each) with SSDs for the block DB and HDDs for the
bluestore data. This cluster is used as a radosgw backend, for storing a
big number of thumbnails for a file hosting site - around 110m files in
total. We were adding an interface to the nodes, which required a
restart, but after restarting one of the nodes, a lot of the OSDs were
kicked out of the cluster and rgw stopped working. We have a lot of
PGs down and unfound at the moment. The OSDs can't be started (aside from
some, and that's a mystery) with this error - FAILED
assert(interval.last > last) - they just periodically restart. So far,
the cluster is broken and we can't seem to bring it back up. We tried
fscking the OSDs via the ceph objectstore tool, but it was no good. The
root of all this seems to be the FAILED assert(interval.last > last)
error, however I can't find any info regarding this or how to fix it.
Did someone here also encounter it? We're running Luminous on Ubuntu 16.04.


Thanks

Josef Zelenka

Cloudevelops



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] replace failed disk in Luminous v12.2.2

2018-01-11 Thread Dietmar Rieder
Hello,

we have failed OSD disk in our Luminous v12.2.2 cluster that needs to
get replaced.

The cluster was initially deployed using ceph-deploy on Luminous
v12.2.0. The OSDs were created using

ceph-deploy osd create --bluestore cephosd-${osd}:/dev/sd${disk}
--block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

Note we separated the bluestore data, wal and db.

We updated to Luminous v12.2.1 and further to Luminous v12.2.2.

With the last update we also let ceph-volume take over the OSDs using
"ceph-volume simple scan  /var/lib/ceph/osd/$osd" and "ceph-volume
simple activate ${osd} ${id}". All of this went smoothly.

Now I wonder what is the correct way to replace a failed OSD block disk?

The docs for luminous [1] say:

REPLACING AN OSD

1. Destroy the OSD first:

ceph osd destroy {id} --yes-i-really-mean-it

2. Zap a disk for the new OSD, if the disk was used before for other
purposes. It’s not necessary for a new disk:

ceph-disk zap /dev/sdX


3. Prepare the disk for replacement by using the previously destroyed
OSD id:

ceph-disk prepare --bluestore /dev/sdX  --osd-id {id} --osd-uuid `uuidgen`


4. And activate the OSD:

ceph-disk activate /dev/sdX1


Initially this seems to be straightforward, but

1. I'm not sure if there is something to do with the still existing
bluefs db and wal partitions on the nvme device for the failed OSD. Do
they have to be zapped ? If yes, what is the best way? There is nothing
mentioned in the docs.

2. Since we already let "ceph-volume simple" take over our OSDs, I'm not
sure if we should now use ceph-volume or again ceph-disk (followed by a
"ceph-volume simple" takeover) to prepare and activate the OSD?

3. If we should use ceph-volume, then by looking at the luminous
ceph-volume docs [2] I find for both,

ceph-volume lvm prepare
ceph-volume lvm activate

that the bluestore option is either NOT implemented or NOT supported

activate: [--bluestore] filestore (IS THIS A TYPO???) objectstore (not
yet implemented)
prepare: [--bluestore] Use the bluestore objectstore (not currently
supported)


So, now I'm completely lost. How is all of this fitting together in
order to replace a failed OSD?

4. Moreover, after reading some recent threads on this list, additional
questions come up:

According to the OSD replacement doc [1] :

"When disks fail, [...], OSDs need to be replaced. Unlike Removing the
OSD, replaced OSD’s id and CRUSH map entry need to be keep [TYPO HERE?
keep -> kept] intact after the OSD is destroyed for replacement."

but
http://tracker.ceph.com/issues/22642 seems to say that it is not
possible to reuse an OSD's id


So I'm quite lost with an essential and very basic seemingly simple task
of storage management.

Thanks for any help here.

~Dietmar


[1]: http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/
[2]: http://docs.ceph.com/docs/luminous/man/8/ceph-volume/

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at




signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com