Re: [ceph-users] osd be marked down when recovering

2019-06-26 Thread zhanrzh...@teamsun.com.cn
Hello, Paul:
   Thanks for your help. The reason I did this in my test/dev environment is to 
prepare for my production cluster.
If nodown is set, what will happen when a client reads from or writes to an OSD 
that was previously marked down? How can I avoid problems there, or is there any 
documentation I can refer to?  Thanks!

 
From: Paul Emmerich
Date: 2019-06-26 19:31
To: zhanrzh...@teamsun.com.cn
CC: ceph-users
Subject: Re: [ceph-users] osd be marked down when recovering
Looks like it's overloaded and runs into a timeout. For a test/dev environment: 
try to set the nodown flag for this experiment if you just want to ignore these 
timeouts completely.
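
Setting and clearing the flag is a one-liner each way; a minimal sketch of wrapping 
the experiment with it:

ceph osd set nodown      # monitors stop marking OSDs down for the duration of the test
# ... increase pgp_num and let recovery run ...
ceph osd unset nodown    # restore normal failure handling afterwards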


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 1:26 PM zhanrzh...@teamsun.com.cn 
 wrote:
Hi, all:
I started a Ceph cluster on my machine in development mode to estimate the 
time of recovery after increasing pgp_num.
   All daemons run on one machine.
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
memory: 377GB
OS:CentOS Linux release 7.6.1810
ceph version:hammer

I built Ceph according to http://docs.ceph.com/docs/hammer/dev/quick_guide/,
ceph -s shows:
cluster 15ec2f3f-86e5-46bc-bf98-4b35841ee6a5
 health HEALTH_WARN
pool rbd pg_num 512 > pgp_num 256
 monmap e1: 1 mons at {a=172.30.250.25:6789/0}
election epoch 2, quorum 0 a
 osdmap e88: 30 osds: 30 up, 30 in
  pgmap v829: 512 pgs, 1 pools, 57812 MB data, 14454 objects
5691 GB used, 27791 GB / 33483 GB avail
 512 active+clean
and ceph osd tree [3].
Recovery started after I increased pgp_num. ceph -w says there are some 
osds down, but the process is running. All configuration items of osd and mon are 
at their defaults [1].
Some messages that ceph -w [2] reports are below:

2019-06-26 15:03:21.839750 mon.0 [INF] pgmap v842: 512 pgs: 127 
active+degraded, 84 activating+degraded, 256 active+clean, 45 
active+recovering+degraded; 57812 MB data, 5714 GB used, 27769 GB / 33483 GB 
avail; 22200/43362 objects degraded (51.197%); 50789 kB/s, 12 objects/s 
recovering
2019-06-26 15:03:21.840884 mon.0 [INF] osd.1 172.30.250.25:6804/22500 failed (3 
reports from 3 peers after 24.867116 >= grace 20.00)
2019-06-26 15:03:21.841459 mon.0 [INF] osd.9 172.30.250.25:6836/25078 failed (3 
reports from 3 peers after 24.867645 >= grace 20.00)
2019-06-26 15:03:21.841709 mon.0 [INF] osd.0 172.30.250.25:6800/22260 failed (3 
reports from 3 peers after 24.846423 >= grace 20.00)
2019-06-26 15:03:21.842286 mon.0 [INF] osd.13 172.30.250.25:6852/26651 failed 
(3 reports from 3 peers after 24.846896 >= grace 20.00)
2019-06-26 15:03:21.842607 mon.0 [INF] osd.5 172.30.250.25:6820/23661 failed (3 
reports from 3 peers after 24.804869 >= grace 20.00)
2019-06-26 15:03:21.842938 mon.0 [INF] osd.10 172.30.250.25:6840/25490 failed 
(3 reports from 3 peers after 24.805155 >= grace 20.00)
2019-06-26 15:03:21.843134 mon.0 [INF] osd.12 172.30.250.25:6848/26277 failed 
(3 reports from 3 peers after 24.805329 >= grace 20.00)
2019-06-26 15:03:21.843591 mon.0 [INF] osd.8 172.30.250.25:6832/24722 failed (3 
reports from 3 peers after 24.805843 >= grace 20.00)
2019-06-26 15:03:21.849664 mon.0 [INF] osd.21 172.30.250.25:6884/29762 failed 
(3 reports from 3 peers after 23.497080 >= grace 20.00)
2019-06-26 15:03:21.862729 mon.0 [INF] osd.14 172.30.250.25:6856/27025 failed 
(3 reports from 3 peers after 23.510172 >= grace 20.00)
2019-06-26 15:03:21.864222 mon.0 [INF] osdmap e91: 30 osds: 29 up, 30 in
2019-06-26 15:03:20.336758 osd.11 [WRN] map e91 wrongly marked me down
2019-06-26 15:03:23.408659 mon.0 [INF] pgmap v843: 512 pgs: 8 
stale+activating+degraded, 8 stale+active+clean, 161 active+degraded, 2 
stale+active+recovering+degraded, 33 activating+degraded, 248 active+clean, 45 
active+recovering+degraded, 7 stale+active+degraded; 57812 MB data, 5730 GB 
used, 27752 GB / 33483 GB avail; 27317/43362 objects degraded (62.998%); 61309 
kB/s, 14 objects/s recovering
2019-06-26 15:03:27.538229 mon.0 [INF] osd.18 172.30.250.25:6872/28632 failed 
(3 reports from 3 peers after 23.180489 >= grace 20.00)
2019-06-26 15:03:27.539416 mon.0 [INF] osd.7 172.30.250.25:6828/24366 failed (3 
reports from 3 peers after 21.900054 >= grace 20.00)
2019-06-26 15:03:27.541831 mon.0 [INF] osdmap e92: 30 osds: 19 up, 30 in
2019-06-26 15:03:32.748179 mon.0 [INF] osdmap e93: 30 osds: 17 up, 30 in
2019-06-26 15:03:33.678682 mon.0 [INF] pgmap v845: 512 pgs: 17 
stale+activating+degraded, 95 stale+active+clean, 55 active+degraded, 13 
peering, 18 stale+active+recovering+degraded, 20 activating+degraded, 155 
active+clean, 22 active+recovery_wait+degraded, 48 active+recovering+degraded, 
69 stale+active+degraded; 57812 MB data, 5734 GB used, 27748 GB / 33483 GB 
avail; 26979/43362 objects degraded (62.218%); 11510 kB/s, 2 objects/s recovering

[ceph-users] ceph ansible deploy lvm advanced

2019-06-26 Thread Fabio Abreu
Hi Everybody,

I am starting a new lab environment with ceph-ansible, BlueStore and the LVM
advanced deployment.

Which sizes are recommended for the data, WAL and DB LVs?

Has anyone configured this with the LVM advanced deployment?
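
For what it's worth, the 3/30/300 GB block.db sizing discussed in the rocksdb thread 
further down translates into LV creation roughly like this (a sketch only; the device 
name, VG name and per-host OSD count are assumptions, and the WAL is simply left 
inside the DB LV):

vgcreate db_vg /dev/nvme0n1          # assumed shared fast device
for i in {0..9}; do
    lvcreate -L 30G -n db${i} db_vg  # one ~30 GB block.db LV per OSD
done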

Regards,
Fabio



-- 
Atenciosamente,
Fabio Abreu Reis
http://fajlinux.com.br
*Tel : *+55 21 98244-0161
*Skype : *fabioabreureis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thoughts on rocksdb and erasurecode

2019-06-26 Thread Christian Wuerdig
Hm, according to https://tracker.ceph.com/issues/24025 snappy compression
should be available out of the box at least since luminous. What ceph
version are you running?
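
Related to the spillover question below: each OSD exposes bluefs counters showing how 
much of block.db is in use and whether anything has spilled to the slow device (a 
sketch, run on the OSD's host; osd.0 is just an example id and the counter names are 
as they appear in luminous-era perf dumps):

ceph daemon osd.0 perf dump bluefs | grep -E 'db_total_bytes|db_used_bytes|slow_used_bytes'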

On Wed, 26 Jun 2019 at 21:51, Rafał Wądołowski 
wrote:

> We changed these settings. Our config now is:
>
> bluestore_rocksdb_options =
> "compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=50331648,target_file_size_base=50331648,max_background_compactions=31,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=603979776,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8"
>
> It could be changed without redeploy. It changes the sst files when
> compaction is triggered. The additional improvement is Snappy compression.
> We rebuilt ceph with support for it. I can create a PR for it, if you want :)
>
>
> Best Regards,
>
> Rafał Wądołowski
> Cloud & Security Engineer
>
> On 25.06.2019 22:16, Christian Wuerdig wrote:
>
> The sizes are determined by rocksdb settings - some details can be found
> here: https://tracker.ceph.com/issues/24361
> One thing to note, in this thread
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030775.html
> it's noted that rocksdb could use up to 100% extra space during compaction
> so if you want to avoid spill over during compaction then safer values
> would be 6/60/600 GB
>
> You can change max_bytes_for_level_base and max_bytes_for_level_multiplier
> to suit your needs better but I'm not sure if that can be changed on the
> fly or if you have to re-create OSDs in order to make them apply
>
> On Tue, 25 Jun 2019 at 18:06, Rafał Wądołowski 
> wrote:
>
>> Why did you select these specific sizes? Are there any tests/research on
>> it?
>>
>>
>> Best Regards,
>>
>> Rafał Wądołowski
>>
>> On 24.06.2019 13:05, Konstantin Shalygin wrote:
>>
>> Hi
>>
>> Have been thinking a bit about rocksdb and EC pools:
>>
>> Since a RADOS object written to an EC(k+m) pool is split into several
>> minor pieces, then the OSD will receive many more smaller objects,
>> compared to the amount it would receive in a replicated setup.
>>
>> This must mean that rocksdb will also need to handle many more
>> entries, and will grow faster. This will have an impact when using
>> bluestore for slow HDD with DB on SSD drives, where the faster growing
>> rocksdb might result in spillover to slow store - if not taken into
>> consideration when designing the disk layout.
>>
>> Are my thoughts on the right track or am I missing something?
>>
>> Has somebody done any measurement on rocksdb growth, comparing replica
>> vs EC ?
>>
>> If you don't want to be affected by spillover of block.db, use a 3/30/300 GB
>> partition for your block.db.
>>
>>
>>
>> k
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Tech Talk tomorrow: Intro to Ceph

2019-06-26 Thread Sage Weil
Hi everyone,

Tomorrow's Ceph Tech Talk will be an updated "Intro to Ceph" talk by Sage 
Weil.  This will be based on a newly refreshed set of slides and provide a 
high-level introduction to the overall Ceph architecture, RGW, RBD, and 
CephFS.

Our plan is to follow up later this summer with complementary deep-dive 
talks on each of the major components: RGW, RBD, and CephFS to start.

You can join the talk live tomorrow June 27 at 1700 UTC (1PM ET) at

https://bluejeans.com/613110014/browser

As usual, the talk will be recorded and posted to the YouTube channel[1] 
as well.

Thanks! 
sage


[1] https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy osd create adds osds but weight is 0 and not adding hosts to CRUSH map

2019-06-26 Thread Hayashida, Mami
Please disregard the earlier message.  I found the culprit:
`osd_crush_update_on_start` was set to false.
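
For anyone hitting the same thing, a minimal sketch of the fix (the weight and host 
name below are taken from the original post; adjust to your layout):

# in ceph.conf on the OSD hosts, [osd] section, so OSDs place and weight themselves on start:
#   osd crush update on start = true
# then, for OSDs that already registered with weight 0, set location and weight by hand:
ceph osd crush create-or-move osd.0 3.67799 host=osd0 root=default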

*Mami Hayashida*
*Research Computing Associate*
Univ. of Kentucky ITS Research Computing Infrastructure



On Wed, Jun 26, 2019 at 11:37 AM Hayashida, Mami 
wrote:

> I am trying to build a Ceph cluster using ceph-deploy.  To add OSDs, I
> used the following command (which I had successfully used before to build
> another cluster):
>
> ceph-deploy osd create --block-db=ssd0/db0 --data=/dev/sdh  osd0
> ceph-deploy osd create --block-db=ssd0/db1 --data=/dev/sdi   osd0
> etc.
>
> Prior to running those commands, I did manually create LVs on /dev/sda for
> DB/WAL with:
>
> *** on osd0 node***
> sudo pvcreate /dev/sda
> sudo vgcreate ssd0 /dev/sda;
> for i in {0..9}; do
> sudo lvcreate -L 40G -n db${i} ssd0;
> done
> **
> But I just realized (after creating over 240 OSDs!) neither the host nor
> each osd weight was added to the CRUSH map as far as I can tell (expected
> weight for each osd is 3.67799):
>
> cephuser@admin_node:~$ ceph osd tree
> ID  CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF
>  -10 root default
>   0   hdd  0 osd.0up  1.0 1.0
>   1   hdd  0 osd.1up  1.0 1.0
> (... and so on)
>
> And checking the crush map with `ceph osd crush dump` also confirms that
> there are no host entries or weight (capacity) for each osd.  At the same
> time,
> `ceph -s` and the dashboard correctly shows ` usage: 9.7 TiB used, 877 TiB
> / 886 TiB avail` (correct number for all the OSDs added so far). In fact,
> the dashboard even correctly groups OSDs into correct hosts.
>
> One additional note: I have been able to create a test pool `ceph osd pool
> create mytest 8` but cannot create an object in the pool.
>
> I am running Ceph version mimic 13.2.6 which I installed using ceph-deploy
> version 2.0.1, with all servers running Ubuntu 18.04.2.
>
> Any help/advice is appreciated.
>
> *Mami Hayashida*
> *Research Computing Associate*
> Univ. of Kentucky ITS Research Computing Infrastructure
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy osd create adds osds but weight is 0 and not adding hosts to CRUSH map

2019-06-26 Thread Hayashida, Mami
I am trying to build a Ceph cluster using ceph-deploy.  To add OSDs, I used
the following command (which I had successfully used before to build
another cluster):

ceph-deploy osd create --block-db=ssd0/db0 --data=/dev/sdh  osd0
ceph-deploy osd create --block-db=ssd0/db1 --data=/dev/sdi   osd0
etc.

Prior to running those commands, I did manually create LVs on /dev/sda for
DB/WAL with:

*** on osd0 node***
sudo pvcreate /dev/sda
sudo vgcreate ssd0 /dev/sda;
for i in {0..9}; do
sudo lvcreate -L 40G -n db${i} ssd0;
done
**
But I just realized (after creating over 240 OSDs!) neither the host nor
each osd weight was added to the CRUSH map as far as I can tell (expected
weight for each osd is 3.67799):

cephuser@admin_node:~$ ceph osd tree
ID  CLASS WEIGHT TYPE NAMESTATUS REWEIGHT PRI-AFF
 -10 root default
  0   hdd  0 osd.0up  1.0 1.0
  1   hdd  0 osd.1up  1.0 1.0
(... and so on)

And checking the crush map with `ceph osd crush dump` also confirms that
there are no host entries or weight (capacity) for each osd.  At the same
time,
`ceph -s` and the dashboard correctly shows ` usage: 9.7 TiB used, 877 TiB
/ 886 TiB avail` (correct number for all the OSDs added so far). In fact,
the dashboard even correctly groups OSDs into correct hosts.

One additional note: I have been able to create a test pool `ceph osd pool
create mytest 8` but cannot create an object in the pool.

I am running Ceph version mimic 13.2.6 which I installed using ceph-deploy
version 2.0.1, with all servers running Ubuntu 18.04.2.

Any help/advice is appreciated.

*Mami Hayashida*
*Research Computing Associate*
Univ. of Kentucky ITS Research Computing Infrastructure
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Lars Marowsky-Bree
On 2019-06-26T14:45:31, Sage Weil  wrote:

Hi Sage,

I think that makes sense. I'd have preferred the Oct/Nov target, but
that'd have made Octopus quite short.

Unsure whether freezing in December with a release in March is too long
though. But given how much people scramble, setting that as a goal
probably will help with stabilization.

I'm also hoping that one day, we can move towards a more agile
continuous integration model (like the Linux kernel does) instead of
massive yearly forklifts. But hey, that's just me ;-)



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-26 Thread Gregory Farnum
Awesome. I made a ticket and pinged the Bluestore guys about it:
http://tracker.ceph.com/issues/40557
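
For reference, the manual compaction Tom describes below can be done offline with
ceph-kvstore-tool (a sketch; the OSD id and data path are examples, and the OSD must
be stopped while the tool runs):

systemctl stop ceph-osd@12
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
systemctl start ceph-osd@12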

On Tue, Jun 25, 2019 at 1:52 AM Thomas Byrne - UKRI STFC
 wrote:
>
> I hadn't tried manual compaction, but it did the trick. The db shrunk down to 
> 50MB and the OSD booted instantly. Thanks!
>
> I'm confused as to why the OSDs weren't doing this themselves, especially as 
> the operation only took a few seconds. But for now I'm happy that this is 
> easy to rectify if we run into it again.
>
> I've uploaded the log of a slow boot with debug_bluestore turned up [1], and 
> I can provide other logs/files if anyone thinks they could be useful.
>
> Cheers,
> Tom
>
> [1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446
>
> -Original Message-
> From: Gregory Farnum 
> Sent: 24 June 2019 17:30
> To: Byrne, Thomas (STFC,RAL,SC) 
> Cc: ceph-users 
> Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
> 'clear_temp_objects', even with fresh PGs
>
> On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC 
>  wrote:
> >
> > Hi all,
> >
> >
> >
> > Some bluestore OSDs in our Luminous test cluster have started becoming 
> > unresponsive and booting very slowly.
> >
> >
> >
> > These OSDs have been used for stress testing for hardware destined for our 
> > production cluster, so have had a number of pools on them with many, many 
> > objects in the past. All these pools have since been deleted.
> >
> >
> >
> > When booting the OSDs, they spend a few minutes *per PG* in 
> > clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> > hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> > read and all available IOPS consumed. The OSD will finish booting and come 
> > up fine, but will then start hammering the disk again and fall over at some 
> > point later, causing the cluster to gradually fall apart. I'm guessing 
> > something is 'not optimal' in the rocksDB.
> >
> >
> >
> > Deleting all pools will stop this behaviour and OSDs without PGs will 
> > reboot quickly and stay up, but creating a pool will cause OSDs that get 
> > even a single PG to start exhibiting this behaviour again.
> >
> >
> >
> > These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are 
> > ~1yr old. Upgrading to 12.2.12 did not change this behaviour. A blueFS 
> > export of a problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 
> > 63.80 KB, L1 - 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems 
> > excessive for an empty OSD, but it's also the first time I've looked into 
> > this so may be normal?
> >
> >
> >
> > Destroying and recreating an OSD resolves the issue for that OSD, which is 
> > acceptable for this cluster, but I'm a little concerned a similar thing 
> > could happen on a production cluster. Ideally, I would like to try and 
> > understand what has happened before recreating the problematic OSDs.
> >
> >
> >
> > Has anyone got any thoughts on what might have happened, or tips on how to 
> > dig further into this?
>
> Have you tried a manual compaction? The only other time I see this being 
> reported was for FileStore-on-ZFS and it was just very slow at metadata 
> scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme 
> OSD Boot Time") There has been at least one PR about object listings being 
> slow in BlueStore when there are a lot of deleted objects, which would match 
> up with your many deleted pools/objects.
>
> If you have any debug logs the BlueStore devs might be interested in them to 
> check if the most recent patches will fix it.
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Bob Farrell
March seems sensible to me for the reasons you stated. If a release gets
delayed, I'd prefer it to be on the spring side of Christmas (again for the
reasons already mentioned).

That aside, I'm now very impatient to install Octopus on my 8-node cluster.
: )

On Wed, 26 Jun 2019 at 15:46, Sage Weil  wrote:

> Hi everyone,
>
> We talked a bit about this during the CLT meeting this morning.  How about
> the following proposal:
>
> - Target release date of Mar 1 each year.
> - Target freeze in Dec.  That will allow us to use the holidays to do a
>   lot of testing when the lab infrastructure tends to be somewhat idle.
>
> If we get an early build out at the point of the freeze (or even earlier),
> perhaps this captures some of the time that the retailers have during their
> lockdown to identify structural issues with the release.  It is probably
> better to do more of this testing at this point in the cycle so that we
> have time to properly fix any big issues (like performance or scaling
> regressions).  It is of course a challenge to motivate testing on
> something that is too far from the final release, but we can try.
>
> This avoids an abbreviated octopus cycle, and avoids placing August (which
> also often has people out for vacations) right in the middle of the
> lead-up to the freeze.
>
> Thoughts?
> sage
>
>
>
> On Wed, 26 Jun 2019, Sage Weil wrote:
>
> > On Wed, 26 Jun 2019, Alfonso Martinez Hidalgo wrote:
> > > I think March is a good idea.
> >
> > Spring had a slight edge over fall in the twitter poll (for whatever
> > that's worth).  I see the appeal for fall when it comes to down time
> for
> > retailers, but as a practical matter for Octopus specifically, a target
> of
> > say October means freezing in August, which means we only have 2
> > more months of development time.  I'm worried that will turn Octopus
> > into another weak (aka lightly adopted) release.
> >
> > March would mean freezing in January again, which would give us July to
> > Dec... 6 more months.
> >
> > sage
> >
> >
> >
> > >
> > > On Tue, Jun 25, 2019 at 4:32 PM Alfredo Deza  wrote:
> > >
> > > > On Mon, Jun 17, 2019 at 4:09 PM David Turner 
> > > > wrote:
> > > > >
> > > > > This was a little long to respond with on Twitter, so I thought I'd
> > > > share my thoughts here. I love the idea of a 12 month cadence. I like
> > > > October because admins aren't upgrading production within the first
> few
> > > > months of a new release. It gives it plenty of time to be stable for
> the OS
> > > > distros as well as giving admins something low-key to work on over
> the
> > > > holidays with testing the new releases in stage/QA.
> > > >
> > > > October sounds ideal, but in reality, we haven't been able to release
> > > > right on time as long as I can remember. Realistically, if we set
> > > > October, we are probably going to get into November/December.
> > > >
> > > > For example, Nautilus was set to release in February and we got it
> out
> > > > late in late March (Almost April)
> > > >
> > > > Would love to see more of a discussion around solving the problem of
> > > > releasing when we say we are going to - so that we can then choose
> > > > what the cadence is.
> > > >
> > > > >
> > > > > On Mon, Jun 17, 2019 at 12:22 PM Sage Weil 
> wrote:
> > > > >>
> > > > >> On Wed, 5 Jun 2019, Sage Weil wrote:
> > > > >> > That brings us to an important decision: what time of year
> should we
> > > > >> > release?  Once we pick the timing, we'll be releasing at that
> time
> > > > *every
> > > > >> > year* for each release (barring another schedule shift, which
> we want
> > > > to
> > > > >> > avoid), so let's choose carefully!
> > > > >>
> > > > >> I've put up a twitter poll:
> > > > >>
> > > > >> https://twitter.com/liewegas/status/1140655233430970369
> > > > >>
> > > > >> Thanks!
> > > > >> sage
> > > > >> ___
> > > > >> ceph-users mailing list
> > > > >> ceph-users@lists.ceph.com
> > > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > >
> > >
> > > --
> > >
> > > Alfonso Martínez
> > >
> > > Senior Software Engineer, Ceph Storage
> > >
> > > Red Hat 
> > > 
> > > ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
Hi everyone,

We talked a bit about this during the CLT meeting this morning.  How about 
the following proposal:

- Target release date of Mar 1 each year.
- Target freeze in Dec.  That will allow us to use the holidays to do a 
  lot of testing when the lab infrastructure tends to be somewhat idle.

If we get an early build out at the point of the freeze (or even earlier), 
perhaps this captures some of the time that the retailers have during their 
lockdown to identify structural issues with the release.  It is probably 
better to do more of this testing at this point in the cycle so that we 
have time to properly fix any big issues (like performance or scaling 
regressions).  It is of course a challenge to motivate testing on 
something that is too far from the final release, but we can try.

This avoids an abbreviated octopus cycle, and avoids placing August (which 
also often has people out for vacations) right in the middle of the 
lead-up to the freeze.

Thoughts?
sage



On Wed, 26 Jun 2019, Sage Weil wrote:

> On Wed, 26 Jun 2019, Alfonso Martinez Hidalgo wrote:
> > I think March is a good idea.
> 
> Spring had a slight edge over fall in the twitter poll (for whatever 
> that's worth).  I see the appeal for fall when it comes to down time for  
> retailers, but as a practical matter for Octopus specifically, a target of
> say October means freezing in August, which means we only have 2
> more months of development time.  I'm worried that will turn Octopus 
> into another weak (aka lightly adopted) release.
> 
> March would mean freezing in January again, which would give us July to 
> Dec... 6 more months.
> 
> sage
> 
> 
> 
> > 
> > On Tue, Jun 25, 2019 at 4:32 PM Alfredo Deza  wrote:
> > 
> > > On Mon, Jun 17, 2019 at 4:09 PM David Turner 
> > > wrote:
> > > >
> > > > This was a little long to respond with on Twitter, so I thought I'd
> > > share my thoughts here. I love the idea of a 12 month cadence. I like
> > > October because admins aren't upgrading production within the first few
> > > months of a new release. It gives it plenty of time to be stable for the 
> > > OS
> > > distros as well as giving admins something low-key to work on over the
> > > holidays with testing the new releases in stage/QA.
> > >
> > > October sounds ideal, but in reality, we haven't been able to release
> > > right on time as long as I can remember. Realistically, if we set
> > > October, we are probably going to get into November/December.
> > >
> > > For example, Nautilus was set to release in February and we got it out
> > > late in late March (Almost April)
> > >
> > > Would love to see more of a discussion around solving the problem of
> > > releasing when we say we are going to - so that we can then choose
> > > what the cadence is.
> > >
> > > >
> > > > On Mon, Jun 17, 2019 at 12:22 PM Sage Weil  wrote:
> > > >>
> > > >> On Wed, 5 Jun 2019, Sage Weil wrote:
> > > >> > That brings us to an important decision: what time of year should we
> > > >> > release?  Once we pick the timing, we'll be releasing at that time
> > > *every
> > > >> > year* for each release (barring another schedule shift, which we want
> > > to
> > > >> > avoid), so let's choose carefully!
> > > >>
> > > >> I've put up a twitter poll:
> > > >>
> > > >> https://twitter.com/liewegas/status/1140655233430970369
> > > >>
> > > >> Thanks!
> > > >> sage
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > 
> > 
> > -- 
> > 
> > Alfonso Martínez
> > 
> > Senior Software Engineer, Ceph Storage
> > 
> > Red Hat 
> > 
> > ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RocksDB with SSD journal 3/30/300 rule

2019-06-26 Thread Robert Ruge
G'Day everyone.

I'm about to try my first OSDs with a split data drive and journal on an SSD, 
using some Intel S3500 600GB SSDs I have spare from a previous project. Now I 
would like to make sure that the 300GB journal fits, but my question is whether 
that 300GB is 300 * 1000 or 300 * 1024. The reason is that I would like to 
partition the SSD into two to support two OSDs; however, if it is 1024-based then it 
won't fit on the 600GB disk.
Interestingly, parted tells me I have two 300GB partitions while fdisk tells me 
I only have 279.5G partitions.
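
The discrepancy is just decimal GB versus binary GiB; a quick check of the numbers
from the fdisk output below:

echo $(( 300 * 1000**3 ))      # 300000000000  -> parted's "300GB" (decimal)
echo $(( 300 * 1024**3 ))      # 322122547200  -> 300 GiB (binary)
echo $(( 600127266816 / 2 ))   # 300063633408  -> each half of the disk: ~300 GB,
                               #    but only ~279.5 GiB, which is what fdisk prints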

fdisk /dev/sdb
Command (m for help): p
Disk /dev/sdb: 558.9 GiB, 600127266816 bytes, 1172123568 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 4E3D8587-C8D8-2941-CC38-B8EB7C66264D

Device StartEnd   Sectors   Size Type
/dev/sdb1   2048  586061823 586059776 279.5G Linux filesystem
/dev/sdb2  586061824 1172121599 586059776 279.5G Linux filesystem

Command (m for help): q

parted /dev/sdb
GNU Parted 3.2
Using /dev/sdb
(parted) p
Model: ATA INTEL SSDSC2BB60 (scsi)
Disk /dev/sdb: 600GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   EndSize   File system  Name Flags
1  1049kB  300GB  300GB   primary
2  300GB   600GB  300GB   primary

Thanks.

Regards
Robert Ruge


Important Notice: The contents of this email are intended solely for the named 
addressee and are confidential; any unauthorised use, reproduction or storage 
of the contents is expressly prohibited. If you have received this email in 
error, please delete it and any attachments immediately and advise the sender 
by return email or telephone.

Deakin University does not warrant that this email and any attachments are 
error or virus free.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
On Wed, 26 Jun 2019, Alfonso Martinez Hidalgo wrote:
> I think March is a good idea.

Spring had a slight edge over fall in the twitter poll (for whatever 
that's worth).  I see the appeal for fall when it comes to down time for  
retailers, but as a practical matter for Octopus specifically, a target of
say October means freezing in August, which means we only have 2
more months of development time.  I'm worried that will turn Octopus 
into another weak (aka lightly adopted) release.

March would mean freezing in January again, which would give us July to 
Dec... 6 more months.

sage



> 
> On Tue, Jun 25, 2019 at 4:32 PM Alfredo Deza  wrote:
> 
> > On Mon, Jun 17, 2019 at 4:09 PM David Turner 
> > wrote:
> > >
> > > This was a little long to respond with on Twitter, so I thought I'd
> > share my thoughts here. I love the idea of a 12 month cadence. I like
> > October because admins aren't upgrading production within the first few
> > months of a new release. It gives it plenty of time to be stable for the OS
> > distros as well as giving admins something low-key to work on over the
> > holidays with testing the new releases in stage/QA.
> >
> > October sounds ideal, but in reality, we haven't been able to release
> > right on time as long as I can remember. Realistically, if we set
> > October, we are probably going to get into November/December.
> >
> > For example, Nautilus was set to release in February and we got it out
> > late in late March (Almost April)
> >
> > Would love to see more of a discussion around solving the problem of
> > releasing when we say we are going to - so that we can then choose
> > what the cadence is.
> >
> > >
> > > On Mon, Jun 17, 2019 at 12:22 PM Sage Weil  wrote:
> > >>
> > >> On Wed, 5 Jun 2019, Sage Weil wrote:
> > >> > That brings us to an important decision: what time of year should we
> > >> > release?  Once we pick the timing, we'll be releasing at that time
> > *every
> > >> > year* for each release (barring another schedule shift, which we want
> > to
> > >> > avoid), so let's choose carefully!
> > >>
> > >> I've put up a twitter poll:
> > >>
> > >> https://twitter.com/liewegas/status/1140655233430970369
> > >>
> > >> Thanks!
> > >> sage
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> 
> -- 
> 
> Alfonso Martínez
> 
> Senior Software Engineer, Ceph Storage
> 
> Red Hat 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Sage Weil
On Tue, 25 Jun 2019, Alfredo Deza wrote:
> On Mon, Jun 17, 2019 at 4:09 PM David Turner  wrote:
> >
> > This was a little long to respond with on Twitter, so I thought I'd share 
> > my thoughts here. I love the idea of a 12 month cadence. I like October 
> > because admins aren't upgrading production within the first few months of a 
> > new release. It gives it plenty of time to be stable for the OS distros as 
> > well as giving admins something low-key to work on over the holidays with 
> > testing the new releases in stage/QA.
> 
> October sounds ideal, but in reality, we haven't been able to release
> right on time as long as I can remember. Realistically, if we set
> October, we are probably going to get into November/December.
> 
> For example, Nautilus was set to release in February and we got it out
> late in late March (Almost April)
> 
> Would love to see more of a discussion around solving the problem of
> releasing when we say we are going to - so that we can then choose
> what the cadence is.

I think the "on time" part is solvable.  We should just take the amount 
of time between the previous release's freeze date and the 
target release date and go with that.  It is a bit fuzzy because I left it 
up to the leads how they handle the freeze, but I think mid-January is 
about right (in reality we waited longer than that for lots of RADOS 
stuff).  v14.2.0 was Mar 18, so ~2 months.

The cadence is really separate from that, though: even if every release 
were 2 full months late, if we start with the same target it's still a 1 
year cycle.

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs incomplete

2019-06-26 Thread Paul Emmerich
Have you tried: ceph osd force-create-pg ?

If that doesn't work: use objectstore-tool on the OSD (while it's not
running) and use it to force mark the PG as complete. (Don't know the exact
command off the top of my head)

Caution: these are obviously really dangerous commands
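
A sketch of both options (the pg id 2.0 comes from the thread; the OSD id is an
assumption, the objectstore-tool run needs that OSD stopped, and both operations
discard whatever data the PG might still have):

ceph osd force-create-pg 2.0

# or, on the PG's primary OSD while it is stopped:
systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 2.0 --op mark-complete
systemctl start ceph-osd@12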



Paul



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 1:56 AM ☣Adam  wrote:

> How can I tell ceph to give up on "incomplete" PGs?
>
> I have 12 pgs which are "inactive, incomplete" that won't recover.  I
> think this is because in the past I have carelessly pulled disks too
> quickly without letting the system recover.  I suspect the disks that
> have the data for these are long gone.
>
> Whatever the reason, I want to fix it so I have a clean cluster even if
> that means losing data.
>
> I went through the "troubleshooting pgs" guide[1] which is excellent,
> but didn't get me to a fix.
>
> The output of `ceph pg 2.0 query` includes this:
> "recovery_state": [
> {
> "name": "Started/Primary/Peering/Incomplete",
> "enter_time": "2019-06-25 18:35:20.306634",
> "comment": "not enough complete instances of this PG"
> },
>
> I've already restarted all OSDs in various orders, and I changed min_size
> to 1 to see if that would allow them to get fixed, but no such luck.
> These pools are not erasure coded and I'm using the Luminous release.
>
> How can I tell ceph to give up on these PGs?  There's nothing identified
> as unfound, so mark_unfound_lost doesn't help.  I feel like `ceph osd
> lost` might be it, but at this point the OSD numbers have been reused
> for new disks, so I'd really like to limit the damage to the 12 PGs
> which are incomplete if possible.
>
> Thanks,
> Adam
>
> [1]
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd be marked down when recovering

2019-06-26 Thread Paul Emmerich
Looks like it's overloaded and runs into a timeout. For a test/dev
environment: try to set the nodown flag for this experiment if you just
want to ignore these timeouts completely.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 1:26 PM zhanrzh...@teamsun.com.cn <
zhanrzh...@teamsun.com.cn> wrote:

> Hi, all:
> I started a Ceph cluster on my machine in development mode to estimate
> the time of recovery after increasing pgp_num.
>    All daemons run on one machine.
> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
> memory: 377GB
> OS:CentOS Linux release 7.6.1810
> ceph version:hammer
>
> I built Ceph according to
> http://docs.ceph.com/docs/hammer/dev/quick_guide/,
> ceph -s shows:
> cluster 15ec2f3f-86e5-46bc-bf98-4b35841ee6a5
>  health HEALTH_WARN
> pool rbd pg_num 512 > pgp_num 256
>  monmap e1: 1 mons at {a=172.30.250.25:6789/0}
> election epoch 2, quorum 0 a
>  osdmap e88: 30 osds: 30 up, 30 in
>   pgmap v829: 512 pgs, 1 pools, 57812 MB data, 14454 objects
> 5691 GB used, 27791 GB / 33483 GB avail
>  512 active+clean
> and ceph osd tree [3].
> Recovery started after I increased pgp_num. ceph -w says there are
> some osds down, but the process is running. All configuration items of osd and
> mon are at their defaults [1].
> Some messages that ceph -w [2] reports are below:
>
> 2019-06-26 15:03:21.839750 mon.0 [INF] pgmap v842: 512 pgs: 127
> active+degraded, 84 activating+degraded, 256 active+clean, 45
> active+recovering+degraded; 57812 MB data, 5714 GB used, 27769 GB / 33483
> GB avail; 22200/43362 objects degraded (51.197%); 50789 kB/s, 12 objects/s
> recovering
> 2019-06-26 15:03:21.840884 mon.0 [INF] osd.1 172.30.250.25:6804/22500
> failed (3 reports from 3 peers after 24.867116 >= grace 20.00)
> 2019-06-26 15:03:21.841459 mon.0 [INF] osd.9 172.30.250.25:6836/25078
> failed (3 reports from 3 peers after 24.867645 >= grace 20.00)
> 2019-06-26 15:03:21.841709 mon.0 [INF] osd.0 172.30.250.25:6800/22260
> failed (3 reports from 3 peers after 24.846423 >= grace 20.00)
> 2019-06-26 15:03:21.842286 mon.0 [INF] osd.13 172.30.250.25:6852/26651
> failed (3 reports from 3 peers after 24.846896 >= grace 20.00)
> 2019-06-26 15:03:21.842607 mon.0 [INF] osd.5 172.30.250.25:6820/23661
> failed (3 reports from 3 peers after 24.804869 >= grace 20.00)
> 2019-06-26 15:03:21.842938 mon.0 [INF] osd.10 172.30.250.25:6840/25490
> failed (3 reports from 3 peers after 24.805155 >= grace 20.00)
> 2019-06-26 15:03:21.843134 mon.0 [INF] osd.12 172.30.250.25:6848/26277
> failed (3 reports from 3 peers after 24.805329 >= grace 20.00)
> 2019-06-26 15:03:21.843591 mon.0 [INF] osd.8 172.30.250.25:6832/24722
> failed (3 reports from 3 peers after 24.805843 >= grace 20.00)
> 2019-06-26 15:03:21.849664 mon.0 [INF] osd.21 172.30.250.25:6884/29762
> failed (3 reports from 3 peers after 23.497080 >= grace 20.00)
> 2019-06-26 15:03:21.862729 mon.0 [INF] osd.14 172.30.250.25:6856/27025
> failed (3 reports from 3 peers after 23.510172 >= grace 20.00)
> 2019-06-26 15:03:21.864222 mon.0 [INF] osdmap e91: 30 osds: 29 up, 30 in
> 2019-06-26 15:03:20.336758 osd.11 [WRN] map e91 wrongly marked me down
> 2019-06-26 15:03:23.408659 mon.0 [INF] pgmap v843: 512 pgs: 8
> stale+activating+degraded, 8 stale+active+clean, 161 active+degraded, 2
> stale+active+recovering+degraded, 33 activating+degraded, 248 active+clean,
> 45 active+recovering+degraded, 7 stale+active+degraded; 57812 MB data, 5730
> GB used, 27752 GB / 33483 GB avail; 27317/43362 objects degraded (62.998%);
> 61309 kB/s, 14 objects/s recovering
> 2019-06-26 15:03:27.538229 mon.0 [INF] osd.18 172.30.250.25:6872/28632
> failed (3 reports from 3 peers after 23.180489 >= grace 20.00)
> 2019-06-26 15:03:27.539416 mon.0 [INF] osd.7 172.30.250.25:6828/24366
> failed (3 reports from 3 peers after 21.900054 >= grace 20.00)
> 2019-06-26 15:03:27.541831 mon.0 [INF] osdmap e92: 30 osds: 19 up, 30 in
> 2019-06-26 15:03:32.748179 mon.0 [INF] osdmap e93: 30 osds: 17 up, 30 in
> 2019-06-26 15:03:33.678682 mon.0 [INF] pgmap v845: 512 pgs: 17
> stale+activating+degraded, 95 stale+active+clean, 55 active+degraded, 13
> peering, 18 stale+active+recovering+degraded, 20 activating+degraded, 155
> active+clean, 22 active+recovery_wait+degraded, 48
> active+recovering+degraded, 69 stale+active+degraded; 57812 MB data, 5734
> GB used, 27748 GB / 33483 GB avail; 26979/43362 objects degraded (62.218%);
> 11510 kB/s, 2 objects/s recovering
> 2019-06-26 15:03:33.775701 osd.1 [WRN] map e92 wrongly marked me down
>
> Has anyone got any thoughts on what might have happened, or tips on how to
> dig further into this?
>
> [1] https://github.com/rongzhen-zhan/myfile/blob/master/osd.0.conf
> [2] 

Re: [ceph-users] ceph balancer - Some osds belong to multiple subtrees

2019-06-26 Thread Paul Emmerich
Device classes are implemented with magic invisible crush trees; you've got
two completely independent trees internally: one for crush rules mapping to
HDDs, and one for legacy crush rules not specifying a device class.

The balancer *should* be aware of this and ignore it, but I'm not sure
about the state of the balancer on Luminous. There were quite a few
problems in older versions, lots of them have been fixed in backports.

The upmap balancer is much better than the crush-compat balancer, but it
requires all clients to run Luminous or later.
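
For reference, switching to the upmap balancer looks roughly like this (a sketch; only
run the min-compat step once `ceph features` shows no pre-luminous clients connected):

ceph features                                   # check which client releases are connected
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on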


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jun 26, 2019 at 10:21 AM Wolfgang Lendl <
wolfgang.le...@meduniwien.ac.at> wrote:

> Hi,
>
> tried to enable the ceph balancer on a 12.2.12 cluster and got this:
>
> mgr[balancer] Some osds belong to multiple subtrees: [0, 1, 2, 3, 4, 5, 6, 7, 
> 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
> 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 
> 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 
> 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 
> 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 
> 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 
> 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 
> 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 
> 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 
> 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 
> 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 
> 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 
> 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 
> 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 
> 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 
> 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 
> 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 
> 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 
> 298, 299, 300, 301, 302, 303, 304, 305]
>
> I'm not aware of any additional subtree - maybe someone can enlighten me:
>
> ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "crush-compat"
> }
>
> ceph osd crush tree
> ID  CLASS WEIGHT (compat)  TYPE NAME
>  -1   3176.04785   root default
>  -7316.52490 316.52490 host node0
>   0   hdd9.09560   9.09560 osd.0
>   4   hdd9.09560   9.09560 osd.4
>   8   hdd9.09560   9.09560 osd.8
>  10   hdd9.09560   9.09560 osd.10
>  12   hdd9.09560   9.09560 osd.12
>  16   hdd9.09560   9.09560 osd.16
>  20   hdd9.09560   9.09560 osd.20
>  21   hdd9.09560   9.09560 osd.21
>  26   hdd9.09560   9.09560 osd.26
>  29   hdd9.09560   9.09560 osd.29
>  31   hdd9.09560   9.09560 osd.31
>  35   hdd9.09560   9.09560 osd.35
>  37   hdd9.09560   9.09560 osd.37
>  44   hdd9.09560   9.09560 osd.44
>  47   hdd9.09560   9.09560 osd.47
>  56   hdd9.09560   9.09560 osd.56
>  59   hdd9.09560   9.09560 osd.59
>  65   hdd9.09560   9.09560 osd.65
>  71   hdd9.09560   9.09560 osd.71
>  77   hdd9.09560   9.09560 osd.77
>  80   hdd9.09560   9.09560 osd.80
>  83   hdd9.09569   9.09569 osd.83
>  86   hdd9.09560   9.09560 osd.86
>  88   hdd9.09560   9.09560 osd.88
>  94   hdd   10.91409  10.91409 osd.94
>  95   hdd   10.91409  10.91409 osd.95
>  98   hdd   10.91409  10.91409 osd.98
>  99   hdd   10.91409  10.91409 osd.99
> 238   hdd9.09569   9.09569 osd.238
> 239   hdd9.09569   9.09569 osd.239
> 240   hdd9.09569   9.09569 osd.240
> 241   hdd9.09569   9.09569 osd.241
> 242   hdd9.09569   9.09569 osd.242
> 243   hdd9.09569   9.09569 osd.243
>  -3316.52426 316.52426 host node1
>   1   hdd9.09560   9.09560 osd.1
>   5   hdd9.09560   9.09560 osd.5
>   6   hdd9.09560   9.09560 osd.6
>  11   hdd9.09560   9.09560 osd.11
>  13   hdd9.09560   9.09560 osd.13
>  15   hdd9.09560   9.09560 osd.15
>  19   hdd9.09560   9.09560 osd.19
>  23   hdd9.09560   9.09560 osd.23
>  25   hdd9.09560   9.09560 osd.25
>  28   hdd9.09560   9.09560 osd.28
>  32   hdd9.09560   9.09560 osd.32
>  34   hdd

[ceph-users] osd be marked down when recovering

2019-06-26 Thread zhanrzh...@teamsun.com.cn
Hi, all:
I started a Ceph cluster on my machine in development mode to estimate the 
time of recovery after increasing pgp_num.
   All daemons run on one machine.
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
memory: 377GB
OS:CentOS Linux release 7.6.1810
ceph version:hammer

I built Ceph according to http://docs.ceph.com/docs/hammer/dev/quick_guide/,
ceph -s shows:
cluster 15ec2f3f-86e5-46bc-bf98-4b35841ee6a5
 health HEALTH_WARN
pool rbd pg_num 512 > pgp_num 256
 monmap e1: 1 mons at {a=172.30.250.25:6789/0}
election epoch 2, quorum 0 a
 osdmap e88: 30 osds: 30 up, 30 in
  pgmap v829: 512 pgs, 1 pools, 57812 MB data, 14454 objects
5691 GB used, 27791 GB / 33483 GB avail
 512 active+clean
and ceph osd tree [3].
Recovery started after I increased pgp_num. ceph -w says there are some 
osds down, but the process is running. All configuration items of osd and mon are 
at their defaults [1].
Some messages that ceph -w [2] reports are below:

2019-06-26 15:03:21.839750 mon.0 [INF] pgmap v842: 512 pgs: 127 
active+degraded, 84 activating+degraded, 256 active+clean, 45 
active+recovering+degraded; 57812 MB data, 5714 GB used, 27769 GB / 33483 GB 
avail; 22200/43362 objects degraded (51.197%); 50789 kB/s, 12 objects/s 
recovering
2019-06-26 15:03:21.840884 mon.0 [INF] osd.1 172.30.250.25:6804/22500 failed (3 
reports from 3 peers after 24.867116 >= grace 20.00)
2019-06-26 15:03:21.841459 mon.0 [INF] osd.9 172.30.250.25:6836/25078 failed (3 
reports from 3 peers after 24.867645 >= grace 20.00)
2019-06-26 15:03:21.841709 mon.0 [INF] osd.0 172.30.250.25:6800/22260 failed (3 
reports from 3 peers after 24.846423 >= grace 20.00)
2019-06-26 15:03:21.842286 mon.0 [INF] osd.13 172.30.250.25:6852/26651 failed 
(3 reports from 3 peers after 24.846896 >= grace 20.00)
2019-06-26 15:03:21.842607 mon.0 [INF] osd.5 172.30.250.25:6820/23661 failed (3 
reports from 3 peers after 24.804869 >= grace 20.00)
2019-06-26 15:03:21.842938 mon.0 [INF] osd.10 172.30.250.25:6840/25490 failed 
(3 reports from 3 peers after 24.805155 >= grace 20.00)
2019-06-26 15:03:21.843134 mon.0 [INF] osd.12 172.30.250.25:6848/26277 failed 
(3 reports from 3 peers after 24.805329 >= grace 20.00)
2019-06-26 15:03:21.843591 mon.0 [INF] osd.8 172.30.250.25:6832/24722 failed (3 
reports from 3 peers after 24.805843 >= grace 20.00)
2019-06-26 15:03:21.849664 mon.0 [INF] osd.21 172.30.250.25:6884/29762 failed 
(3 reports from 3 peers after 23.497080 >= grace 20.00)
2019-06-26 15:03:21.862729 mon.0 [INF] osd.14 172.30.250.25:6856/27025 failed 
(3 reports from 3 peers after 23.510172 >= grace 20.00)
2019-06-26 15:03:21.864222 mon.0 [INF] osdmap e91: 30 osds: 29 up, 30 in
2019-06-26 15:03:20.336758 osd.11 [WRN] map e91 wrongly marked me down
2019-06-26 15:03:23.408659 mon.0 [INF] pgmap v843: 512 pgs: 8 
stale+activating+degraded, 8 stale+active+clean, 161 active+degraded, 2 
stale+active+recovering+degraded, 33 activating+degraded, 248 active+clean, 45 
active+recovering+degraded, 7 stale+active+degraded; 57812 MB data, 5730 GB 
used, 27752 GB / 33483 GB avail; 27317/43362 objects degraded (62.998%); 61309 
kB/s, 14 objects/s recovering
2019-06-26 15:03:27.538229 mon.0 [INF] osd.18 172.30.250.25:6872/28632 failed 
(3 reports from 3 peers after 23.180489 >= grace 20.00)
2019-06-26 15:03:27.539416 mon.0 [INF] osd.7 172.30.250.25:6828/24366 failed (3 
reports from 3 peers after 21.900054 >= grace 20.00)
2019-06-26 15:03:27.541831 mon.0 [INF] osdmap e92: 30 osds: 19 up, 30 in
2019-06-26 15:03:32.748179 mon.0 [INF] osdmap e93: 30 osds: 17 up, 30 in
2019-06-26 15:03:33.678682 mon.0 [INF] pgmap v845: 512 pgs: 17 
stale+activating+degraded, 95 stale+active+clean, 55 active+degraded, 13 
peering, 18 stale+active+recovering+degraded, 20 activating+degraded, 155 
active+clean, 22 active+recovery_wait+degraded, 48 active+recovering+degraded, 
69 stale+active+degraded; 57812 MB data, 5734 GB used, 27748 GB / 33483 GB 
avail; 26979/43362 objects degraded (62.218%); 11510 kB/s, 2 objects/s 
recovering
2019-06-26 15:03:33.775701 osd.1 [WRN] map e92 wrongly marked me down

Has anyone got any thoughts on what might have happened, or tips on how to dig 
further into this? 

[1] https://github.com/rongzhen-zhan/myfile/blob/master/osd.0.conf
[2] https://github.com/rongzhen-zhan/myfile/blob/master/ceph-watch.txt
[3] https://github.com/rongzhen-zhan/myfile/blob/master/ceph%20osd%20tree



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] show-prediction-config - no valid command found?

2019-06-26 Thread Nigel Williams
Have I missed a step? Diskprediction module is not working for me.

root@cnx-11:/var/log/ceph# ceph device show-prediction-config
no valid command found; 10 closest matches:

root@cnx-11:/var/log/ceph# ceph mgr module ls
{
"enabled_modules": [
"dashboard",
"diskprediction_cloud",
"iostat",
"pg_autoscaler",
"prometheus",
"restful"
],...

root@cnx-11:/var/log/ceph# ceph -v
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)

One other failure I get is with:
ceph device get-health-metrics INTEL_SSDPE2KE020T7_BTLE74200D8J2P0DGN
...
"nvme_vendor": "intel",
"dev": "/dev/nvme0n1",
"error": "smartctl returned invalid JSON"
...
with smartmontools 7.1.
Running smartctl directly against the device with JSON output parses ok
(checked with an online parser).
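
For reference, the direct check looks like this (a sketch; the --json flag needs
smartmontools >= 7.0, and the mgr module may invoke smartctl with different options):

smartctl -a --json /dev/nvme0n1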
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thoughts on rocksdb and erasurecode

2019-06-26 Thread Rafał Wądołowski
We changed these settings. Our config now is:

bluestore_rocksdb_options =
"compression=kSnappyCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=50331648,target_file_size_base=50331648,max_background_compactions=31,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=603979776,max_bytes_for_level_multiplier=10,compaction_threads=32,flusher_threads=8"

It could be changed without redeploy. It changes the sst files when
compaction is triggered. The additional improvement is Snappy
compression. We rebuilt ceph with support for it. I can create a PR for
it, if you want :)


Best Regards,

Rafał Wądołowski
Cloud & Security Engineer

On 25.06.2019 22:16, Christian Wuerdig wrote:
> The sizes are determined by rocksdb settings - some details can be
> found here: https://tracker.ceph.com/issues/24361
> One thing to note, in this thread
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030775.html
> it's noted that rocksdb could use up to 100% extra space during
> compaction so if you want to avoid spill over during compaction then
> safer values would be 6/60/600 GB
>
> You can change max_bytes_for_level_base and
> max_bytes_for_level_multiplier to suit your needs better but I'm not
> sure if that can be changed on the fly or if you have to re-create
> OSDs in order to make them apply
>
> On Tue, 25 Jun 2019 at 18:06, Rafał Wądołowski
> mailto:rwadolow...@cloudferro.com>> wrote:
>
> Why did you select these specific sizes? Are there any
> tests/research on it?
>
>
> Best Regards,
>
> Rafał Wądołowski
>
> On 24.06.2019 13:05, Konstantin Shalygin wrote:
>>
>>> Hi
>>>
>>> Have been thinking a bit about rocksdb and EC pools:
>>>
>>> Since a RADOS object written to an EC(k+m) pool is split into several 
>>> minor pieces, then the OSD will receive many more smaller objects, 
>>> compared to the amount it would receive in a replicated setup.
>>>
>>> This must mean that rocksdb will also need to handle many more 
>>> entries, and will grow faster. This will have an impact when using 
>>> bluestore for slow HDD with DB on SSD drives, where the faster growing 
>>> rocksdb might result in spillover to slow store - if not taken into 
>>> consideration when designing the disk layout.
>>>
>>> Are my thoughts on the right track or am I missing something?
>>>
>>> Has somebody done any measurement on rocksdb growth, comparing replica 
>>> vs EC ?
>>
>> If you don't want to be affected by spillover of block.db, use a
>> 3/30/300 GB partition for your block.db.
>>
>>
>>
>> k
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph balancer - Some osds belong to multiple subtrees

2019-06-26 Thread Wolfgang Lendl

Hi,

tried to enable the ceph balancer on a 12.2.12 cluster and got this:

mgr[balancer] Some osds belong to multiple subtrees: [0, 1, 2, 3, 4, 5, 6, 7, 
8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 
106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 
122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 
138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 
154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 
170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 
186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 
202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 
218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 
234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 
250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 
266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 
282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 
298, 299, 300, 301, 302, 303, 304, 305]

I'm not aware of any additional subtree - maybe someone can enlighten me:

ceph balancer status
{
"active": true,
"plans": [],
"mode": "crush-compat"
}

ceph osd crush tree
ID  CLASS WEIGHT (compat)  TYPE NAME
 -1   3176.04785   root default
 -7316.52490 316.52490 host node0
  0   hdd9.09560   9.09560 osd.0
  4   hdd9.09560   9.09560 osd.4
  8   hdd9.09560   9.09560 osd.8
 10   hdd9.09560   9.09560 osd.10
 12   hdd9.09560   9.09560 osd.12
 16   hdd9.09560   9.09560 osd.16
 20   hdd9.09560   9.09560 osd.20
 21   hdd9.09560   9.09560 osd.21
 26   hdd9.09560   9.09560 osd.26
 29   hdd9.09560   9.09560 osd.29
 31   hdd9.09560   9.09560 osd.31
 35   hdd9.09560   9.09560 osd.35
 37   hdd9.09560   9.09560 osd.37
 44   hdd9.09560   9.09560 osd.44
 47   hdd9.09560   9.09560 osd.47
 56   hdd9.09560   9.09560 osd.56
 59   hdd9.09560   9.09560 osd.59
 65   hdd9.09560   9.09560 osd.65
 71   hdd9.09560   9.09560 osd.71
 77   hdd9.09560   9.09560 osd.77
 80   hdd9.09560   9.09560 osd.80
 83   hdd9.09569   9.09569 osd.83
 86   hdd9.09560   9.09560 osd.86
 88   hdd9.09560   9.09560 osd.88
 94   hdd   10.91409  10.91409 osd.94
 95   hdd   10.91409  10.91409 osd.95
 98   hdd   10.91409  10.91409 osd.98
 99   hdd   10.91409  10.91409 osd.99
238   hdd9.09569   9.09569 osd.238
239   hdd9.09569   9.09569 osd.239
240   hdd9.09569   9.09569 osd.240
241   hdd9.09569   9.09569 osd.241
242   hdd9.09569   9.09569 osd.242
243   hdd9.09569   9.09569 osd.243
 -3316.52426 316.52426 host node1
  1   hdd9.09560   9.09560 osd.1
  5   hdd9.09560   9.09560 osd.5
  6   hdd9.09560   9.09560 osd.6
 11   hdd9.09560   9.09560 osd.11
 13   hdd9.09560   9.09560 osd.13
 15   hdd9.09560   9.09560 osd.15
 19   hdd9.09560   9.09560 osd.19
 23   hdd9.09560   9.09560 osd.23
 25   hdd9.09560   9.09560 osd.25
 28   hdd9.09560   9.09560 osd.28
 32   hdd9.09560   9.09560 osd.32
 34   hdd9.09560   9.09560 osd.34
 38   hdd9.09560   9.09560 osd.38
 41   hdd9.09560   9.09560 osd.41
 43   hdd9.09560   9.09560 osd.43
 46   hdd9.09560   9.09560 osd.46
 49   hdd9.09560   9.09560 osd.49
 52   hdd9.09560   9.09560 osd.52
 55   hdd9.09560   9.09560 osd.55
 58   hdd9.09560   9.09560 osd.58
 61   hdd9.09560   9.09560 osd.61
 64   hdd9.09560   9.09560 osd.64
 67   hdd9.09560   9.09560 osd.67
 70   hdd9.09560   9.09560 osd.70
 73   hdd9.09560   9.09560 osd.73
 76   hdd9.09560   9.09560 osd.76
 79   hdd9.09560   9.09560 osd.79
 81   hdd9.09560   9.09560 osd.81
 85   hdd9.09560   9.09560 osd.85
 89   hdd9.09560   9.09560 osd.89
 90   hdd   10.91409  10.91409 osd.90
 91   hdd   10.91409  10.91409 osd.91
 96   hdd   10.91409  10.91409