Re: [ceph-users] Help understanding EC object reads

2019-09-16 Thread Thomas Byrne - UKRI STFC
Thanks for responding!

It's good to hear that the primary OSD has some smarts when dealing with 
partial reads, and that seems to line up with what I was seeing, i.e. I would 
have expected drastically worse performance otherwise with our large object 
sizes and tiny block sizes.

I'm still seeing some performance degradation with the small block sizes, 
but I guess that is coming from the inefficiencies of lots of small requests 
(time spent queuing for the PG, etc.), rather than anything related to EC.
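
For my own sanity, here's a rough sketch of the stripe arithmetic with made-up 
numbers (a hypothetical k=8 profile with a 4 MiB stripe_width, i.e. 512 KiB per 
data chunk per stripe):

  # hypothetical figures: 4 MiB stripe_width on a k=8 EC profile
  stripe_width=$((4 * 1024 * 1024))

  # a 64 KiB read at offset 10 MiB...
  offset=$((10 * 1024 * 1024))
  length=$((64 * 1024))

  first_stripe=$(( offset / stripe_width ))
  last_stripe=$(( (offset + length - 1) / stripe_width ))

  # ...touches only stripe 2 (object offsets 8-12 MiB), so only that stripe's
  # chunks need to be read, not the whole object
  echo "read touches stripes ${first_stripe}..${last_stripe}"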

Cheers,
Tom

> -Original Message-
> From: Gregory Farnum 
> Sent: 09 September 2019 23:25
> To: Byrne, Thomas (STFC,RAL,SC) 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Help understanding EC object reads
> 
> On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC
>  wrote:
> >
> > Hi all,
> >
> > I'm investigating an issue with the (non-Ceph) caching layers of our large 
> > EC
> cluster. They seem to be turning users' requests for whole objects into lots of
> small byte-range requests by the time they reach the OSDs, but I'm not sure how
> inefficient this behaviour is in reality.
> >
> > My limited understanding of an EC object partial read is that the entire
> object is reconstructed on the primary OSD, and then the requested byte
> range is sent to the client before the primary discards the reconstructed
> object.
> 
> Ah, it's not necessarily the entire object is reconstructed, but that any 
> stripes
> covering the requested range are reconstructed. It's changed a bit over time
> and there are some knobs controlling it, but I believe this is generally
> efficient — if you ask for a byte range which simply lives on the primary, 
> it's
> not going to talk to the other OSDs to provide that data.
> 
> >
> > Assuming this is correct, do multiple reads for different byte ranges of the
> same object at effectively the same time result in the entire object being
> reconstructed once for each request, or does the primary do something
> clever and use the same reconstructed object for multiple requests before
> discarding it?
> 
> I'm pretty sure it's per-request; the EC pool code generally assumes you have
> another cache on top of RADOS that deals with combining these requests.
> There is a small cache in the OSD but IIRC it's just for keeping stuff 
> consistent
> while writes are in progress.
> -Greg
> 
> >
> > If I’m completely off the mark with what is going on under the hood here, a
> nudge in the right direction would be appreciated!
> >
> >
> >
> > Cheers,
> >
> > Tom
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help understanding EC object reads

2019-08-29 Thread Thomas Byrne - UKRI STFC
Hi all,

I'm investigating an issue with the (non-Ceph) caching layers of our large EC 
cluster. They seem to be turning users' requests for whole objects into lots of 
small byte-range requests by the time they reach the OSDs, but I'm not sure how 
inefficient this behaviour is in reality.

My limited understanding of an EC object partial read is that the entire object 
is reconstructed on the primary OSD, and then the requested byte range is sent 
to the client before the primary discards the reconstructed object.

Assuming this is correct, do multiple reads for different byte ranges of the 
same object at effectively the same time result in the entire object being 
reconstructed once for each request, or does the primary do something clever 
and use the same reconstructed object for multiple requests before discarding 
it?

If I'm completely off the mark with what is going on under the hood here, a 
nudge in the right direction would be appreciated!

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub start-time and end-time

2019-08-14 Thread Thomas Byrne - UKRI STFC
Hi Torben,

> Is it allowed to have the scrub period cross midnight ? eg have start time at 
> 22:00 and end time 07:00 next morning.

Yes, I think that's the way it is mostly used, primarily to reduce the scrub 
impact during waking/working hours.

> I assume that if you only configure the one of them - it still behaves as if 
> it is unconfigured ??

The begin and end hours default to 0 and 24 respectively, so setting just one of 
them still has an effect. E.g. setting the end hour to 6 will mean scrubbing runs 
from midnight to 6AM, and setting the start hour to 16 will run scrubs from 4PM 
to midnight.
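
For reference, a minimal sketch of setting these (option names as in Luminous; 
values are just examples):

  # persistently, in ceph.conf under [osd]:
  #   osd_scrub_begin_hour = 22
  #   osd_scrub_end_hour = 7
  # or injected at runtime on all OSDs:
  ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 7'

  # setting only the end hour also works, e.g. scrub from midnight to 06:00:
  ceph tell osd.* injectargs '--osd_scrub_end_hour 6'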

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Thomas Byrne - UKRI STFC
As a counterpoint, adding large amounts of new hardware in gradually (or more 
specifically in a few steps) has a few benefits IMO.

- Being able to pause the operation and confirm the new hardware (and cluster) 
is operating as expected. You can spot hardware problems with OSDs at 10% weight 
that would be much harder to notice during a full-weight backfill, and that 
could cause performance issues for the cluster if those OSDs ended up with their 
full complement of PGs.

- Breaking up long backfills. For a full cluster with large OSDs, backfills can 
take weeks. I find that letting the mon stores compact and getting the cluster 
back to HEALTH_OK is good for my sanity, and gives a good stopping point to work 
on other cluster issues. This obviously depends on the cluster fullness and OSD 
size.

I still aim for the smallest number of steps and the least work, but an initial 
crush weighting of 10-25% of the final weight is a good sanity check of the new 
hardware, and gives a good indication of how to approach the rest of the 
backfill.
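
As a rough sketch of what that can look like (example IDs and weights only, not 
a recipe):

  # have new OSDs come in with zero CRUSH weight (ceph.conf on the new hosts):
  #   [osd]
  #   osd_crush_initial_weight = 0

  # first step: ~10-25% of the final weight for each new OSD
  for id in 800 801 802; do
      ceph osd crush reweight osd.$id 1.0   # e.g. final weight ~7.3 for an 8TB drive
  done

  # once that backfill finishes and the hardware looks healthy, step up again
  # (and repeat until the OSDs are at full weight):
  for id in 800 801 802; do
      ceph osd crush reweight osd.$id 3.5
  done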

Cheers,
Tom

From: ceph-users  On Behalf Of Paul Emmerich
Sent: 24 July 2019 20:06
To: Reed Dier 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to add 100 new OSDs...

+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really necessary in 
newer releases of Ceph.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:59 PM Reed Dier <reed.d...@focusvq.com> wrote:
Just chiming in to say that this too has been my preferred method for adding 
[large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.
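
A condensed sketch of that sequence (an outline with example throttle values, 
not a recipe):

  ceph osd set norebalance
  ceph osd set nobackfill

  # ...create/activate all the new OSDs here...

  # check the backfill throttles are where you want them:
  ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3'

  # wait for all PGs to peer, then:
  ceph osd unset nobackfill
  ceph osd unset norebalance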

Reed



That works, but with newer releases I've been doing this:

- Make sure cluster is HEALTH_OK
- Set the 'norebalance' flag (and usually nobackfill)
- Add all the OSDs
- Wait for the PGs to peer. I usually wait a few minutes
- Remove the norebalance and nobackfill flag
- Wait for HEALTH_OK

Wido

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-25 Thread Thomas Byrne - UKRI STFC
I hadn't tried manual compaction, but it did the trick. The db shrunk down to 
50MB and the OSD booted instantly. Thanks!

I'm confused as to why the OSDs weren't doing this themselves, especially as 
the operation only took a few seconds. But for now I'm happy that this is easy 
to rectify if we run into it again.
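
For anyone hitting this later: one way to do the offline compaction (a sketch 
only; the OSD must be stopped, and the ID/path here are examples):

  systemctl stop ceph-osd@123

  # compact the OSD's embedded rocksDB in place:
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-123 compact

  systemctl start ceph-osd@123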

I've uploaded the log of a slow boot with debug_bluestore turned up [1], and I 
can provide other logs/files if anyone thinks they could be useful.

Cheers,
Tom
 
[1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446

-Original Message-
From: Gregory Farnum  
Sent: 24 June 2019 17:30
To: Byrne, Thomas (STFC,RAL,SC) 
Cc: ceph-users 
Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC  
wrote:
>
> Hi all,
>
>
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing for hardware destined for our 
> production cluster, so have had a number of pools on them with many, many 
> objects in the past. All these pools have since been deleted.
>
>
>
> When booting the OSDs, they spend a few minutes *per PG* in 
> clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> read and all available IOPS consumed. The OSD will finish booting and come up 
> fine, but will then start hammering the disk again and fall over at some 
> point later, causing the cluster to gradually fall apart. I'm guessing 
> something is 'not optimal' in the rocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
> old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
> problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
> 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
> OSD, but it's also the first time I've looked into this so may be normal?
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I see this being 
reported was for FileStore-on-ZFS and it was just very slow at metadata 
scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme OSD 
Boot Time") There has been at least one PR about object listings being slow in 
BlueStore when there are a lot of deleted objects, which would match up with 
your many deleted pools/objects.

If you have any debug logs the BlueStore devs might be interested in them to 
check if the most recent patches will fix it.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Thomas Byrne - UKRI STFC
Hi all,



Some bluestore OSDs in our Luminous test cluster have started becoming 
unresponsive and booting very slowly.



These OSDs have been used for stress testing for hardware destined for our 
production cluster, so have had a number of pools on them with many, many 
objects in the past. All these pools have since been deleted.



When booting the OSDs, they spend a few minutes *per PG* in clear_temp_objects 
function, even for brand new, empty PGs. The OSD is hammering the disk during 
the clear_temp_objects, with a constant ~30MB/s read and all available IOPS 
consumed. The OSD will finish booting and come up fine, but will then start 
hammering the disk again and fall over at some point later, causing the cluster 
to gradually fall apart. I'm guessing something is 'not optimal' in the rocksDB.



Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
quickly and stay up, but creating a pool will cause OSDs that get even a single 
PG to start exhibiting this behaviour again.



These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
OSD, but it's also the first time I've looked into this so may be normal?



Destroying and recreating an OSD resolves the issue for that OSD, which is 
acceptable for this cluster, but I'm a little concerned a similar thing could 
happen on a production cluster. Ideally, I would like to try and understand 
what has happened before recreating the problematic OSDs.



Has anyone got any thoughts on what might have happened, or tips on how to dig 
further into this?
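
For reference, the sort of things I've been looking at so far (a rough sketch, 
with example OSD IDs/paths):

  # rocksDB/blueFS space usage from a running OSD's admin socket:
  ceph daemon osd.123 perf dump | jq '.bluefs'

  # capture a verbose boot by setting debug_bluestore = 20/20 in ceph.conf
  # for that OSD, then restarting it:
  systemctl restart ceph-osd@123

  # with the OSD stopped, export the blueFS files (the embedded rocksDB):
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-123 \
      --out-dir /tmp/osd.123-bluefs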


Cheers,
Tom

Tom Byrne
Storage System Administrator
Scientific Computing Department
Science and Technology Facilities Council
Rutherford Appleton Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-08 Thread Thomas Byrne - UKRI STFC
For what it's worth, I think the behaviour Pardhiv and Bryan are describing is 
not quite normal, and sounds similar to something we see on our large luminous 
cluster with elderly (created as jewel?) monitors. After large operations which 
result in the mon stores growing to 20GB+, leaving the cluster with all PGs 
active+clean for days/weeks will usually not result in compaction, and the 
store sizes will slowly grow. 

I've played around with restarting monitors with and without 
mon_compact_on_start set, and using 'ceph tell mon.[id] compact'. For this 
cluster, I found the most reliable way to trigger a compaction was to restart 
all monitor daemons, one at a time, *without* compact_on_start set. The stores 
rapidly compact down to ~1GB in a minute or less after the last mon restarts.
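
Roughly what that looks like for us (a sketch, with example mon IDs/paths):

  # check current store sizes on each monitor host:
  du -sh /var/lib/ceph/mon/ceph-*/store.db

  # rolling restart, one mon at a time, waiting for it to rejoin quorum
  # before moving on (and without mon_compact_on_start set):
  systemctl restart ceph-mon@mon1
  ceph quorum_status | jq -r '.quorum_names[]'

  # the explicit compact command, which was less reliable for us here:
  ceph tell mon.mon1 compact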

It's worth noting that occasionally (1 out of every 10 times, or fewer) the 
stores will compact without prompting after all PGs become active+clean. 

I haven't put much time into this as I am planning on reinstalling the monitors 
to get rocksDB mon stores. If the problem persists with the new monitors I'll 
have another look at it.

Cheers
Tom

> -Original Message-
> From: ceph-users  On Behalf Of Wido
> den Hollander
> Sent: 08 January 2019 08:28
> To: Pardhiv Karri ; Bryan Stillwell
> 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Is it possible to increase Ceph Mon store?
> 
> 
> 
> On 1/7/19 11:15 PM, Pardhiv Karri wrote:
> > Thank you Bryan, for the information. We have 816 OSDs of size 2TB each.
> > The mon store too big popped up when no rebalancing happened in that
> > month. It is slightly above the 15360 threshold around 15900 or 16100
> > and stayed there for more than a week. We ran the "ceph tell mon.[ID]
> > compact" to get it back earlier this week. Currently the mon store is
> > around 12G on each monitor. If it doesn't grow then I won't change the
> > value but if it grows and gives the warning then I will increase it
> > using "mon_data_size_warn".
> >
> 
> This is normal. The MONs will keep a history of OSDMaps if one or more PGs
> are not active+clean
> 
> They will trim after all the PGs are clean again, nothing to worry about.
> 
> You can increase the setting for the warning, but that will not shrink the
> database.
> 
> Just make sure your monitors have enough free space.
> 
> Wido
> 
> > Thanks,
> > Pardhiv Karri
> >
> >
> >
> > On Mon, Jan 7, 2019 at 1:55 PM Bryan Stillwell wrote:
> >
> > I believe the option you're looking for is mon_data_size_warn.  The
> > default is set to 16106127360.
> >
> >
> > I've found that sometimes the mons need a little help getting
> > started with trimming if you just completed a large expansion.
> > Earlier today I had a cluster where the mon's data directory was
> > over 40GB on all the mons.  When I restarted them one at a time with
> > 'mon_compact_on_start = true' set in the '[mon]' section of
> > ceph.conf, they stayed around 40GB in size.   However, when I was
> > about to hit send on an email to the list about this very topic, the
> > warning cleared up and now the data directory is now between 1-3GB
> > on each of the mons.  This was on a cluster with >1900 OSDs.
> >
> >
> > Bryan
> >
> >
> > *From:* ceph-users on behalf of Pardhiv Karri <meher4in...@gmail.com>
> > *Date:* Monday, January 7, 2019 at 11:08 AM
> > *To:* ceph-users
> > *Subject:* [ceph-users] Is it possible to increase Ceph Mon store?
> >
> > Hi,
> >
> > We have a large Ceph cluster (Hammer version). We recently saw its
> > mon store growing too big > 15GB on all 3 monitors without any
> > rebalancing happening for quite some time. We have compacted the DB
> > using "#ceph tell mon.[ID] compact" for now. But is there a way to
> > increase the size of the mon store to 32GB or something to avoid
> > getting the Ceph health to warning state due to Mon store growing
> > too big?
> >
> >
> > --
> > Thanks,
> > Pardhiv Karri
> > "Rise and Rise again until LAMBS become LIONS"
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Thomas Byrne - UKRI STFC
>   In previous versions of Ceph, I was able to determine which PGs had
> scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> provided that they were not already being scrubbed. In Luminous, the bad
> PG is not visible in "ceph --status" anywhere. Should I use something like
> "ceph health detail -f json-pretty" instead?

'ceph pg ls inconsistent' lists all inconsistent PGs.
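
If you want to drive repairs from cron, something along these lines could be a 
starting point (a sketch only; it assumes jq is available, and blindly repairing 
every inconsistency is not always the right call):

  #!/bin/bash
  # list inconsistent PGs per pool and ask Ceph to repair them
  for pool in $(ceph osd pool ls); do
      for pg in $(rados list-inconsistent-pg "$pool" | jq -r '.[]'); do
          echo "repairing inconsistent PG $pg (pool $pool)"
          ceph pg repair "$pg"
      done
  done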

>   Also, is it possible to configure Ceph to attempt repairing the bad PGs
> itself, as soon as the scrub fails? I run most of my OSDs on top of a bunch of
> old spinning disks, and a scrub error almost always means that there is a bad
> sector somewhere, which can easily be fixed by rewriting the lost data using
> "ceph pg repair".

I don't know of a good way to repair inconsistencies automatically from within 
Ceph. However, I seem to remember someone saying that with BlueStore OSDs, read 
errors are fixed automatically (by rewriting the unreadable replica/shard) when 
they are discovered during client reads, and that there was a potential plan to 
do the same when they are discovered during scrubbing. I can't remember the 
details (this was a while ago, at Cephalocon APAC), so I may be completely off 
the mark here.

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Thomas Byrne - UKRI STFC
I recently spent some time looking at this; I believe the 'summary' and 
'overall_status' sections are now deprecated. The 'status' and 'checks' fields 
are the ones to use now.

The 'status' field gives you the OK/WARN/ERR, but returning the most severe 
error condition from the 'checks' section is less trivial. AFAIK all 
health_warn states are treated as equally severe, and the same goes for 
health_err. We ended up formatting our single-line human-readable output as 
something like:

"HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 
large omap objects"

This makes it obvious which check is causing which state. We needed to suppress 
specific checks for callouts, so we had to look at each check and the resulting 
state. If you're not trying to do something similar, there may be a more 
lightweight way to go about it.
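
As a concrete sketch of pulling those two things out (field names as of 
Luminous; worth checking against your own 'ceph health detail -f json-pretty' 
output):

  # single machine-readable state: HEALTH_OK / HEALTH_WARN / HEALTH_ERR
  ceph health detail -f json | jq -r '.status'

  # one line per active check, severity first:
  ceph health detail -f json | jq -r \
      '.checks | to_entries[] | "\(.value.severity): \(.value.summary.message) (\(.key))"'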

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of Jan
> Kasprzak
> Sent: 02 January 2019 09:29
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] ceph health JSON format has changed sync?
> 
>   Hello, Ceph users,
> 
> I am afraid the following question is a FAQ, but I still was not able to find 
> the
> answer:
> 
> I use ceph --status --format=json-pretty as a source of CEPH status for my
> Nagios monitoring. After upgrading to Luminous, I see the following in the
> JSON output when the cluster is not healthy:
> 
> "summary": [
> {
> "severity": "HEALTH_WARN",
> "summary": "'ceph health' JSON format has changed in 
> luminous. If
> you see this your monitoring system is scraping the wrong fields. Disable this
> with 'mon health preluminous compat warning = false'"
> }
> ],
> 
> Apart from that, the JSON data seems reasonable. My question is which part
> of JSON structure are the "wrong fields" I have to avoid. Is it just the
> "summary" section, or some other parts as well? Or should I avoid the whole
> ceph --status and use something different instead?
> 
> What I want is a single machine-readable value with OK/WARNING/ERROR
> meaning, and a single human-readable text line, describing the most severe
> error condition which is currently present. What is the preferred way to get
> this data in Luminous?
> 
>   Thanks,
> 
> -Yenya
> 
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google  
> the
> symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2019-01-02 Thread Thomas Byrne - UKRI STFC
Assuming I understand it correctly:

"pg_upmap_items 6.0 [40,20]" refers to replacing (upmapping?) osd.40 with 
osd.20 in the acting set of the placement group '6.0'. Assuming it's a 3 
replica PG, the other two OSDs in the set remain unchanged from the CRUSH 
calculation.

"pg_upmap_items 6.6 [45,46,59,56]" describes two upmap replacements for the PG 
6.6, replacing 45 with 46, and 59 with 56.
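
A hypothetical worked example, in case it helps:

  #   pg_upmap_items 6.0 [40,20]        -> in PG 6.0, use osd.20 where CRUSH chose osd.40
  #   pg_upmap_items 6.6 [45,46,59,56]  -> in PG 6.6, osd.46 replaces osd.45,
  #                                        and osd.56 replaces osd.59

  # compare the mapping actually in use with the raw upmap entries:
  ceph pg map 6.6
  ceph osd dump | grep 'pg_upmap_items 6.6 '

  # an exception can be removed again with:
  ceph osd rm-pg-upmap-items 6.6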

Hope that helps.

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of
> jes...@krogh.cc
> Sent: 30 December 2018 22:04
> To: Konstantin Shalygin 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> 
> >> I would still like to have a log somewhere to grep and inspect what
> >> balancer/upmap actually does - when in my cluster. Or some ceph
> >> commands that deliveres some monitoring capabilityes .. any
> >> suggestions?
> > Yes, on ceph-mgr log, when log level is DEBUG.
> 
> Tried the docs .. something like:
> 
> ceph tell mds ... does not seem to work.
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
> 
> > You can get your cluster upmap's in via `ceph osd dump | grep upmap`.
> 
> Got it -- but I really need the README .. it shows the map ..
> ...
> pg_upmap_items 6.0 [40,20]
> pg_upmap_items 6.1 [59,57,47,48]
> pg_upmap_items 6.2 [59,55,75,9]
> pg_upmap_items 6.3 [22,13,40,39]
> pg_upmap_items 6.4 [23,9]
> pg_upmap_items 6.5 [25,17]
> pg_upmap_items 6.6 [45,46,59,56]
> pg_upmap_items 6.8 [60,54,16,68]
> pg_upmap_items 6.9 [61,69]
> pg_upmap_items 6.a [51,48]
> pg_upmap_items 6.b [43,71,41,29]
> pg_upmap_items 6.c [22,13]
> 
> ..
> 
> But .. I dont have any pg's that should only have 2 replicas.. neither any 
> with 4
> .. how should this be interpreted?
> 
> Thanks.
> 
> --
> Jesper
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-21 Thread Thomas Byrne - UKRI STFC
mon_compact_on_start was not changed from default (false). From the logs, it 
looks like the monitor with the excessive resource usage (mon1) was up and 
winning the majority of elections throughout the period of unresponsiveness, 
with other monitors occasionally winning an election without mon1 participating 
(I’m guessing as it failed to respond).

That’s interesting about the false map updates. We had a short networking blip 
(caused by me) on some monitors shortly before the trouble started, which 
caused some monitors to start calling frequent (every few seconds) elections. 
Could this rapid creation of new monmaps have the same effect as updating pool 
settings? Thus causing the monitor to try and clean up in one go, causing the 
observed resource usage and unresponsiveness.

I've been bringing in the storage as you described; I'm in the process of 
adding 6PB of new storage to a ~10PB (raw) cluster (with ~8PB raw utilisation), 
so I'm feeling around for the largest backfills we can safely do. I had been 
weighting up storage in steps that take ~5 days to finish, but have been 
starting the next reweight as we get to the tail end of the previous one, so 
not giving the mons time to compact their stores. Although it's far from ideal 
(in terms of the total time to get the new storage weighted up), I'll be 
letting the mons compact between every backfill until I have a better idea of 
what went on last week.

From: David Turner <drakonst...@gmail.com>
Sent: 17 May 2018 18:57
To: Byrne, Thomas (STFC,RAL,SC) <tom.by...@stfc.ac.uk>
Cc: Wido den Hollander <w...@42on.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors holding 
onto cluster maps

Generally they clean up slowly by deleting 30 maps every time the maps update.  
You can speed that up by creating false map updates with something like 
updating a pool setting to what it already is.  What it sounds like happened to 
you is that your mon crashed and restarted.  If it crashed and has the setting 
to compact the mon store on start, then it would cause it to forcibly go 
through and clean everything up in 1 go.

I generally plan my backfilling to not take longer than a week.  Any longer 
than that is pretty rough on the mons.  You can achieve that by bringing in new 
storage with a weight of 0.0 and increase it appropriately as opposed to just 
adding it with it's full weight and having everything move at once.

On Thu, May 17, 2018 at 12:56 PM Thomas Byrne - UKRI STFC 
<tom.by...@stfc.ac.uk> wrote:
That seems like a sane way to do it, thanks for the clarification Wido.

As a follow-up, do you have any feeling as to whether the trimming is a 
particularly intensive task? We just had a fun afternoon where the monitors 
became unresponsive (no ceph status etc) for several hours, seemingly due to 
the leaders monitor process consuming all available ram+swap (64GB+32GB) on 
that monitor. This was then followed by the actual trimming of the stores 
(26GB->11GB), which took a few minutes and happened simultaneously across the 
monitors.

If this is something to be expected, it'll be a good reason to plan our long 
backfills much more carefully in the future!

> -Original Message-
> From: ceph-users 
> <ceph-users-boun...@lists.ceph.com> 
> On Behalf Of Wido
> den Hollander
> Sent: 17 May 2018 15:40
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors
> holding onto cluster maps
>
>
>
> On 05/17/2018 04:37 PM, Thomas Byrne - UKRI STFC wrote:
> > Hi all,
> >
> >
> >
> > As far as I understand, the monitor stores will grow while not
> > HEALTH_OK as they hold onto all cluster maps. Is this true for all
> > HEALTH_WARN reasons? Our cluster recently went into HEALTH_WARN
> due to
> > a few weeks of backfilling onto new hardware pushing the monitors data
> > stores over the default 15GB threshold. Are they now prevented from
> > shrinking till I increase the threshold above their current size?
> >
>
> No, monitors will trim their data store with all PGs are active+clean, not 
> when
> they are HEALTH_OK.
>
> So a 'noout' flag triggers a WARN, but that doesn't prevent the MONs from
> trimming for example.
>
> Wido
>
> >
> >
> > Cheers
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list

Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
That seems like a sane way to do it, thanks for the clarification Wido.

As a follow-up, do you have any feeling as to whether the trimming is a 
particularly intensive task? We just had a fun afternoon where the monitors 
became unresponsive (no ceph status etc) for several hours, seemingly due to 
the leaders monitor process consuming all available ram+swap (64GB+32GB) on 
that monitor. This was then followed by the actual trimming of the stores 
(26GB->11GB), which took a few minutes and happened simultaneously across the 
monitors.

If this is something to be expected, it'll be a good reason to plan our long 
backfills much more carefully in the future!

> -Original Message-
> From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Wido
> den Hollander
> Sent: 17 May 2018 15:40
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] A question about HEALTH_WARN and monitors
> holding onto cluster maps
> 
> 
> 
> On 05/17/2018 04:37 PM, Thomas Byrne - UKRI STFC wrote:
> > Hi all,
> >
> >
> >
> > As far as I understand, the monitor stores will grow while not
> > HEALTH_OK as they hold onto all cluster maps. Is this true for all
> > HEALTH_WARN reasons? Our cluster recently went into HEALTH_WARN
> due to
> > a few weeks of backfilling onto new hardware pushing the monitors data
> > stores over the default 15GB threshold. Are they now prevented from
> > shrinking till I increase the threshold above their current size?
> >
> 
> No, monitors will trim their data store with all PGs are active+clean, not 
> when
> they are HEALTH_OK.
> 
> So a 'noout' flag triggers a WARN, but that doesn't prevent the MONs from
> trimming for example.
> 
> Wido
> 
> >
> >
> > Cheers
> >
> > Tom
> >
> >
> >
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
Hi all,

As far as I understand, the monitor stores will grow while the cluster is not 
HEALTH_OK, as they hold onto all cluster maps. Is this true for all HEALTH_WARN 
reasons? Our cluster recently went into HEALTH_WARN due to a few weeks of 
backfilling onto new hardware pushing the monitors' data stores over the default 
15GB threshold. Are they now prevented from shrinking until I increase the 
threshold above their current size?

Cheers
Tom


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com