Re: [ceph-users] Luminous with osd flapping, slow requests when deep scrubbing

2018-10-16 Thread Christian Balzer


Hello,
 
On Tue, 16 Oct 2018 14:09:23 +0100 (BST) Andrei Mikhailovsky wrote:

> Hi Christian,
> 
> 
> - Original Message -
> > From: "Christian Balzer" 
> > To: "ceph-users" 
> > Cc: "Andrei Mikhailovsky" 
> > Sent: Tuesday, 16 October, 2018 08:51:36
> > Subject: Re: [ceph-users] Luminous with osd flapping, slow requests when 
> > deep scrubbing  
> 
> > Hello,
> > 
> > On Mon, 15 Oct 2018 12:26:50 +0100 (BST) Andrei Mikhailovsky wrote:
> >   
> >> Hello,
> >> 
> >> I am currently running Luminous 12.2.8 on Ubuntu with 4.15.0-36-generic 
> >> kernel
> >> from the official ubuntu repo. The cluster has 4 mon + osd servers. Each 
> >> osd
> >> server has a total of 9 spinning osds and 1 ssd for the hdd and ssd 
> >> pools.
> >> The hdds are backed by the S3710 ssds for journaling with a ratio of 1:5. 
> >> The
> >> ssd pool osds are not using external journals. Ceph is used as a Primary
> >> storage for Cloudstack - all vm disk images are stored on the cluster.
> >>  
> > 
> > For the record, are you seeing the flapping only on HDD pools or with SSD
> > pools as well?
> >   
> 
> 
> I think so, this tends to happen to the HDD pool.
>
Fits the picture and expectations.
 
> 
> 
> > When migrating to Bluestore, did you see this starting to happen before
> > the migration was complete (and just on Bluestore OSDs of course)?
> >   
> 
> 
> Nope, not that I can recall. I did have some issues with performance 
> initially, but I've added a few temp disks to the servers to help with the 
> free space. The cluster was well unhappy when the usage spiked above 90% on 
> some of the osds. After the temp disks were in place, the cluster was back to 
> being happy.
>
Well, that's never a good state indeed. 

> 
> 
> > What's your HW like, in particular RAM? Current output of "free"?  
> 
> Each of the mon/osd servers has 64GB of ram. Currently, one of the server's 
> mem usage is (it has been restarted 30 mins ago):
> 
> root@arh-ibstorage4-ib:/home/andrei# free -h
>               total        used        free      shared  buff/cache   available
> Mem:            62G         11G         50G         10M        575M         49G
> Swap:           45G          0B         45G
>
Something with a little more uptime would be more relevant, but at 64GB
and 10 OSDs you'll never get anywhere close to the caching that you had with
filestore when running with default settings.

> 
> The servers with 24 hours uptime have a similar picture, but a slightly 
> larger used amount.
> 
But still nowhere near half let alone near all, right?

> > 
> > If you didn't tune your bluestore cache you're likely just using a
> > fraction of the RAM for caching, making things a LOT harder for OSDs to
> > keep up when compared to filestore and the global (per node) page cache.
> >   
> 
> I haven't done any bluestore cache changes at all after moving to the 
> bluestore type. Could you please point me in the right direction?
> 
Google is your friend: "bluestore cache" finds this as the first hit on this
ML. Read that thread, the referenced documentation and the other threads.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/029449.html
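
For illustration, a minimal sketch of what that tuning could look like in
ceph.conf; the option names are the Luminous bluestore cache settings, but
the sizes below are made-up placeholders that you would derive from your own
RAM budget (roughly 64GB across 10 OSDs here, minus OSD overhead):

[osd]
# Per-OSD BlueStore cache (IIRC the defaults are ~1GB for HDD and ~3GB for SSD OSDs)
bluestore_cache_size_hdd = 2147483648   # 2 GiB per HDD OSD, placeholder value
bluestore_cache_size_ssd = 4294967296   # 4 GiB per SSD OSD, placeholder value

The OSDs need a restart to pick this up.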

> 
> > See the various bluestore cache threads here, one quite recently.
> > 
> > If your cluster was close to the brink with filestore just moving it to
> > bluestore would nicely fit into what you're seeing, especially for the
> > high stress and cache bypassing bluestore deep scrubbing.
> >   
> 
> 
> I have put in place the following config settings in the [global] section:
> 
> 
> # Settings to try to minimise IO client impact / slow requests / osd flapping 
> from scrubbing and snap trimming
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 5
> #osd_scrub_begin_hour = 21
> #osd_scrub_end_hour = 5
> osd_scrub_sleep = 0.1
> osd_scrub_max_interval = 1209600
> osd_scrub_min_interval = 259200
> osd_deep_scrub_interval = 1209600
> osd_deep_scrub_stride = 1048576
> osd_scrub_priority = 1
> osd_snap_trim_priority = 1
> 
> 
> Following the restart of the servers and doing a few tests by manually 
> invoking 6 deep scrubbing processes I haven't seen any more issues with osd 
> flapping or the slow requests. I will keep an eye on it over the next few 
> weeks to see if the issue is resolved.
>
Yes, tuning deep scrubs way down is an obvious way forward, and with
bluestore they're less relevant to begin with.
Also note that AFAIK with bluestore a deep scrub will bypass all caches,
impacting things even harder.
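
For completeness, a quick sketch of how one could verify that the new scrub
settings actually took effect on a running OSD and kick off a deep scrub by
hand for testing (osd.0 and the PG id are placeholders):

ceph daemon osd.0 config show | grep -E 'osd_scrub|osd_deep_scrub'
ceph osd deep-scrub osd.0    # deep-scrub the PGs that have osd.0 as primary
ceph pg deep-scrub 2.1f      # or just a single PG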

However, what this documents (really confirms) is that in going to bluestore
you will incur several performance penalties along the way, some of them
only addressable with more HW (RAM) at this point in time.

Regards,

Christian
 
> 
> 
> > Regards,
> > 
> > Christian  
> >> I have recently migrated all osds to the bluestore, which was a long 
> >> process
> >> with ups and downs, but I am happy to say that the migration is done. 
> >> During
> >> the migration I've disabled the scrubbing (both deep 

Re: [ceph-users] Resolving Large omap objects in RGW index pool

2018-10-16 Thread Chris Sarginson
Hi,

Having spent some time on the below issue, here are the steps I took to
resolve the "Large omap objects" warning.  Hopefully this will help others
who find themselves in this situation.

I got the object ID and OSD ID implicated from the ceph cluster logfile on
the mon.  I then proceeded to the implicated host containing the OSD, and
extracted the implicated PG by running the following, and looking at which
PG had started and completed a deep-scrub around the warning being logged:

grep -C 200 Large /var/log/ceph/ceph-osd.*.log | egrep '(Large omap|deep-scrub)'

If the bucket had not been sharded sufficiently (i.e. the cluster log showed
a "Key Count" or "Size" over the thresholds), I ran through the manual
sharding procedure (shown here:
https://tracker.ceph.com/issues/24457#note-5)

Once this was successfully sharded, or if the bucket had already been
sufficiently sharded by Ceph prior to disabling the functionality, I was
able to use the following command (seemingly undocumented for Luminous,
http://docs.ceph.com/docs/mimic/man/8/radosgw-admin/#commands):

radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}

I then issued a ceph pg deep-scrub against the PG that had contained the
Large omap object.

Once I had completed this procedure, my Large omap object warnings went
away and the cluster returned to HEALTH_OK.
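
For anyone following along, the core of the procedure boils down to roughly
the following (a sketch; the variables are placeholders and the PG id comes
from the OSD log as described above):

radosgw-admin metadata get bucket:${bucketname}      # the "bucket_id" here is the live index
radosgw-admin bucket stats --bucket ${bucketname}    # cross-check "id" vs "marker"
radosgw-admin bi purge --bucket ${bucketname} --bucket-id ${old_bucket_id}
ceph pg deep-scrub ${pgid}                           # re-scrub the PG that held the large omap object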

However our radosgw bucket indexes pool now seems to be using substantially
more space than previously.  Having looked initially at this bug, and in
particular the first comment:

http://tracker.ceph.com/issues/34307#note-1

I was able to extract a number of bucket indexes that had apparently been
resharded, and removed the legacy index using radosgw-admin bi purge
--bucket ${bucket} ${marker}.  I am still able to perform a radosgw-admin
metadata get bucket.instance:${bucket}:${marker} successfully; however, now
when I run rados -p .rgw.buckets.index ls | grep ${marker} nothing is
returned.  Even after this, we were still seeing extremely high disk usage
on the OSDs containing the bucket indexes (we have a dedicated pool for
this).  I then modified the one-liner referenced in the previous link as
follows:

grep -E '"bucket"|"id"|"marker"' bucket-stats.out | awk -F ":" '{print $2}' | tr -d '",' | \
while read -r bucket; do
  read -r id; read -r marker
  [ "$id" == "$marker" ] && true || \
    NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${marker} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
  while [ ${NEWID} ]; do
    if [ "${NEWID}" != "${marker}" ] && [ ${NEWID} != ${bucket} ]; then
      echo "$bucket $NEWID"
    fi
    NEWID=`radosgw-admin --id rgw.ceph-rgw-1 metadata get bucket.instance:${bucket}:${NEWID} | python -c 'import sys, json; print json.load(sys.stdin)["data"]["bucket_info"]["new_bucket_instance_id"]'`
  done
done > buckets_with_multiple_reindexes2.txt

This loops through the buckets that have a different marker/bucket_id, and
looks to see if a new_bucket_instance_id is there, and if so will loop
through until there is no longer a "new_bucket_instance_id".  After letting
this complete, this suggests that I have over 5000 indexes for 74 buckets,
some of these buckets have > 100 indexes apparently.

:~# awk '{print $1}' buckets_with_multiple_reindexes2.txt | uniq | wc -l
74
~# wc -l buckets_with_multiple_reindexes2.txt
5813 buckets_with_multiple_reindexes2.txt

This is running a single-realm, multiple-zone configuration with no
multi-site sync, but the closest I can find to this issue is this bug:
https://tracker.ceph.com/issues/24603

Should I be OK to loop through these indexes and remove any with a
reshard_status of 2 and a new_bucket_instance_id that does not match the
bucket_instance_id returned by the command:

radosgw-admin bucket stats --bucket ${bucket}

I'd ideally like to get to a point where I can turn dynamic sharding back
on safely for this cluster.
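
For reference, re-enabling it later would presumably be something along these
lines (a sketch, assuming dynamic resharding was originally disabled via
rgw_dynamic_resharding):

# ceph.conf on the RGW hosts
[client.rgw.ceph-rgw-1]
rgw_dynamic_resharding = true

# after restarting the RGWs, keep an eye on the reshard queue
radosgw-admin reshard list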

Thanks for any assistance, let me know if there's any more information I
should provide
Chris

On Thu, 4 Oct 2018 at 18:22 Chris Sarginson  wrote:

> Hi,
>
> Thanks for the response - I am still unsure as to what will happen to the
> "marker" reference in the bucket metadata, as this is the object that is
> being detected as Large.  Will the bucket generate a new "marker" reference
> in the bucket metadata?
>
> I've been reading this page to try and get a better understanding of this
> http://docs.ceph.com/docs/luminous/radosgw/layout/
>
> However I'm no clearer on this (and what the "marker" is used for), or why
> there are multiple separate "bucket_id" values (with different mtime
> stamps) that all show as having the same number of shards.
>
> If I were to remove the old bucket would I just be looking to execute
>
rados -p .rgw.buckets.index rm .dir.default.5689810.107
>
> Is the differing marker/bucket_id in the other buckets that was found also
> an indicator?  As I say, there's a good number of these, here's some
> 

Re: [ceph-users] How to debug problem in MDS ?

2018-10-16 Thread Sergey Malinin
Are you running multiple active MDS daemons?
On the MDS host, issue "ceph daemon mds.X config set debug_mds 20" for maximum 
logging verbosity.
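
For illustration, a rough sketch of a capture session (mds.X is a
placeholder; set the debug levels before reproducing, since the admin socket
tends to become unusable once the MDS is spinning):

ceph daemon mds.X config set debug_mds 20
ceph daemon mds.X config set debug_ms 1
# reproduce the hang, then grab whatever still responds:
ceph daemon mds.X dump_ops_in_flight
ceph daemon mds.X perf dump
# the verbose log ends up in /var/log/ceph/ceph-mds.X.log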

> On 16.10.2018, at 19:23, Florent B  wrote:
> 
> Hi,
> 
> A few months ago I sent a message to that list about a problem with a
> Ceph + Dovecot setup.
> 
> Bug disappeared and I didn't answer to the thread.
> 
> Now the bug has come again (Luminous up-to-date cluster + Dovecot
> up-to-date + Debian Stretch up-to-date).
> 
> I know how to reproduce it, but it seems very related to my user's
> Dovecot data (a few GB) and to the file locking system (the bug occurs
> when I set the locking method to "fcntl" or "flock" in Dovecot, but not
> with "dotlock").
> 
> It ends with an unresponsive MDS (100% CPU hang, switching to another MDS
> but always staying at 100% CPU usage). I can't even use the admin socket
> when the MDS is hung.
> 
> I would like to know *exactly* which information you need to
> investigate that bug (which commands, when, how to report large log
> files...).
> 
> Thank you.
> 
> Florent
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] warning: fast-diff map is invalid operation may be slow; object map invalid

2018-10-16 Thread Jason Dillaman
On Mon, Oct 15, 2018 at 4:04 PM Anthony D'Atri  wrote:
>
>
> We turned on all the RBD v2 features while running Jewel; since then all 
> clusters have been updated to Luminous 12.2.2 and additional clusters added 
> that have never run Jewel.
>
> Today I find that a few percent of volumes in each cluster have issues, 
> examples below.
>
> I'm concerned that these issues may present problems when using rbd-mirror to 
> move volumes between clusters.  Many instances involve heads or nodes of 
> snapshot trees; it's possible but unverified that those not currently 
> snap-related may have been in the past.
>
> In the Jewel days we retroactively applied fast-diff, object-map to existing 
> volumes but did not bother with tombstones.
>
> Any thoughts on
>
> 1) How this happens?

If you enabled object-map and/or fast-diff on pre-existing images,
then the object-map is automatically flagged as invalid since just
enabling the feature doesn't rebuild the object-map. This just
instructs librbd clients not to trust the object-map so all
optimizations are disabled.
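
For example (a sketch, with placeholder pool/image names), the invalid state
shows up in the image's flags rather than in its feature list:

rbd info mypool/myimage
#   features: layering, exclusive-lock, object-map, fast-diff, ...
#   flags: object map invalid, fast diff invalid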

> 2) Is "rbd object-map rebuild" always safe, especially on volumes that are in 
> active use?

Yes, the live-rebuild of the HEAD image is just proxied over to the
current exclusive-lock owner. Rebuilds of any snapshot object-maps are
performed by the rbd CLI.
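
So a rebuild pass would look roughly like this (again placeholder names; per
the above, the HEAD rebuild is proxied to the lock owner while snapshot
object maps are handled by the CLI itself):

rbd object-map rebuild mypool/myimage          # HEAD
rbd object-map rebuild mypool/myimage@snap1    # a specific snapshot, if needed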

> 3) The disturbing messages spewed by `rbd ls` -- related or not?

Some of the errors spewed by "rbd ls" are not specifically related to
the object-map feature. For example, it appears that you have at least
two cloned images where the parent image snapshot is no longer
available (librbd::image::RefreshParentRequest: failed to locate
snapshot). It also appears that at least two of the images in your RBD
directory don't exist (librbd::image::OpenRequest: failed to retreive
immutable metadata).

However, for the "librbd::object_map::RefreshRequest: failed to load
object map" logs, those are harmless if you enabled the object-map
after the snapshot was created and haven't rebuilt the object map yet.

> 4) Would this as I fear confound successful rbd-mirror migration?

Nope -- rbd-mirror uses the journal for synchronization.

> I've found 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-August/012137.html 
> that *seems* to indicate that a live rebuild is safe, but I'm still uncertain 
> about the root cause, and if it's still happening.  I've never ventured into 
> this dark corner before so I'm being careful.
>
> All clients are QEMU/libvirt; most are 12.2.2 but there are some lingering 
> Jewel, most likely 10.2.6 or perhaps 10.2.3.  Eg:
>
>
> # ceph features
> {
>     "mon": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 5
>         }
>     },
>     "osd": {
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 983
>         }
>     },
>     "client": {
>         "group": {
>             "features": "0x7fddff8ee84bffb",
>             "release": "jewel",
>             "num": 15
>         },
>         "group": {
>             "features": "0x1ffddff8eea4fffb",
>             "release": "luminous",
>             "num": 3352
>         }
>     }
> }
>
>
> # rbd ls  -l |wc
> 2018-10-05 20:55:17.397288 7f976cff9700 -1 
> librbd::image::RefreshParentRequest: failed to locate snapshot: Snapshot with 
> this id not found
> 2018-10-05 20:55:17.397334 7f976cff9700 -1 librbd::image::RefreshRequest: 
> failed to refresh parent image: (2) No such file or directory
> 2018-10-05 20:55:17.397397 7f976cff9700 -1 librbd::image::OpenRequest: failed 
> to refresh image: (2) No such file or directory
> 2018-10-05 20:55:17.398025 7f976cff9700 -1 librbd::io::AioCompletion: 
> 0x7f978667b570 fail: (2) No such file or directory
> 2018-10-05 20:55:17.398075 7f976cff9700 -1 
> librbd::image::RefreshParentRequest: failed to locate snapshot: Snapshot with 
> this id not found
> 2018-10-05 20:55:17.398079 7f976cff9700 -1 librbd::image::RefreshRequest: 
> failed to refresh parent image: (2) No such file or directory
> 2018-10-05 20:55:17.398096 7f976cff9700 -1 librbd::image::OpenRequest: failed 
> to refresh image: (2) No such file or directory
> 2018-10-05 20:55:17.398659 7f976cff9700 -1 librbd::io::AioCompletion: 
> 0x7f978660c240 fail: (2) No such file or directory
> 2018-10-05 20:55:30.416174 7f976cff9700 -1 librbd::io::AioCompletion: 
> 0x7f9786cd5ee0 fail: (2) No such file or directory
> 2018-10-05 20:55:34.083188 7f976d7fa700 -1 
> librbd::object_map::RefreshRequest: failed to load object map: 
> rbd_object_map.b18d634146825.2d8f
> 2018-10-05 20:55:34.084101 7f976cff9700 -1 
> librbd::object_map::InvalidateRequest: 0x7f97544d11e0 should_complete: r=0
> 2018-10-05 20:55:38.597014 7f976d7fa700 -1 librbd::image::OpenRequest: failed 
> to retreive immutable metadata: (2) No such file or directory
> 2018-10-05 20:55:38.597109 7f976cff9700 -1 

Re: [ceph-users] Ceph mds is stuck in creating status

2018-10-16 Thread Kisik Jeong
Oh my god. The network configuration was the problem. After fixing the
network configuration, I successfully created CephFS. Thank you very much.

-Kisik

2018년 10월 16일 (화) 오후 9:58, John Spray 님이 작성:

> On Mon, Oct 15, 2018 at 7:15 PM Kisik Jeong 
> wrote:
> >
> > I attached osd & fs dumps. There are two pools (cephfs_data,
> cephfs_metadata) for CephFS clearly. And this system's network is 40Gbps
> ethernet for public & cluster. So I don't think the network speed would be
> a problem. Thank you.
>
> Ah, your pools do exist, I had just been looking at the start of the
> MDS log where it hadn't seen the osdmap yet.
>
> Looking again at your original log together with your osdmap, I notice
> that your stuck operations are targeting OSDs 10,11,13,14,15, and all
> these OSDs have public addresses in the 192.168.10.x range rather than
> the 192.168.40.x range like the others.
>
> So my guess would be that you are intending your OSDs to be in the
> 192.168.40.x range, but are missing some config settings for certain
> daemons.
>
> John
>
>
> > 2018년 10월 16일 (화) 오전 1:18, John Spray 님이 작성:
> >>
> >> On Mon, Oct 15, 2018 at 4:24 PM Kisik Jeong 
> wrote:
> >> >
> >> > Thank you for your reply, John.
> >> >
> >> > I  restarted my Ceph cluster and captured the mds logs.
> >> >
> >> > I found that mds shows slow request because some OSDs are laggy.
> >> >
> >> > I followed the ceph mds troubleshooting with 'mds slow request', but
> there is no operation in flight:
> >> >
> >> > root@hpc1:~/iodc# ceph daemon mds.hpc1 dump_ops_in_flight
> >> > {
> >> > "ops": [],
> >> > "num_ops": 0
> >> > }
> >> >
> >> > Is there any other reason that mds shows slow request? Thank you.
> >>
> >> Those stuck requests seem to be stuck because they're targeting pools
> >> that don't exist.  Has something strange happened in the history of
> >> this cluster that might have left a filesystem referencing pools that
> >> no longer exist?  Ceph is not supposed to permit removal of pools in
> >> use by CephFS, but perhaps something went wrong.
> >>
> >> Check out the "ceph osd dump --format=json-pretty" and "ceph fs dump
> >> --format=json-pretty" outputs and how the pool ID's relate.  According
> >> to those logs, data pool with ID 1 and metadata pool with ID 2 do not
> >> exist.
> >>
> >> John
> >>
> >> > -Kisik
> >> >
> >> > 2018년 10월 15일 (월) 오후 11:43, John Spray 님이 작성:
> >> >>
> >> >> On Mon, Oct 15, 2018 at 3:34 PM Kisik Jeong <
> kisik.je...@csl.skku.edu> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > I successfully deployed Ceph cluster with 16 OSDs and created
> CephFS before.
> >> >> > But after rebooting due to mds slow request problem, when creating
> CephFS, Ceph mds goes creating status and never changes.
> >> >> > Seeing Ceph status, there is no other problem I think. Here is
> 'ceph -s' result:
> >> >>
> >> >> That's pretty strange.  Usually if an MDS is stuck in "creating",
> it's
> >> >> because an OSD operation is stuck, but in your case all your PGs are
> >> >> healthy.
> >> >>
> >> >> I would suggest setting "debug mds=20" and "debug objecter=10" on
> your
> >> >> MDS, restarting it and capturing those logs so that we can see where
> >> >> it got stuck.
> >> >>
> >> >> John
> >> >>
> >> >> > csl@hpc1:~$ ceph -s
> >> >> >   cluster:
> >> >> > id: 1a32c483-cb2e-4ab3-ac60-02966a8fd327
> >> >> > health: HEALTH_OK
> >> >> >
> >> >> >   services:
> >> >> > mon: 1 daemons, quorum hpc1
> >> >> > mgr: hpc1(active)
> >> >> > mds: cephfs-1/1/1 up  {0=hpc1=up:creating}
> >> >> > osd: 16 osds: 16 up, 16 in
> >> >> >
> >> >> >   data:
> >> >> > pools:   2 pools, 640 pgs
> >> >> > objects: 7 objects, 124B
> >> >> > usage:   34.3GiB used, 116TiB / 116TiB avail
> >> >> > pgs: 640 active+clean
> >> >> >
> >> >> > However, CephFS still works in case of 8 OSDs.
> >> >> >
> >> >> > If there is any doubt of this phenomenon, please let me know.
> Thank you.
> >> >> >
> >> >> > PS. I attached my ceph.conf contents:
> >> >> >
> >> >> > [global]
> >> >> > fsid = 1a32c483-cb2e-4ab3-ac60-02966a8fd327
> >> >> > mon_initial_members = hpc1
> >> >> > mon_host = 192.168.40.10
> >> >> > auth_cluster_required = cephx
> >> >> > auth_service_required = cephx
> >> >> > auth_client_required = cephx
> >> >> >
> >> >> > public_network = 192.168.40.0/24
> >> >> > cluster_network = 192.168.40.0/24
> >> >> >
> >> >> > [osd]
> >> >> > osd journal size = 1024
> >> >> > osd max object name len = 256
> >> >> > osd max object namespace len = 64
> >> >> > osd mount options f2fs = active_logs=2
> >> >> >
> >> >> > [osd.0]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.1]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.2]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.3]
> >> >> > host = hpc10
> >> 

[ceph-users] weekly report 41(ifed)

2018-10-16 Thread Igor Fedotov


[red]:

[amber]:

[green]:


* Almost done with: os/bluestore: allow ceph-bluestore-tool to coalesce 
BlueFS backing volumes [#core]
https://github.com/ceph/ceph/pull/23103 

Got a preliminary approval from Sage but still working on bringing in 
the latest updates.


* Submitted : os/bluestore: debug_omit_block_device_write isn't always 
respected [#core]
https://github.com/ceph/ceph/pull/24545 



* Did some investigation and proposed a redesign to fix the inability to 
recover from ENOSPC in BlueFS. [#core]

https://tracker.ceph.com/issues/36268
https://marc.info/?l=ceph-devel&m=153935051621191&w=2
Got positive feedback, will probably start implementing soon.

* Started to track and collect information on a potential BlueStore 
issue which results in slowly responding OSDs and high disk utilisation 
under scrubbing and probably other read load. [#core]

The corresponding ticket is:
https://tracker.ceph.com/issues/36284
We've seen somewhat similar behavior for CONA and IIRC another customer.
Also I could see several issues with similar symptoms on ceph-users.
Will try to investigate a bit more with Wido (a community member) who 
has a suffering cluster at the moment.







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disabling RGW Encryption support in Luminous

2018-10-16 Thread Casey Bodley
That's not currently possible, no. And I don't think it's a good idea to 
add such a feature; if the client requests that something be encrypted, 
the server should either encrypt it or reject the request.


There is a config called rgw_crypt_s3_kms_encryption_keys that we use 
for testing, though, which allows you to specify a mapping of kms keyids 
to actual keys. If your client is using a limited number of kms keyids, 
you can provide keys for them and get limited sse-kms support without 
setting up an actual kms.


For example, this is our test configuration for use with s3tests:

rgw crypt s3 kms encryption keys = 
testkey-1=YmluCmJvb3N0CmJvb3N0LWJ1aWxkCmNlcGguY29uZgo= 
testkey-2=aWIKTWFrZWZpbGUKbWFuCm91dApzcmMKVGVzdGluZwo=


Where s3tests is sending requests with the header 
x-amz-server-side-encryption-aws-kms-key-id: testkey-1 or testkey-2.
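
For reference, from the client side such a request would look something like
this with the aws CLI (a sketch; the endpoint and bucket are placeholders):

aws s3 cp ./backup.tar s3://mybucket/backup.tar \
    --endpoint-url http://rgw.example.com \
    --sse aws:kms --sse-kms-key-id testkey-1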


I hope that helps!
Casey

On 10/16/18 8:43 AM, Arvydas Opulskis wrote:

Hi,

got no success on IRC, maybe someone will help me here.

After the RGW upgrade from Jewel to Luminous, one S3 user started to 
receive errors from his PostgreSQL wal-e solution. The error is like this: 
"Server Side Encryption with KMS managed key requires HTTP header 
x-amz-server-side-encryption : aws:kms".
After some reading, it seems like this client is forcing Server Side 
Encryption (SSE) on RGW and it is not configured. Because the user can't 
disable encryption in his solution for now (it will be possible in a 
future release), can I somehow disable encryption support on Luminous 
RGW?


Thank you for your insights.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous with osd flapping, slow requests when deep scrubbing

2018-10-16 Thread Andrei Mikhailovsky
Hi Christian,


- Original Message -
> From: "Christian Balzer" 
> To: "ceph-users" 
> Cc: "Andrei Mikhailovsky" 
> Sent: Tuesday, 16 October, 2018 08:51:36
> Subject: Re: [ceph-users] Luminous with osd flapping, slow requests when deep 
> scrubbing

> Hello,
> 
> On Mon, 15 Oct 2018 12:26:50 +0100 (BST) Andrei Mikhailovsky wrote:
> 
>> Hello,
>> 
>> I am currently running Luminous 12.2.8 on Ubuntu with 4.15.0-36-generic 
>> kernel
>> from the official ubuntu repo. The cluster has 4 mon + osd servers. Each osd
>> server has a total of 9 spinning osds and 1 ssd for the hdd and ssd pools.
>> The hdds are backed by the S3710 ssds for journaling with a ratio of 1:5. 
>> The
>> ssd pool osds are not using external journals. Ceph is used as a Primary
>> storage for Cloudstack - all vm disk images are stored on the cluster.
>>
> 
> For the record, are you seeing the flapping only on HDD pools or with SSD
> pools as well?
> 


I think so, this tends to happen to the HDD pool.



> When migrating to Bluestore, did you see this starting to happen before
> the migration was complete (and just on Bluestore OSDs of course)?
> 


Nope, not that I can recall. I did have some issues with performance initially, 
but I've added a few temp disks to the servers to help with the free space. The 
cluster was well unhappy when the usage spiked above 90% on some of the osds. 
After the temp disks were in place, the cluster was back to being happy.



> What's your HW like, in particular RAM? Current output of "free"?

Each of the mon/osd servers has 64GB of ram. Currently, one of the server's mem 
usage is (it has been restarted 30 mins ago):

root@arh-ibstorage4-ib:/home/andrei# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         11G         50G         10M        575M         49G
Swap:           45G          0B         45G


The servers with 24 hours uptime have a similar picture, but a slightly larger 
used amount.

> 
> If you didn't tune your bluestore cache you're likely just using a
> fraction of the RAM for caching, making things a LOT harder for OSDs to
> keep up when compared to filestore and the global (per node) page cache.
> 

I haven't done any bluestore cache changes at all after moving to the bluestore 
type. Could you please point me in the right direction?


> See the various bluestore cache threads here, one quite recently.
> 
> If your cluster was close to the brink with filestore just moving it to
> bluestore would nicely fit into what you're seeing, especially for the
> high stress and cache bypassing bluestore deep scrubbing.
> 


I have put in place the following config settings in the [global] section:


# Settings to try to minimise IO client impact / slow requests / osd flapping 
from scrubbing and snap trimming
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
#osd_scrub_begin_hour = 21
#osd_scrub_end_hour = 5
osd_scrub_sleep = 0.1
osd_scrub_max_interval = 1209600
osd_scrub_min_interval = 259200
osd_deep_scrub_interval = 1209600
osd_deep_scrub_stride = 1048576
osd_scrub_priority = 1
osd_snap_trim_priority = 1


Following the restart of the servers and doing a few tests by manually invoking 
6 deep scrubbing processes I haven't seen any more issues with osd flapping or 
the slow requests. I will keep an eye on it over the next few weeks to see if 
the issue is resolved.



> Regards,
> 
> Christian
>> I have recently migrated all osds to the bluestore, which was a long process
>> with ups and downs, but I am happy to say that the migration is done. During
>> the migration I've disabled the scrubbing (both deep and standard). After
>> reenabling the scrubbing I have noticed the cluster started having a large
>> number of slow requests and poor client IO (to the point of vms stalling for
>> minutes). Further investigation showed that the slow requests happen because 
>> of
>> the osds flapping. In a single day my logs have over 1000 entries which 
>> report
>> osd going down. This affects random osds. Disabling deep-scrubbing stabilises
>> the cluster and the osds no longer flap and the slow requests disappear. 
>> As
>> a short-term solution I've disabled the deep-scrubbing, but was hoping to fix
>> the issues with your help.
>> 
>> At the moment, I am running the cluster with default settings apart from the
>> following settings:
>> 
>> [global]
>> osd_disk_thread_ioprio_priority = 7
>> osd_disk_thread_ioprio_class = idle
>> osd_recovery_op_priority = 1
>> 
>> [osd]
>> debug_ms = 0
>> debug_auth = 0
>> debug_osd = 0
>> debug_bluestore = 0
>> debug_bluefs = 0
>> debug_bdev = 0
>> debug_rocksdb = 0
>> 
>> 
>> Could you share experiences with deep scrubbing of bluestore osds? Are there 
>> any
>> options that I should set to make sure the osds are not flapping and the 
>> client
>> IO is still available?
>> 
>> Thanks
>> 
>> Andrei
> 
> 
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten 

Re: [ceph-users] Ceph mds is stuck in creating status

2018-10-16 Thread John Spray
On Mon, Oct 15, 2018 at 7:15 PM Kisik Jeong  wrote:
>
> I attached osd & fs dumps. There are two pools (cephfs_data, cephfs_metadata) 
> for CephFS clearly. And this system's network is 40Gbps ethernet for public & 
> cluster. So I don't think the network speed would be a problem. Thank you.

Ah, your pools do exist, I had just been looking at the start of the
MDS log where it hadn't seen the osdmap yet.

Looking again at your original log together with your osdmap, I notice
that your stuck operations are targeting OSDs 10,11,13,14,15, and all
these OSDs have public addresses in the 192.168.10.x range rather than
the 192.168.40.x range like the others.

So my guess would be that you are intending your OSDs to be in the
192.168.40.x range, but are missing some config settings for certain
daemons.
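
For illustration, one way to confirm which addresses each OSD actually
registered with (a sketch):

ceph osd dump | grep '^osd\.'

Each line shows the public and cluster address per OSD; any OSD still on
192.168.10.x would need its public_addr/cluster_addr fixed in ceph.conf (or
the interface/routing corrected) and that OSD daemon restarted.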

John


> 2018년 10월 16일 (화) 오전 1:18, John Spray 님이 작성:
>>
>> On Mon, Oct 15, 2018 at 4:24 PM Kisik Jeong  wrote:
>> >
>> > Thank you for your reply, John.
>> >
>> > I  restarted my Ceph cluster and captured the mds logs.
>> >
>> > I found that mds shows slow request because some OSDs are laggy.
>> >
>> > I followed the ceph mds troubleshooting with 'mds slow request', but there 
>> > is no operation in flight:
>> >
>> > root@hpc1:~/iodc# ceph daemon mds.hpc1 dump_ops_in_flight
>> > {
>> > "ops": [],
>> > "num_ops": 0
>> > }
>> >
>> > Is there any other reason that mds shows slow request? Thank you.
>>
>> Those stuck requests seem to be stuck because they're targeting pools
>> that don't exist.  Has something strange happened in the history of
>> this cluster that might have left a filesystem referencing pools that
>> no longer exist?  Ceph is not supposed to permit removal of pools in
>> use by CephFS, but perhaps something went wrong.
>>
>> Check out the "ceph osd dump --format=json-pretty" and "ceph fs dump
>> --format=json-pretty" outputs and how the pool ID's relate.  According
>> to those logs, data pool with ID 1 and metadata pool with ID 2 do not
>> exist.
>>
>> John
>>
>> > -Kisik
>> >
>> > 2018년 10월 15일 (월) 오후 11:43, John Spray 님이 작성:
>> >>
>> >> On Mon, Oct 15, 2018 at 3:34 PM Kisik Jeong  
>> >> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > I successfully deployed Ceph cluster with 16 OSDs and created CephFS 
>> >> > before.
>> >> > But after rebooting due to mds slow request problem, when creating 
>> >> > CephFS, Ceph mds goes creating status and never changes.
>> >> > Seeing Ceph status, there is no other problem I think. Here is 'ceph 
>> >> > -s' result:
>> >>
>> >> That's pretty strange.  Usually if an MDS is stuck in "creating", it's
>> >> because an OSD operation is stuck, but in your case all your PGs are
>> >> healthy.
>> >>
>> >> I would suggest setting "debug mds=20" and "debug objecter=10" on your
>> >> MDS, restarting it and capturing those logs so that we can see where
>> >> it got stuck.
>> >>
>> >> John
>> >>
>> >> > csl@hpc1:~$ ceph -s
>> >> >   cluster:
>> >> > id: 1a32c483-cb2e-4ab3-ac60-02966a8fd327
>> >> > health: HEALTH_OK
>> >> >
>> >> >   services:
>> >> > mon: 1 daemons, quorum hpc1
>> >> > mgr: hpc1(active)
>> >> > mds: cephfs-1/1/1 up  {0=hpc1=up:creating}
>> >> > osd: 16 osds: 16 up, 16 in
>> >> >
>> >> >   data:
>> >> > pools:   2 pools, 640 pgs
>> >> > objects: 7 objects, 124B
>> >> > usage:   34.3GiB used, 116TiB / 116TiB avail
>> >> > pgs: 640 active+clean
>> >> >
>> >> > However, CephFS still works in case of 8 OSDs.
>> >> >
>> >> > If there is any doubt of this phenomenon, please let me know. Thank you.
>> >> >
>> >> > PS. I attached my ceph.conf contents:
>> >> >
>> >> > [global]
>> >> > fsid = 1a32c483-cb2e-4ab3-ac60-02966a8fd327
>> >> > mon_initial_members = hpc1
>> >> > mon_host = 192.168.40.10
>> >> > auth_cluster_required = cephx
>> >> > auth_service_required = cephx
>> >> > auth_client_required = cephx
>> >> >
>> >> > public_network = 192.168.40.0/24
>> >> > cluster_network = 192.168.40.0/24
>> >> >
>> >> > [osd]
>> >> > osd journal size = 1024
>> >> > osd max object name len = 256
>> >> > osd max object namespace len = 64
>> >> > osd mount options f2fs = active_logs=2
>> >> >
>> >> > [osd.0]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.1]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.2]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.3]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.4]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.5]
>> >> > host = hpc10
>> >> > public_addr = 192.168.40.19
>> >> > cluster_addr = 192.168.40.19
>> >> >
>> >> > [osd.6]
>> >> > host = hpc9
>> >> > public_addr = 192.168.40.18
>> >> > cluster_addr = 192.168.40.18
>> >> >
>> >> > [osd.7]
>> >> > 

[ceph-users] Disabling RGW Encryption support in Luminous

2018-10-16 Thread Arvydas Opulskis
Hi,

got no success on IRC, maybe someone will help me here.

After the RGW upgrade from Jewel to Luminous, one S3 user started to receive
errors from his PostgreSQL wal-e solution. The error is like this: "Server Side
Encryption with KMS managed key requires HTTP header
x-amz-server-side-encryption : aws:kms".
After some reading, it seems like this client is forcing Server Side
Encryption (SSE) on RGW and it is not configured. Because the user can't
disable encryption in his solution for now (it will be possible in a future
release), can I somehow disable encryption support on Luminous RGW?

Thank you for your insights.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-16 Thread Wido den Hollander


On 10/16/18 11:32 AM, Igor Fedotov wrote:
> 
> 
> On 10/16/2018 6:57 AM, Wido den Hollander wrote:
>>
>> On 10/16/2018 12:04 AM, Igor Fedotov wrote:
>>> On 10/15/2018 11:47 PM, Wido den Hollander wrote:
 Hi,

 On 10/15/2018 10:43 PM, Igor Fedotov wrote:
> Hi Wido,
>
> once you apply the PR you'll probably see the initial error in the log
> that triggers the dump. Which is most probably the lack of space
> reported by _balance_bluefs_freespace() function. If so this means
> that
> BlueFS rebalance is unable to allocate contiguous 1M chunk at main
> device to gift to BlueFS. I.e. your main device space is very
> fragmented.
>
> Unfortunately I don't know any ways to recover from this state but OSD
> redeployment or data removal.
>
 We are moving data away from these OSDs. Lucky us that we have HDD OSDs
 in there as well, moving a lot of data there.

 How would re-deployment work? Just wipe the OSDs and bring them back
 into the cluster again? That would be a very painful task.. :-(
>>> Good chances that you'll face the same issue again one day.
>>> May be allocate some SSDs to serve as DB devices?
>> Maybe, but this is a very common use-case where people run WAL+DB+DATA
>> on a single SSD.
> Yeah, but I'd consider that as a workaround until better solution is
> provided.
> Before your case this high fragmentation issue had been rather a
> theoretical outlook with a very small probability.
> 

Understood. What I see now is that after offloading the RGW bucket
indexes from these OSDs there is free space.

The OSDs are now randomly going to 100% disk utilisation, reading heavily
from their disks.

Cranking up the logging shows me:

2018-10-16 11:50:20.325674 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cd93c~f0b
2018-10-16 11:50:20.325836 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547ce847~ef9
2018-10-16 11:50:20.325997 7fa52b9a4700  5 bdev(0x5627cc185200
/var/lib/ceph/osd/ceph-118/block) read_random 0xe7547cf740

bluefs / bdev seem to be reading very heavily. From what I can tell they
are discarding OMAP data from RocksDB, which causes a lot of reads on BlueFS.
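
For reference, raising the bdev/bluefs debug levels at runtime can be done
along these lines (a sketch, with osd.118 taken from the log path above):

ceph tell osd.118 injectargs '--debug_bluefs 5/5 --debug_bdev 5/5'
# and back to the defaults afterwards:
ceph tell osd.118 injectargs '--debug_bluefs 1/5 --debug_bdev 1/3'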

>> Now we are running into it, but aren't the chances big other people will
>> run into it as well?
>>
> Upcoming PR that brings an ability for offline BlueFS volume
> manipulation (https://github.com/ceph/ceph/pull/23103) will probably
> help to recover from this issue in future by migrating BlueFS data
> to a
> new larger DB volume. (targeted for Nautilus, not sure about
> backporting
> to Mimic or Luminous).
>
> For now I can suggest the only preventive mean to avoid the case -
> have
> large enough space at your standalone DB volume. So that master device
> isn't used for DB at all or as minimum as possible. Hence no rebalance
> is needed and no fragmentation is present.
>
 I see, but these are SSD-only OSDs.

> BTW wondering if you have one for your OSDs? How large if so?
>
 The cluster consists out of 96 OSDs with Samsung PM863a 1.92TB OSDs.

 The fullest OSD currently is 78% full, which is 348GiB free on the
 1.75TiB device.

 Does this information help?
>>> Yeah, thanks for sharing.
>> Let me know if you need more!
> Additionally wondering if you know how much data (on average) has been
> written to these OSDs in total. I mean an aggregation over all writes
> (even ones that have already been removed), not the current usage.
> Use patterns, object sizes etc. would be interesting as well.
>

These OSDs were migrated from FileStore to BlueStore in March 2018.

Using blkdiscard the SSDs (Samsung PM863a 1.92TB) were wiped and then
deployed with BlueStore.

They run:

- RBD (couple of pools)
- RGW indexes

The actual RGW data is on HDD OSDs in the same cluster.

The whole cluster does a steady 20k IOps (R+W) during the day I would
say. Roughly 200MB/sec write and 100MB/sec read.

Suddenly OSDs started to grind to a halt last week and have been
flapping and showing slow requests ever since.

The RGW indexes have been offloaded to the HDD nodes in the meantime.

Wido

>>>
>> Wido
>>
 Thanks!

 Wido

> Everything above is "IMO", some chances that I missed something..
>
>
> Thanks,
>
> Igor
>
>
> On 10/15/2018 10:12 PM, Wido den Hollander wrote:
>> On 10/15/2018 08:23 PM, Gregory Farnum wrote:
>>> I don't know anything about the BlueStore code, but given the
>>> snippets
>>> you've posted this appears to be a debug thing that doesn't expect
>>> to be
>>> invoked (or perhaps only in an unexpected case that it's trying
>>> hard to
>>> recover from). Have you checked where the dump() function is invoked
>>> from? I'd imagine it's something about having to try extra-hard to
>>> allocate free space or 

Re: [ceph-users] OSD log being spammed with BlueStore stupidallocator dump

2018-10-16 Thread Igor Fedotov



On 10/16/2018 6:57 AM, Wido den Hollander wrote:


On 10/16/2018 12:04 AM, Igor Fedotov wrote:

On 10/15/2018 11:47 PM, Wido den Hollander wrote:

Hi,

On 10/15/2018 10:43 PM, Igor Fedotov wrote:

Hi Wido,

once you apply the PR you'll probably see the initial error in the log
that triggers the dump. Which is most probably the lack of space
reported by _balance_bluefs_freespace() function. If so this means that
BlueFS rebalance is unable to allocate contiguous 1M chunk at main
device to gift to BlueFS. I.e. your main device space is very
fragmented.

Unfortunately I don't know any ways to recover from this state but OSD
redeployment or data removal.


We are moving data away from these OSDs. Lucky us that we have HDD OSDs
in there as well, moving a lot of data there.

How would re-deployment work? Just wipe the OSDs and bring them back
into the cluster again? That would be a very painful task.. :-(

Good chances that you'll face the same issue again one day.
May be allocate some SSDs to serve as DB devices?

Maybe, but this is a very common use-case where people run WAL+DB+DATA
on a single SSD.
Yeah, but I'd consider that a workaround until a better solution is 
provided.
Before your case this high fragmentation issue had been rather a 
theoretical concern with a very small probability.



Now we are running into it, but aren't the chances big other people will
run into it as well?


The upcoming PR that brings the ability for offline BlueFS volume
manipulation (https://github.com/ceph/ceph/pull/23103) will probably
help to recover from this issue in the future by migrating BlueFS data to a
new, larger DB volume (targeted for Nautilus, not sure about backporting
to Mimic or Luminous).

For now the only preventive measure I can suggest to avoid this case is to have
large enough space on your standalone DB volume, so that the master device
isn't used for DB at all, or as little as possible. Hence no rebalance
is needed and no fragmentation is present.


I see, but these are SSD-only OSDs.


BTW wondering if you have one for your OSDs? How large if so?


The cluster consists out of 96 OSDs with Samsung PM863a 1.92TB OSDs.

The fullest OSD currently is 78% full, which is 348GiB free on the
1.75TiB device.

Does this information help?

Yeah, thanks for sharing.

Let me know if you need more!
Additionally wondering if you know how much data (on average) has been 
written to these OSDs in total. I mean an aggregation over all writes 
(even ones that have already been removed), not the current usage.

Use patterns, object sizes etc. would be interesting as well.
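
If it helps, one rough way to approximate lifetime writes on such drives is
the SSD's own SMART counters rather than anything Ceph tracks (a sketch;
attribute names vary per vendor):

smartctl -A /dev/sdX | grep -i -E 'total.*(written|lba)'
# on many Samsung SATA SSDs this is attribute 241 (Total_LBAs_Written);
# multiply the raw value by the logical block size to get bytes.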




Wido


Thanks!

Wido


Everything above is "IMO", some chances that I missed something..


Thanks,

Igor


On 10/15/2018 10:12 PM, Wido den Hollander wrote:

On 10/15/2018 08:23 PM, Gregory Farnum wrote:

I don't know anything about the BlueStore code, but given the snippets
you've posted this appears to be a debug thing that doesn't expect
to be
invoked (or perhaps only in an unexpected case that it's trying
hard to
recover from). Have you checked where the dump() function is invoked
from? I'd imagine it's something about having to try extra-hard to
allocate free space or something.

It seems BlueFS that is having a hard time finding free space.

I'm trying this PR now: https://github.com/ceph/ceph/pull/24543

It will stop the spamming, but that's not the root cause. The OSDs in
this case are at max 80% full and they do have a lot of OMAP (RGW
indexes) in them, but that's all.

I'm however not sure why this is happening suddenly in this cluster.

Wido


-Greg

On Mon, Oct 15, 2018 at 10:02 AM Wido den Hollander  wrote:



   On 10/11/2018 12:08 AM, Wido den Hollander wrote:
   > Hi,
   >
   > On a Luminous cluster running a mix of 12.2.4, 12.2.5 and
12.2.8 I'm
   > seeing OSDs writing heavily to their logfiles spitting out
these
   lines:
   >
   >
   > 2018-10-10 21:52:04.019037 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2078000~34000
   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd22cc000~24000
   > 2018-10-10 21:52:04.019038 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd230~2
   > 2018-10-10 21:52:04.019039 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2324000~24000
   > 2018-10-10 21:52:04.019040 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd26c~24000
   > 2018-10-10 21:52:04.019041 7f90c2f0f700  0 stupidalloc
   0x0x55828ae047d0
   > dump  0x15cd2704000~3
   >
   > It goes so fast that the OS-disk in this case can't keep up
and become
   > 100% util.
   >
   > This causes the OSD to slow down and cause slow requests and
   starts to flap.
   >

   I've set 'log_file' to /dev/null for now, but that doesn't
solve it
   either. Randomly OSDs just start spitting out 

[ceph-users] how can i config pg_num

2018-10-16 Thread xiang . dai
I installed a Ceph cluster with 8 osds, 3 pools and 1 replica (as 
osd_pool_default_size) across 2 machines. 

I followed the formula in 
http://docs.ceph.com/docs/mimic/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
to calculate pg_num. 
That gave a total pg_num of 192, so I set each pool to 64. 

I get the warning below: 
$ ceph -s 
cluster: id: fd64d9e4-e33b-4e1c-927c-c0bb56d072cf 
health: HEALTH_WARN too few PGs per OSD (24 < min 30) 

Then I changed osd_pool_default_size to 2 and the warning went away, which 
makes me confused. 
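
For reference, a sketch of the arithmetic behind that warning, assuming the
default mon_pg_warn_min_per_osd of 30 (the monitor counts PG replicas per OSD):

size 1:  192 PGs * 1 replica  / 8 OSDs = 24 PG replicas per OSD  ->  24 < 30, warning
size 2:  192 PGs * 2 replicas / 8 OSDs = 48 PG replicas per OSD  ->  48 >= 30, no warning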

I read the docs again and have the questions below: 

1. The doc says to set pg_num to 512 for between 5 and 10 OSDs; is that the total pg num? 
If so, with 2 replicas the PGs per osd would be too low. 
If not, it means pg num per pool, and with more pools the PGs per osd would be too high. 

2. How is the minimum of 30 calculated? 

3. Why does only changing the replication size make the warning go away? That does not seem 
to be covered by the formula. 

4. The formula does not consider the number of pools, only the replication size and osd 
count. 
So for more pools, the formula needs to divide by the pool count too, right? 

5. In http://docs.ceph.com/docs/mimic/rados/configuration/pool-pg-config-ref/, 
it says 250 is the default. 
This number is not a power of 2, so why is it set like that? Is that right? 

If I set osd_pool_default_size to 2, does that mean I need to set 
osd_pool_default_min_size to 1? 
If so, when osd_pool_default_size is 1, should osd_pool_default_min_size be 
zero? 
If not, then for 2 machines: 
1) setting osd_pool_default_size to 2 is meaningless, but it solves the ceph status 
warning; 
2) or should I set osd_pool_default_size and osd_pool_default_min_size both to 1? 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous with osd flapping, slow requests when deep scrubbing

2018-10-16 Thread Christian Balzer


Hello,

On Mon, 15 Oct 2018 12:26:50 +0100 (BST) Andrei Mikhailovsky wrote:

> Hello, 
> 
> I am currently running Luminous 12.2.8 on Ubuntu with 4.15.0-36-generic 
> kernel from the official ubuntu repo. The cluster has 4 mon + osd servers. 
> Each osd server has a total of 9 spinning osds and 1 ssd for the hdd and 
> ssd pools. The hdds are backed by the S3710 ssds for journaling with a ratio 
> of 1:5. The ssd pool osds are not using external journals. Ceph is used as a 
> Primary storage for Cloudstack - all vm disk images are stored on the 
> cluster. 
>

For the record, are you seeing the flapping only on HDD pools or with SSD
pools as well?

When migrating to Bluestore, did you see this starting to happen before
the migration was complete (and just on Bluestore OSDs of course)?

What's your HW like, in particular RAM? Current output of "free"?

If you didn't tune your bluestore cache you're likely just using a
fraction of the RAM for caching, making things a LOT harder for OSDs to
keep up when compared to filestore and the global (per node) page cache.

See the various bluestore cache threads here, one quite recently.

If your cluster was close to the brink with filestore just moving it to
bluestore would nicely fit into what you're seeing, especially for the
high stress and cache bypassing bluestore deep scrubbing.

Regards,

Christian
> I have recently migrated all osds to the bluestore, which was a long process 
> with ups and downs, but I am happy to say that the migration is done. During 
> the migration I've disabled the scrubbing (both deep and standard). After 
> reenabling the scrubbing I have noticed the cluster started having a large 
> number of slow requests and poor client IO (to the point of vms stalling for 
> minutes). Further investigation showed that the slow requests happen because 
> of the osds flapping. In a single day my logs have over 1000 entries which 
> report osd going down. This affects random osds. Disabling deep-scrubbing 
> stabilises the cluster and the osds no longer flap and the slow requests 
> disappear. As a short-term solution I've disabled the deep-scrubbing, but was 
> hoping to fix the issues with your help. 
> 
> At the moment, I am running the cluster with default settings apart from the 
> following settings: 
> 
> [global] 
> osd_disk_thread_ioprio_priority = 7 
> osd_disk_thread_ioprio_class = idle 
> osd_recovery_op_priority = 1 
> 
> [osd] 
> debug_ms = 0 
> debug_auth = 0 
> debug_osd = 0 
> debug_bluestore = 0 
> debug_bluefs = 0 
> debug_bdev = 0 
> debug_rocksdb = 0 
> 
> 
> Could you share experiences with deep scrubbing of bluestore osds? Are there 
> any options that I should set to make sure the osds are not flapping and the 
> client IO is still available? 
> 
> Thanks 
> 
> Andrei 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph pg/pgp number calculation

2018-10-16 Thread Zhenshi Zhou
Hi,

I have a cluster that has been serving rbd and cephfs storage for a period of
time. I added rgw to the cluster yesterday and want it to serve
object storage. Everything seems good.

What I'm confused about is how to calculate the pg/pgp number. As we
all know, the formula for calculating PGs is:

Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) /
pool_count

Before I created rgw, the cluster had 3 pools (rbd, cephfs_data,
cephfs_meta).
But now it has 8 pools, including the ones the object service may use:
'.rgw.root',
'default.rgw.control', 'default.rgw.meta', 'default.rgw.log' and
'default.rgw.buckets.index'.

Should I recalculate the pg number using the new pool count of 8, or should I
continue to use the old pg number?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com