Re: [ceph-users] Ceph pg repair clone_missing?

2019-10-02 Thread Brad Hubbard
On Wed, Oct 2, 2019 at 9:00 PM Marc Roos  wrote:
>
>
>
> Hi Brad,
>
> I was following the thread where you advised on this pg repair.
>
> I ran 'rados list-inconsistent-obj' / 'rados list-inconsistent-snapset'
> and got output for the snapset. I tried to extrapolate your comment on
> the data/omap_digest_mismatch_info onto my situation, but I don't know
> how to proceed. I was advised on this mailing list to delete snapshot 4,
> but looking at this output, that might not have been the smartest thing
> to do.

That remains to be seen. Can you post the actual scrub error you are getting?

>
>
>
>
> [0]
> http://tracker.ceph.com/issues/24994

At first glance this appears to be a different issue to yours.

>
> [1]
> {
>   "epoch": 66082,
>   "inconsistents": [
> {
>   "name": "rbd_data.1f114174b0dc51.0974",

rbd_data.1f114174b0dc51 is the block_name_prefix for this image. You
can run 'rbd info' on the images in this pool to see which image is
actually affected and how important the data is.
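
Something along these lines can do the mapping for you (a rough, untested
sketch; "rbd" is a placeholder for your actual pool name):

POOL=rbd
# loop over all images in the pool and compare their block_name_prefix
for img in $(rbd ls "$POOL"); do
    prefix=$(rbd info "$POOL/$img" | awk '/block_name_prefix/ {print $2}')
    [ "$prefix" = "rbd_data.1f114174b0dc51" ] && echo "affected image: $POOL/$img"
done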

>   "nspace": "",
>   "locator": "",
>   "snap": "head",
>   "snapset": {
> "snap_context": {
>   "seq": 63,
>   "snaps": [
> 63,
> 35,
> 13,
> 4
>   ]
> },
> "head_exists": 1,
> "clones": [
>   {
> "snap": 4,
> "size": 4194304,
> "overlap": "[]",
> "snaps": [
>   4
> ]
>   },
>   {
> "snap": 63,
> "size": 4194304,
> "overlap": "[0~4194304]",
> "snaps": [
>   63,
>   35,
>   13
> ]
>   }
> ]
>   },
>   "errors": [
> "clone_missing"
>   ],
>   "missing": [
> 4
>   ]
> }
>   ]
> }



--
Cheers,
Brad



Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Robin H. Johnson
On Wed, Oct 02, 2019 at 01:48:40PM +0200, Christian Pedersen wrote:
> Hi Martin,
> 
> Even before adding cold storage on HDD, I had the cluster with SSD only. That 
> also could not keep up with deleting the files.
> I am no where near I/O exhaustion on the SSDs or even the HDDs.
Please see my presentation from Cephalocon 2019 about RGW S3, where I
touch on slowness in lifecycle processing and deletion.

The efficiency of the code is very low: it requires a full scan of
the bucket index every single day. Depending on the traversal order
(unordered listing helps), this might mean it takes a very long time to
find the items that can be deleted, and even when it gets to them, it's
bound by the deletion time, which is also slow (the head of each object
is deleted synchronously in many cases, while the tails are
garbage-collected asynchronously).

Fixing this isn't trivial: either you have to scan the entire bucket, or
you have to maintain a secondary index in insertion-order for EACH
prefix in a lifecycle policy.
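
If you want to see where lifecycle processing currently stands, the per-shard
state is visible via radosgw-admin (a sketch; exact status strings vary a bit
by release):

# show the lifecycle shards and their status (e.g. UNINITIAL/PROCESSING/COMPLETE)
radosgw-admin lc list
# kick off a lifecycle pass by hand, outside the normal work window
radosgw-admin lc process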

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136




Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Mike Christie
On 10/02/2019 02:15 PM, Kilian Ries wrote:
> Ok i just compared my local python files and the git commit you sent me
> - it really looks like i have the old files installed. All the changes
> are missing in my local files.
> 
> 
> 
> Where can i get a new ceph-iscsi-config package that has the fixe
> included? I have installed version:

They are on shaman only right now:

https://4.chacra.ceph.com/r/ceph-iscsi-config/master/24deeb206ed2354d44b0f33d7d26d475e1014f76/centos/7/flavors/default/noarch/

https://4.chacra.ceph.com/r/ceph-iscsi-cli/master/4802654a6963df6bf5f4a968782cfabfae835067/centos/7/flavors/default/noarch/

The shaman rpms above have one bug we just fixed in ceph-iscsi-config
where, if DNS is not set up correctly, gwcli commands can take minutes.

I am going to try and get download.ceph.com updated.
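
Until then, a quick way to check that you won't hit that DNS issue is to make
sure every gateway hostname resolves both forward and reverse (a sketch; the
hostnames are just examples):

for h in ceph-iscsi-gw1 ceph-iscsi-gw2; do
    getent hosts "$h"                          # forward lookup
    ip=$(getent hosts "$h" | awk '{print $1}')
    getent hosts "$ip"                         # reverse lookup
done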

> 
> ceph-iscsi-config-2.6-2.6.el7.noarch
> 
> *From:* ceph-users  on behalf of
> Kilian Ries 
> *Sent:* Wednesday, October 2, 2019 21:04:45
> *To:* dilla...@redhat.com
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image
> size
>  
> 
> Yes, i created all four luns with these sizes:
> 
> 
> lun0 - 5120G
> 
> lun1 - 5121G
> 
> lun2 - 5122G
> 
> lun3 - 5123G
> 
> 
> Its always one GB more per LUN... Is there any newer ceph-iscsi-config
> package than i have installed?
> 
> 
> ceph-iscsi-config-2.6-2.6.el7.noarch
> 
> 
> Then i could try to update the package and see if the error is fixed ...
> 
> 
> *From:* Jason Dillaman 
> *Sent:* Wednesday, October 2, 2019 16:00:03
> *To:* Kilian Ries
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image
> size
>  
> On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries  wrote:
>>
>> Hi,
>>
>>
>> i'm running a ceph mimic cluster with 4x ISCSI gateway nodes. Cluster was 
>> setup via ceph-ansible v3.2-stable. I just checked my nodes and saw that 
>> only two of the four configured iscsi gw nodes are working correct. I first 
>> noticed via gwcli:
>>
>>
>> ###
>>
>>
>> $gwcli -d ls
>>
>> Traceback (most recent call last):
>>
>>   File "/usr/bin/gwcli", line 191, in 
>>
>> main()
>>
>>   File "/usr/bin/gwcli", line 103, in main
>>
>> root_node.refresh()
>>
>>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in 
>> refresh
>>
>> raise GatewayError
>>
>> gwcli.utils.GatewayError
>>
>>
>> ###
>>
>>
>> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" 
>> are not running. I were not able to restart them via systemd. I then found 
>> that even tcmu-runner is not running and it exits with the following error:
>>
>>
>>
>> ###
>>
>>
>> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD 
>> image size 5498631880704. Requested new size 5497558138880.
>>
>>
>> ###
>>
>>
>> Now i have the situation that two nodes are running correct and two cant 
>> start tcmu-runner. I don't know where the image size mismatches are coming 
>> from - i haven't configured or resized any of the images.
>>
>>
>> Is there any chance to get my two iscsi gw nodes back working?
> 
> It sounds like you are potentially hitting [1]. The ceph-iscsi-config
> library thinks your image size is 5TiB but you actually have a 5121GiB
> (~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB
> larger than an even 5TiB?
> 
>>
>>
>>
>> The following packets are installed:
>>
>>
>> rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
>>
>>
>> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
>>
>> ceph-iscsi-cli-2.7-2.7.el7.noarch
>>
>> kernel-3.10.0-957.5.1.el7.x86_64
>>
>> ceph-base-13.2.5-0.el7.x86_64
>>
>> ceph-iscsi-config-2.6-2.6.el7.noarch
>>
>> ceph-common-13.2.5-0.el7.x86_64
>>
>> ceph-selinux-13.2.5-0.el7.x86_64
>>
>> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
>>
>> python-cephfs-13.2.5-0.el7.x86_64
>>
>> ceph-osd-13.2.5-0.el7.x86_64
>>
>> kernel-headers-3.10.0-957.5.1.el7.x86_64
>>
>> kernel-tools-3.10.0-957.5.1.el7.x86_64
>>
>> kernel-3.10.0-957.1.3.el7.x86_64
>>
>> libcephfs2-13.2.5-0.el7.x86_64
>>
>> kernel-3.10.0-862.14.4.el7.x86_64
>>
>> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>>
>>
>>
>> Thanks,
>>
>> Greets
>>
>>
>> Kilian
>>
>>
> 
> [1] https://github.com/ceph/ceph-iscsi-config/pull/68
> 
> -- 
> Jason
> 
> 
> 


[ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-02 Thread Vladimir Brik

Hello

I am running a Ceph 14.2.2 cluster, and a few days ago the memory 
consumption of our OSDs started to grow unexpectedly on all 5 nodes, 
after being stable for about 6 months.


Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very 
light (typically <10 iops) during this period, and the number of objects 
stayed about the same.


The only unusual occurrence was the reboot of one of the nodes the day 
before (a firmware update). For the reboot, I ran "ceph osd set noout", 
but forgot to unset it until several days later. Unsetting noout did not 
stop the increase in memory consumption.


I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 
3.7GB. The resident set size of HDD OSDs varies from about 5GB to 12GB. I 
don't know why there is such a big spread. All HDDs are 10TB, 72-76% 
utilized, with 101-104 PGs each.


Does anybody know what might be the problem here and how to address or 
debug it?
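
If specific numbers would help, this is the kind of thing I can pull from one
of the affected OSDs (a sketch; osd.12 is just an example ID, and the 'ceph
daemon' commands need to run on that OSD's host):

ceph daemon osd.12 config get osd_memory_target   # what the OSD thinks its target is
ceph daemon osd.12 dump_mempools                  # bluestore caches, pg_log, osdmaps, ...
ceph tell osd.12 heap stats                       # tcmalloc heap vs. resident size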



Thanks very much,

Vlad


Re: [ceph-users] Local Device Health PG inconsistent

2019-10-02 Thread Reed Dier
And now, to close the loop.

Sadly my solution was to run
> $ ceph pg repair 33.0
which returned
> 2019-10-02 15:38:54.499318 osd.12 (osd.12) 181 : cluster [DBG] 33.0 repair 
> starts
> 2019-10-02 15:38:55.502606 osd.12 (osd.12) 182 : cluster [ERR] 33.0 repair : 
> stat mismatch, got 264/265 objects, 0/0 clones, 264/265 dirty, 264/265 omap, 
> 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest 
> objects, 0/0 hit_set_archive bytes.
> 2019-10-02 15:38:55.503066 osd.12 (osd.12) 183 : cluster [ERR] 33.0 repair 1 
> errors, 1 fixed
And now my cluster is happy once more.

So, in case anyone else runs into this issue and doesn't think to run pg 
repair on the PG in question: in this case, go for it.
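
For completeness, the whole sequence boiled down to (nothing here beyond what
is already in this thread):

ceph health detail                                  # shows the inconsistent PG
rados list-inconsistent-pg device_health_metrics    # confirms it is 33.0
rados list-inconsistent-obj 33.0 | jq               # empty list -> stat mismatch only
ceph pg repair 33.0                                 # primary rebuilds the PG stats
ceph -s                                             # back to HEALTH_OK once the repair finishes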

Reed

> On Sep 23, 2019, at 9:07 AM, Reed Dier  wrote:
> 
> And to come full circle,
> 
> After this whole saga, I now have a scrub error on the new device health 
> metrics pool/PG in what looks to be the exact same way.
> So I am at a loss for what ever it is that I am doing incorrectly, as a scrub 
> error obviously makes the monitoring suite very happy.
> 
>> $ ceph health detail
> 
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 33.0 is active+clean+inconsistent, acting [12,138,15]
>> $ rados list-inconsistent-pg device_health_metrics
>> ["33.0"]
>> $ rados list-inconsistent-obj 33.0 | jq
>> {
>>   "epoch": 176348,
>>   "inconsistents": []
>> }
> 
> I assume that this is the root cause:
>> ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster 
>> [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0 
>> 33:b08b92bdhead [omap-set-vals] snapc 0=[] 
>> ondisk+write+known_if_redirected e176327) v8
>> ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG] 
>> 33.0 scrub starts
>> ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR] 
>> 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 
>> 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 
>> manifest objects, 0/0 hit_set_archive bytes.
>> ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR] 
>> 33.0 scrub 1 errors
> 
> Nothing fancy set for the plugin:
>> $ ceph config dump | grep device
>> global  basicdevice_failure_prediction_mode local
>>   mgr   advanced mgr/devicehealth/enable_monitoring true
> 
> 
> Reed
> 
>> On Sep 18, 2019, at 11:33 AM, Reed Dier > > wrote:
>> 
>> And to provide some further updates,
>> 
>> I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
>> Unclear why this would improve things, but it at least got me running again.
>> 
>>> $ ceph versions
>>> {
>>> "mon": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 3
>>> },
>>> "mgr": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 3
>>> },
>>> "osd": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 199,
>>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>>> nautilus (stable)": 5
>>> },
>>> "mds": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 1
>>> },
>>> "overall": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 206,
>>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>>> nautilus (stable)": 5
>>> }
>>> }
>> 
>> 
>> Reed
>> 
>>> On Sep 18, 2019, at 10:12 AM, Reed Dier >> > wrote:
>>> 
>>> To answer the question, if it is safe to disable the module and delete the 
>>> pool, the answer is no.
>>> 
>>> After disabling the diskprediction_local module, I then proceeded to remove 
>>> the pool created by the module, device_health_metrics.
>>> 
>>> This is where things went south quickly,
>>> 
>>> Ceph health showed: 
 Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
 oid SAMSUNG_$MODEL_$SERIAL
>>> 
>>> That module apparently can't be disabled:
 $ ceph mgr module disable devicehealth
 Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
>>> 
>>> Then 5 osd's went down, crashing with:
-12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
 pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
 local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
 les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 
 lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 
 peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
-11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 
 ms_handle_reset con 0x564078474d00 session 0x56407878ea00
   

Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
OK, I just compared my local Python files with the git commit you sent me - it 
really looks like I have the old files installed. All the changes are missing 
from my local files.



Where can I get a new ceph-iscsi-config package that has the fix included? I 
have this version installed:

ceph-iscsi-config-2.6-2.6.el7.noarch

From: ceph-users  on behalf of Kilian Ries 

Sent: Wednesday, October 2, 2019 21:04:45
To: dilla...@redhat.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size


Yes, I created all four LUNs with these sizes:


lun0 - 5120G

lun1 - 5121G

lun2 - 5122G

lun3 - 5123G


It's always one GB more per LUN... Is there a newer ceph-iscsi-config package 
than the one I have installed?


ceph-iscsi-config-2.6-2.6.el7.noarch


Then I could try to update the package and see if the error is fixed...


From: Jason Dillaman 
Sent: Wednesday, October 2, 2019 16:00:03
To: Kilian Ries
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries  wrote:
>
> Hi,
>
>
> i'm running a ceph mimic cluster with 4x ISCSI gateway nodes. Cluster was 
> setup via ceph-ansible v3.2-stable. I just checked my nodes and saw that only 
> two of the four configured iscsi gw nodes are working correct. I first 
> noticed via gwcli:
>
>
> ###
>
>
> $gwcli -d ls
>
> Traceback (most recent call last):
>
>   File "/usr/bin/gwcli", line 191, in 
>
> main()
>
>   File "/usr/bin/gwcli", line 103, in main
>
> root_node.refresh()
>
>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in 
> refresh
>
> raise GatewayError
>
> gwcli.utils.GatewayError
>
>
> ###
>
>
> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" are 
> not running. I were not able to restart them via systemd. I then found that 
> even tcmu-runner is not running and it exits with the following error:
>
>
>
> ###
>
>
> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD 
> image size 5498631880704. Requested new size 5497558138880.
>
>
> ###
>
>
> Now i have the situation that two nodes are running correct and two cant 
> start tcmu-runner. I don't know where the image size mismatches are coming 
> from - i haven't configured or resized any of the images.
>
>
> Is there any chance to get my two iscsi gw nodes back working?

It sounds like you are potentially hitting [1]. The ceph-iscsi-config
library thinks your image size is 5TiB but you actually have a 5121GiB
(~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB
larger than an even 5TiB?

>
>
>
> The following packets are installed:
>
>
> rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
>
>
> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
>
> ceph-iscsi-cli-2.7-2.7.el7.noarch
>
> kernel-3.10.0-957.5.1.el7.x86_64
>
> ceph-base-13.2.5-0.el7.x86_64
>
> ceph-iscsi-config-2.6-2.6.el7.noarch
>
> ceph-common-13.2.5-0.el7.x86_64
>
> ceph-selinux-13.2.5-0.el7.x86_64
>
> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
>
> python-cephfs-13.2.5-0.el7.x86_64
>
> ceph-osd-13.2.5-0.el7.x86_64
>
> kernel-headers-3.10.0-957.5.1.el7.x86_64
>
> kernel-tools-3.10.0-957.5.1.el7.x86_64
>
> kernel-3.10.0-957.1.3.el7.x86_64
>
> libcephfs2-13.2.5-0.el7.x86_64
>
> kernel-3.10.0-862.14.4.el7.x86_64
>
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>
>
>
> Thanks,
>
> Greets
>
>
> Kilian
>
>

[1] https://github.com/ceph/ceph-iscsi-config/pull/68

--
Jason



Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
Yes, I created all four LUNs with these sizes:


lun0 - 5120G

lun1 - 5121G

lun2 - 5122G

lun3 - 5123G


It's always one GB more per LUN... Is there a newer ceph-iscsi-config package 
than the one I have installed?


ceph-iscsi-config-2.6-2.6.el7.noarch


Then I could try to update the package and see if the error is fixed...


From: Jason Dillaman 
Sent: Wednesday, October 2, 2019 16:00:03
To: Kilian Ries
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries  wrote:
>
> Hi,
>
>
> i'm running a ceph mimic cluster with 4x ISCSI gateway nodes. Cluster was 
> setup via ceph-ansible v3.2-stable. I just checked my nodes and saw that only 
> two of the four configured iscsi gw nodes are working correct. I first 
> noticed via gwcli:
>
>
> ###
>
>
> $gwcli -d ls
>
> Traceback (most recent call last):
>
>   File "/usr/bin/gwcli", line 191, in 
>
> main()
>
>   File "/usr/bin/gwcli", line 103, in main
>
> root_node.refresh()
>
>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in 
> refresh
>
> raise GatewayError
>
> gwcli.utils.GatewayError
>
>
> ###
>
>
> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" are 
> not running. I were not able to restart them via systemd. I then found that 
> even tcmu-runner is not running and it exits with the following error:
>
>
>
> ###
>
>
> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD 
> image size 5498631880704. Requested new size 5497558138880.
>
>
> ###
>
>
> Now i have the situation that two nodes are running correct and two cant 
> start tcmu-runner. I don't know where the image size mismatches are coming 
> from - i haven't configured or resized any of the images.
>
>
> Is there any chance to get my two iscsi gw nodes back working?

It sounds like you are potentially hitting [1]. The ceph-iscsi-config
library thinks your image size is 5TiB but you actually have a 5121GiB
(~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB
larger than an even 5TiB?

>
>
>
> The following packets are installed:
>
>
> rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
>
>
> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
>
> ceph-iscsi-cli-2.7-2.7.el7.noarch
>
> kernel-3.10.0-957.5.1.el7.x86_64
>
> ceph-base-13.2.5-0.el7.x86_64
>
> ceph-iscsi-config-2.6-2.6.el7.noarch
>
> ceph-common-13.2.5-0.el7.x86_64
>
> ceph-selinux-13.2.5-0.el7.x86_64
>
> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
>
> python-cephfs-13.2.5-0.el7.x86_64
>
> ceph-osd-13.2.5-0.el7.x86_64
>
> kernel-headers-3.10.0-957.5.1.el7.x86_64
>
> kernel-tools-3.10.0-957.5.1.el7.x86_64
>
> kernel-3.10.0-957.1.3.el7.x86_64
>
> libcephfs2-13.2.5-0.el7.x86_64
>
> kernel-3.10.0-862.14.4.el7.x86_64
>
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>
>
>
> Thanks,
>
> Greets
>
>
> Kilian
>
>

[1] https://github.com/ceph/ceph-iscsi-config/pull/68

--
Jason



[ceph-users] MDS Stability with lots of CAPS

2019-10-02 Thread Stefan Kooman
Hi,

According to [1] there are new parameters in place to make the MDS
behave more stably. Quoting that blog post: "One of the more recent
issues we've discovered is that an MDS with a very large cache (64+GB)
will hang during certain recovery events."

For all of us who are not (yet) running Nautilus, I wonder what the best
course of action is to prevent an unstable MDS during recovery situations.

Artificially limit the "mds_cache_memory_limit" to say 32 GB?

I wonder whether the number of clients influences an MDS being
overwhelmed by release messages, or whether a handful of clients (with
millions of caps) is enough to overload an MDS.

Is there a way, other than unmounting CephFS on the clients, to decrease
the number of caps the MDS has handed out before undertaking an upgrade
to a newer Ceph release when running Luminous / Mimic?

I'm assuming you need to restart the MDS to make the
"mds_cache_memory_limit" effective, is that correct?

Gr. Stefan

[1]: https://ceph.com/community/nautilus-cephfs/


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Jason Dillaman
On Wed, Oct 2, 2019 at 9:50 AM Kilian Ries  wrote:
>
> Hi,
>
>
> i'm running a ceph mimic cluster with 4x ISCSI gateway nodes. Cluster was 
> setup via ceph-ansible v3.2-stable. I just checked my nodes and saw that only 
> two of the four configured iscsi gw nodes are working correct. I first 
> noticed via gwcli:
>
>
> ###
>
>
> $gwcli -d ls
>
> Traceback (most recent call last):
>
>   File "/usr/bin/gwcli", line 191, in 
>
> main()
>
>   File "/usr/bin/gwcli", line 103, in main
>
> root_node.refresh()
>
>   File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in 
> refresh
>
> raise GatewayError
>
> gwcli.utils.GatewayError
>
>
> ###
>
>
> I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" are 
> not running. I were not able to restart them via systemd. I then found that 
> even tcmu-runner is not running and it exits with the following error:
>
>
>
> ###
>
>
> tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD 
> image size 5498631880704. Requested new size 5497558138880.
>
>
> ###
>
>
> Now i have the situation that two nodes are running correct and two cant 
> start tcmu-runner. I don't know where the image size mismatches are coming 
> from - i haven't configured or resized any of the images.
>
>
> Is there any chance to get my two iscsi gw nodes back working?

It sounds like you are potentially hitting [1]. The ceph-iscsi-config
library thinks your image size is 5TiB but you actually have a 5121GiB
(~5.001TiB) RBD image. Any clue how your RBD image got to be 1GiB
larger than an even 5TiB?
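
For reference, the arithmetic behind those two numbers, plus a quick way to
check the live image (the pool/image name is taken from your tcmu-runner
error):

echo $(( 5497558138880 / 1024 / 1024 / 1024 ))   # 5120 GiB -> the size ceph-iscsi-config requested
echo $(( 5498631880704 / 1024 / 1024 / 1024 ))   # 5121 GiB -> the actual RBD image size
rbd info rbd/production.lun1 | grep size         # confirm the current size of the image
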

>
>
>
> The following packets are installed:
>
>
> rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"
>
>
> libtcmu-1.4.0-106.gd17d24e.el7.x86_64
>
> ceph-iscsi-cli-2.7-2.7.el7.noarch
>
> kernel-3.10.0-957.5.1.el7.x86_64
>
> ceph-base-13.2.5-0.el7.x86_64
>
> ceph-iscsi-config-2.6-2.6.el7.noarch
>
> ceph-common-13.2.5-0.el7.x86_64
>
> ceph-selinux-13.2.5-0.el7.x86_64
>
> kernel-tools-libs-3.10.0-957.5.1.el7.x86_64
>
> python-cephfs-13.2.5-0.el7.x86_64
>
> ceph-osd-13.2.5-0.el7.x86_64
>
> kernel-headers-3.10.0-957.5.1.el7.x86_64
>
> kernel-tools-3.10.0-957.5.1.el7.x86_64
>
> kernel-3.10.0-957.1.3.el7.x86_64
>
> libcephfs2-13.2.5-0.el7.x86_64
>
> kernel-3.10.0-862.14.4.el7.x86_64
>
> tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64
>
>
>
> Thanks,
>
> Greets
>
>
> Kilian
>
>

[1] https://github.com/ceph/ceph-iscsi-config/pull/68

-- 
Jason



[ceph-users] tcmu-runner: mismatched sizes for rbd image size

2019-10-02 Thread Kilian Ries
Hi,


I'm running a Ceph Mimic cluster with 4 iSCSI gateway nodes. The cluster was 
set up via ceph-ansible v3.2-stable. I just checked my nodes and saw that only 
two of the four configured iSCSI gateway nodes are working correctly. I first 
noticed it via gwcli:


###


$gwcli -d ls

Traceback (most recent call last):

  File "/usr/bin/gwcli", line 191, in 

main()

  File "/usr/bin/gwcli", line 103, in main

root_node.refresh()

  File "/usr/lib/python2.7/site-packages/gwcli/gateway.py", line 87, in refresh

raise GatewayError

gwcli.utils.GatewayError


###


I investigated and noticed that both "rbd-target-api" and "rbd-target-gw" are 
not running. I was not able to restart them via systemd. I then found that 
even tcmu-runner is not running and that it exits with the following error:



###


tcmu_rbd_check_image_size:827 rbd/production.lun1: Mismatched sizes. RBD image 
size 5498631880704. Requested new size 5497558138880.


###


Now I have a situation where two nodes are running correctly and two can't 
start tcmu-runner. I don't know where the image size mismatch is coming from - 
I haven't configured or resized any of the images.


Is there any chance to get my two iscsi gw nodes back working?



The following packets are installed:


rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel"


libtcmu-1.4.0-106.gd17d24e.el7.x86_64

ceph-iscsi-cli-2.7-2.7.el7.noarch

kernel-3.10.0-957.5.1.el7.x86_64

ceph-base-13.2.5-0.el7.x86_64

ceph-iscsi-config-2.6-2.6.el7.noarch

ceph-common-13.2.5-0.el7.x86_64

ceph-selinux-13.2.5-0.el7.x86_64

kernel-tools-libs-3.10.0-957.5.1.el7.x86_64

python-cephfs-13.2.5-0.el7.x86_64

ceph-osd-13.2.5-0.el7.x86_64

kernel-headers-3.10.0-957.5.1.el7.x86_64

kernel-tools-3.10.0-957.5.1.el7.x86_64

kernel-3.10.0-957.1.3.el7.x86_64

libcephfs2-13.2.5-0.el7.x86_64

kernel-3.10.0-862.14.4.el7.x86_64

tcmu-runner-1.4.0-106.gd17d24e.el7.x86_64



Thanks,

Greets


Kilian



Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Christian Pedersen
Hi Martin,

Even before adding cold storage on HDD, I had the cluster on SSDs only. That 
also could not keep up with deleting the files.
I am nowhere near I/O exhaustion on the SSDs or even the HDDs.

Cheers,
Christian

On Oct 2 2019, at 1:23 pm, Martin Verges  wrote:
> Hello Christian,
>
> the problem is, that HDD is not capable of providing lots of IOs required for 
> "~4 million small files".
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io (mailto:martin.ver...@croit.io)
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
>
>
> Am Mi., 2. Okt. 2019 um 11:56 Uhr schrieb Christian Pedersen 
> mailto:chrip...@gmail.com)>:
> > Hi,
> >
> > Using the S3 gateway I store ~4 million small files in my cluster every 
> > day. I have a lifecycle setup to move these files to cold storage after a 
> > day and delete them after two days.
> > The default storage is SSD based and the cold storage is HDD.
> > However the rgw lifecycle process cannot keep up with this. In a 24 hour 
> > period. A little less than a million files are moved per day ( 
> > https://imgur.com/a/H52hD2h ). I have tried only enabling the delete part 
> > of the lifecycle, but even though it deleted from SSD storage, the result 
> > is the same. The screenshots are taken while there is no incoming files to 
> > the cluster.
> > I'm running 5 rgw servers, but that doesn't really change anything from 
> > when I was running less. I've tried adjusting rgw lc max objs, but again no 
> > change in performance.
> > Any suggestions on how I can tune the lifecycle process?
> > Cheers,
> > Christian
> >
>
>
>



Re: [ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Martin Verges
Hello Christian,

the problem is that HDDs are not capable of providing the large number of IOs
required for "~4 million small files".

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


Am Mi., 2. Okt. 2019 um 11:56 Uhr schrieb Christian Pedersen <
chrip...@gmail.com>:

> Hi,
>
> Using the S3 gateway I store ~4 million small files in my cluster every
> day. I have a lifecycle setup to move these files to cold storage after a
> day and delete them after two days.
>
> The default storage is SSD based and the cold storage is HDD.
>
> However the rgw lifecycle process cannot keep up with this. In a 24 hour
> period. A little less than a million files are moved per day (
> https://imgur.com/a/H52hD2h ). I have tried only enabling the delete part
> of the lifecycle, but even though it deleted from SSD storage, the result
> is the same. The screenshots are taken while there is no incoming files to
> the cluster.
>
> I'm running 5 rgw servers, but that doesn't really change anything from
> when I was running less. I've tried adjusting rgw lc max objs, but again no
> change in performance.
>
> Any suggestions on how I can tune the lifecycle process?
>
> Cheers,
> Christian
>


[ceph-users] Ceph pg repair clone_missing?

2019-10-02 Thread Marc Roos


 
Hi Brad, 

I was following the thread where you advised on this pg repair.

I ran 'rados list-inconsistent-obj' / 'rados list-inconsistent-snapset' 
and got output for the snapset. I tried to extrapolate your comment on 
the data/omap_digest_mismatch_info onto my situation, but I don't know 
how to proceed. I was advised on this mailing list to delete snapshot 4, 
but looking at this output, that might not have been the smartest thing 
to do.




[0]
http://tracker.ceph.com/issues/24994

[1]
{
  "epoch": 66082,
  "inconsistents": [
{
  "name": "rbd_data.1f114174b0dc51.0974",
  "nspace": "",
  "locator": "",
  "snap": "head",
  "snapset": {
"snap_context": {
  "seq": 63,
  "snaps": [
63,
35,
13,
4
  ]
},
"head_exists": 1,
"clones": [
  {
"snap": 4,
"size": 4194304,
"overlap": "[]",
"snaps": [
  4
]
  },
  {
"snap": 63,
"size": 4194304,
"overlap": "[0~4194304]",
"snaps": [
  63,
  35,
  13
]
  }
]
  },
  "errors": [
"clone_missing"
  ],
  "missing": [
4
  ]
}
  ]
}


[ceph-users] rgw S3 lifecycle cannot keep up

2019-10-02 Thread Christian Pedersen
Hi,

Using the S3 gateway I store ~4 million small files in my cluster every
day. I have a lifecycle setup to move these files to cold storage after a
day and delete them after two days.

The default storage is SSD based and the cold storage is HDD.
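
For reference, a lifecycle rule that does this looks roughly like the
following (a sketch; the bucket name, endpoint and the "COLD" storage class
name are placeholders, not my actual values):

cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-then-expire",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 1, "StorageClass": "COLD"}],
      "Expiration": {"Days": 2}
    }
  ]
}
EOF
aws --endpoint-url http://rgw.example.com:7480 s3api \
    put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration file://lifecycle.json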

However, the rgw lifecycle process cannot keep up with this: a little less
than a million files are moved per day (
https://imgur.com/a/H52hD2h ). I have tried enabling only the delete part
of the lifecycle, but even though it deleted from SSD storage, the result
is the same. The screenshots were taken while there were no incoming files to
the cluster.

I'm running 5 rgw servers, but that doesn't really change anything compared
to when I was running fewer. I've tried adjusting rgw_lc_max_objs, but again
there was no change in performance.
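
In ceph.conf terms that looks roughly like this (example values only;
rgw_lifecycle_work_time is an additional knob that widens the daily window in
which lc processing runs, which I have not yet confirmed helps):

[client.rgw.myhost]                      # placeholder section name for the rgw instance
rgw lc max objs = 128                    # number of lifecycle shards (default 32)
rgw lifecycle work time = 00:00-23:59    # allow lc processing all day (default 00:00-06:00)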

Any suggestions on how I can tune the lifecycle process?

Cheers,
Christian


Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Stefan Kooman
> 
> I created this issue: https://tracker.ceph.com/issues/42116
> 
> Seems to be related to the 'crash' module not enabled.
> 
> If you enable the module the problem should be gone. Now I need to check
> why this message is popping up.

Yup, crash module enabled and error message is gone. Either way it
makes sense to enable the crash module anyway.

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Wido den Hollander



On 10/1/19 4:38 PM, Stefan Kooman wrote:
> Quoting Wido den Hollander (w...@42on.com):
>> Hi,
>>
>> The Telemetry [0] module has been in Ceph since the Mimic release and
>> when enabled it sends an anonymized JSON document back to
>> https://telemetry.ceph.com/ every 72 hours with information about the
>> cluster.
>>
>> For example:
>>
>> - Version(s)
>> - Number of MONs, OSDs, FS, RGW
>> - Operating System used
>> - CPUs used by MON and OSD
>>
>> Enabling the module is very simple:
>>
>> $ ceph mgr module enable telemetry
> 
> This worked.
> 
> ceph mgr module ls
> {
> "enabled_modules": [
> ...
> ...
> "telemetry"
> ],
> 
>> Before enabling the module you can also view the JSON document it will
>> send back:
>>
>> $ ceph telemetry show
> 
> This gives me:
> 
> ceph telemetry show
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 325, in handle_command
> report = self.compile_report()
>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 291, in compile_report
> report['crashes'] = self.gather_crashinfo()
>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 214, in gather_crashinfo
> errno, crashids, err = self.remote('crash', 'do_ls', '', '')
>   File "/usr/lib/ceph/mgr/mgr_module.py", line 845, in remote
> args, kwargs)
> ImportError: Module not found
> 
> Running 13.2.6 on Ubuntu Xenial 16.04.6 LTS

I created this issue: https://tracker.ceph.com/issues/42116

Seems to be related to the 'crash' module not enabled.

If you enable the module the problem should be gone. Now I need to check
why this message is popping up.
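
i.e. something like this (the module name is just 'crash'):

ceph mgr module enable crash
ceph telemetry show | head    # should now print the report instead of the traceback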

> 
> Gr. Stefan
> 


Re: [ceph-users] ceph-osd@n crash dumps

2019-10-02 Thread Del Monaco, Andrea
Hi Brad,

Apologies for the flow of messages - the previous messages went for approval 
because of their length.
Here you can see the requested output: https://pastebin.com/N8jG08sH

Regards,



Andrea Del Monaco
HPC Consultant – Big Data & Security
M: +31 612031174
Burgemeester Rijnderslaan 30 – 1185 MC Amstelveen – The Netherlands
atos.net




Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Wido den Hollander



On 10/1/19 5:11 PM, Mattia Belluco wrote:
> Hi all,
> 
> Same situation here:
> 
> Ceph 13.2.6 on Ubuntu 16.04.
> 

Thanks for the feedback both! I enabled it on a Ubuntu 18.04 with
Nautilus 14.2.4 system.

> Best
> Mattia
> 
> On 10/1/19 4:38 PM, Stefan Kooman wrote:
>> Quoting Wido den Hollander (w...@42on.com):
>>> Hi,
>>>
>>> The Telemetry [0] module has been in Ceph since the Mimic release and
>>> when enabled it sends an anonymized JSON document back to
>>> https://telemetry.ceph.com/ every 72 hours with information about the
>>> cluster.
>>>
>>> For example:
>>>
>>> - Version(s)
>>> - Number of MONs, OSDs, FS, RGW
>>> - Operating System used
>>> - CPUs used by MON and OSD
>>>
>>> Enabling the module is very simple:
>>>
>>> $ ceph mgr module enable telemetry
>>
>> This worked.
>>
>> ceph mgr module ls
>> {
>> "enabled_modules": [
>> ...
>> ...
>> "telemetry"
>> ],
>>
>>> Before enabling the module you can also view the JSON document it will
>>> send back:
>>>
>>> $ ceph telemetry show
>>
>> This gives me:
>>
>> ceph telemetry show
>> Error EINVAL: Traceback (most recent call last):
>>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 325, in handle_command
>> report = self.compile_report()
>>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 291, in compile_report
>> report['crashes'] = self.gather_crashinfo()
>>   File "/usr/lib/ceph/mgr/telemetry/module.py", line 214, in gather_crashinfo
>> errno, crashids, err = self.remote('crash', 'do_ls', '', '')
>>   File "/usr/lib/ceph/mgr/mgr_module.py", line 845, in remote
>> args, kwargs)
>> ImportError: Module not found
>>
>> Running 13.2.6 on Ubuntu Xenial 16.04.6 LTS
>>
>> Gr. Stefan
>>
> 