[ceph-users] PG status is "active+undersized+degraded"

2018-06-20 Thread Dave.Chen
Hi all,

I have set up a ceph cluster in my lab recently. The configuration should be 
okay per my understanding: 4 OSDs across 3 nodes with 3 replicas, but a couple 
of PGs are stuck in the state "active+undersized+degraded". I think this should 
be a fairly generic issue - could anyone help me out?

Here is the details about the ceph cluster,

$ ceph -v  (jewel)
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.89049 root default
-2 1.81360 host ceph3
2 1.81360 osd.2   up  1.0  1.0
-3 0.44969 host ceph4
3 0.44969 osd.3   up  1.0  1.0
-4 3.62720 host ceph1
0 1.81360 osd.0   up  1.0  1.0
1 1.81360 osd.1   up  1.0  1.0


# ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs 
stuck undersized; 2 pgs undersized
pg 17.58 is stuck unclean for 61033.947719, current state 
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck unclean for 61033.948201, current state 
active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck undersized for 61033.343824, current state 
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck undersized for 61033.327566, current state 
active+undersized+degraded, last acting [0,2]
pg 17.58 is stuck degraded for 61033.343835, current state 
active+undersized+degraded, last acting [2,0]
pg 17.16 is stuck degraded for 61033.327576, current state 
active+undersized+degraded, last acting [0,2]
pg 17.16 is active+undersized+degraded, acting [0,2]
pg 17.58 is active+undersized+degraded, acting [2,0]



# rados lspools
rbdbench


$ ceph osd pool get rbdbench size
size: 3



Where can I get more details about the issue? I would appreciate any comments!

Best Regards,
Dave Chen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hey Igor, the patch that you pointed to worked for me.
Thanks again.

From: ceph-users  On Behalf Of Igor Fedotov
Sent: 20 June 2018 21:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] issues with ceph nautilus version


Hi Raju,



This is a bug in BlueStore's new bitmap allocator.

This PR will most probably fix that:

https://github.com/ceph/ceph/pull/22610



Also you may try to switch bluestore and bluefs allocators (bluestore_allocator 
and bluefs_allocator parameters respectively) to stupid and restart OSDs.

This should help.



Thanks,

Igor

On 6/20/2018 6:41 PM, Raju Rangoju wrote:
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - 
nautilus (dev). After this, I noticed some weird data usage numbers on the 
cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much more than what is available

usage:   16 EiB used, 164 TiB / 158 TiB avail

Before this upgrade, it used to report correctly:

usage:   1.10T used, 157T / 158T avail

  2.  It reports that all the OSDs/pools are full

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj





___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-20 Thread David Turner
As a part of the repair operation it runs a deep-scrub on the PG.  If it
showed active+clean after the repair and deep-scrub finished, then the next
run of a scrub on the PG shouldn't change the PG status at all.

On Wed, Jun 6, 2018 at 8:57 PM Adrian  wrote:

> Update to this.
>
> The affected pg didn't seem inconsistent:
>
> [root@admin-ceph1-qh2 ~]# ceph health detail
>
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>pg 6.20 is active+clean+inconsistent, acting [114,26,44]
> [root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20
> --format=json-pretty
> {
>"epoch": 210034,
>"inconsistents": []
> }
>
> Although pg query showed the primary info.stats.stat_sum.num_bytes
> differed from the peers
>
> A pg repair on 6.20 seems to have resolved the issue for now but the
> info.stats.stat_sum.num_bytes still differs so presumably will become
> inconsistent again next time it scrubs.
>
> Adrian.
>
> On Tue, Jun 5, 2018 at 12:09 PM, Adrian  wrote:
>
>> Hi Cephers,
>>
>> We recently upgraded one of our clusters from hammer to jewel and then to
>> luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
>> deep-scubs we have an inconsistent pg with a log message we've not seen
>> before:
>>
>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 6.20 is active+clean+inconsistent, acting [114,26,44]
>>
>>
>> Ceph log shows
>>
>> 2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : 
>> cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 
>> 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
>> 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
>> 2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : 
>> cluster [ERR] 6.20 scrub 1 errors
>> 2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 
>> : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
>> 2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 
>> : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent 
>> (PG_DAMAGED)
>> 2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 
>> : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 
>> pg inconsistent
>>
>> There are no EC pools - looks like it may be the same as
>> https://tracker.ceph.com/issues/22656 although as in #7 this is not a
>> cache pool.
>>
>> Wondering if this is ok to issue a pg repair on 6.20 or if there's
>> something else we should be looking at first ?
>>
>> Thanks in advance,
>> Adrian.
>>
>> ---
>> Adrian : aussie...@gmail.com
>> If violence doesn't solve your problem, you're not using enough of it.
>>
>
>
>
> --
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw failover help

2018-06-20 Thread David Turner
We originally used pacemaker to move a VIP between our RGWs, but ultimately
decided to go with an LB in front of them.  With an LB you can utilize both
RGWs while they're up, but the LB will shy away from either if they're down
until the check starts succeeding for that host again.  We do have 2 LBs
with pacemaker, but the LBs are in charge of 3 prod RGW realms and 2
staging realms.  Moving to the LB with pacemaker simplified our setup quite
a bit for HA.
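
For anyone wanting to try the same approach, here's a minimal sketch of such
an LB config (haproxy is an assumption on my part, the post above only says
"an LB"; addresses are placeholders and 7480 is the default civetweb port):

frontend rgw_frontend
    bind 192.0.2.10:80
    default_backend rgw_backend

backend rgw_backend
    balance roundrobin
    # simple health check: only route to an RGW that answers HTTP
    option httpchk GET /
    server rgw1 192.0.2.11:7480 check
    server rgw2 192.0.2.12:7480 check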

On Wed, Jun 20, 2018 at 12:08 PM Simon Ironside 
wrote:

> Hi,
>
> Perhaps not optimal nor exactly what you want but round robin DNS works
> with two (or more) vanilla radosgw servers ok for me as a very rudimentary
> form of failover and load balancing.
>
> If you wanted active/standby you could use something like pacemaker to
> start services and move the vIP around, you don't need DRBD or similar to
> keep things in sync as all the data is obviously in ceph.
>
>
> Simon
>
>
> On 20/06/18 16:44, nigel davies wrote:
>
> Hay All
>
> Has any one, done or working a way to do S3(radosgw) failover.
>
> I am trying to work out away to have 2 radosgw servers, with an VIP
> when one server goes down it will go over to the other.
>
> I am trying this with CTDB, but while testing the upload can fail and then
> carry on or just hand and time out.
>
> Any advise on this would be grateful as i am lousing my mind.
>
> Thanks Nigdav007
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hi Igor,
Great! Thanks for the quick response.

Will try the fix and let you know how it goes.

-Raj
From: ceph-users  On Behalf Of Igor Fedotov
Sent: 20 June 2018 21:55
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] issues with ceph nautilus version


Hi Raju,



This is a bug in BlueStore's new bitmap allocator.

This PR will most probably fix that:

https://github.com/ceph/ceph/pull/22610



Also you may try to switch bluestore and bluefs allocators (bluestore_allocator 
and bluefs_allocator parameters respectively) to stupid and restart OSDs.

This should help.



Thanks,

Igor

On 6/20/2018 6:41 PM, Raju Rangoju wrote:
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - 
nautilus (dev). After this, I noticed some weird data usage numbers on the 
cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much more than what is available

usage:   16 EiB used, 164 TiB / 158 TiB avail

Before this upgrade, it used to report correctly:

usage:   1.10T used, 157T / 158T avail

  2.  It reports that all the OSDs/pools are full

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj





___

ceph-users mailing list

ceph-users@lists.ceph.com

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Igor Fedotov

Hi Raju,


This is a bug in BlueStore's new bitmap allocator.

This PR will most probably fix that:

https://github.com/ceph/ceph/pull/22610


Also you may try to switch bluestore and bluefs allocators 
(bluestore_allocator and bluefs_allocator parameters respectively) to 
stupid and restart OSDs.


This should help.
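
A minimal ceph.conf sketch of that workaround, assuming you set it cluster-wide 
in the [osd] section (adjust to your own config management):

[osd]
# fall back from the new bitmap allocator to the older "stupid" allocator
bluestore_allocator = stupid
bluefs_allocator = stupid

Then restart the OSDs on each node, e.g. "systemctl restart ceph-osd.target".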


Thanks,

Igor


On 6/20/2018 6:41 PM, Raju Rangoju wrote:


Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 
14.0.0 - nautilus (dev). After this, I noticed some weird data usage 
numbers on the cluster.

Here are the issues I'm seeing...

 1. The data usage reported is much more than what is available

usage:   16 EiB used, 164 TiB / 158 TiB avail

Before this upgrade, it used to report correctly:

usage:   1.10T used, 157T / 158T avail

 2. It reports that all the OSDs/pools are full

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version

ceph version 14.0.0-480-g6c1e8ee 
(6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) nautilus (dev)


[root@hadoop1 my-ceph]# ceph -s

  cluster:

    id: ee4660fd-167b-42e6-b27b-126526dab04d

    health: HEALTH_ERR

    87 full osd(s)

    11 pool(s) full

  services:

    mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6

    mgr: hadoop6(active), standbys: hadoop1, hadoop4

    mds: cephfs-1/1/1 up {0=hadoop3=up:creating}, 2 up:standby

    osd: 88 osds: 87 up, 87 in

  data:

    pools:   11 pools, 32588 pgs

    objects: 0  objects, 0 B

    usage:   16 EiB used, 164 TiB / 158 TiB avail

    pgs: 32588 active+clean

Thanks in advance

-Raj



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] issues with ceph nautilus version

2018-06-20 Thread Paul Emmerich
I've also seen something similar with Luminous once: broken OSDs reporting
nonsense stats that overflowed some variables and reported 1000% full.

In my case it was BlueStore OSDs running on VMs that were too small.

Paul


2018-06-20 17:41 GMT+02:00 Raju Rangoju :

> Hi,
>
>
>
> Recently I have upgraded my ceph cluster to version 14.0.0 - nautilus(dev)
> from ceph version 13.0.1, after this, I noticed some weird data usage
> numbers on the cluster.
>
> Here are the issues I’m seeing…
>
>1. The data usage reported is much more than what is available
>
> usage:   16 EiB used, 164 TiB / 158 TiB avail
>
>
>
> before this upgradation, it used to report properly
>
> usage:   1.10T used, 157T / 158T avail
>
>
>
>1. it reports that all the osds/pool are full
>
>
>
> Can someone please shed some light? Any helps is greatly appreciated.
>
>
>
> [root@hadoop1 my-ceph]# ceph --version
>
> ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9)
> nautilus (dev)
>
>
>
> [root@hadoop1 my-ceph]# ceph -s
>
>   cluster:
>
> id: ee4660fd-167b-42e6-b27b-126526dab04d
>
> health: HEALTH_ERR
>
> 87 full osd(s)
>
> 11 pool(s) full
>
>
>
>   services:
>
> mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
>
> mgr: hadoop6(active), standbys: hadoop1, hadoop4
>
> mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
>
> osd: 88 osds: 87 up, 87 in
>
>
>
>   data:
>
> pools:   11 pools, 32588 pgs
>
> objects: 0  objects, 0 B
>
> usage:   16 EiB used, 164 TiB / 158 TiB avail
>
> pgs: 32588 active+clean
>
>
>
> Thanks in advance
>
> -Raj
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfill stops after a while after OSD reweight

2018-06-20 Thread Paul Emmerich
Yeah, your tunables are ancient. This probably wouldn't have happened with
modern ones.
If this were my cluster I would probably update the clients and then update
the tunables (caution: lots of data movement!), but I know how annoying it
can be to chase down everyone who runs ancient clients.

For comparison, this is what a fresh installation of Luminous looks like:
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 1,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "jewel",
"optimal_tunables": 1,
"legacy_tunables": 0,
"minimum_required_version": "jewel",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 1,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 1,
"has_v5_rules": 0
}


For a workaround/fix, I'd probably figure out which settings can be adjusted
without breaking the oldest clients. Incrementing choose*tries in the crush
rule or in the tunables is probably sufficient.
But since you are apparently running into data balance problems you'll have
to update the tunables to something more modern sooner or later.

You can also play around with crushtool; it can simulate how PGs are mapped,
which is usually better than changing random things on a production cluster:
http://docs.ceph.com/docs/mimic/man/8/crushtool/
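
As a rough sketch of that workflow (rule id and replica count below are
placeholders, adjust to your pool):

# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt (e.g. bump the tries), then recompile and test offline
crushtool -c crushmap.txt -o crushmap-new.bin
crushtool -i crushmap-new.bin --test --rule 0 --num-rep 3 --show-bad-mappings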

Good luck


Paul

2018-06-20 17:57 GMT+02:00 Oliver Schulz :

> Hi Paul,
>
> ah, right, "ceph pg dump | grep remapped", that's what I was looking
> for. I added the output and the result of the pg query at the end of
>
> https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb
>
>
> > But my guess here is that you are running a CRUSH rule to distribute
> across 3 racks
> > and you only have 3 racks in total.
>
> Yes - I always assumed that 3 failure domains would be suitable
> for replication factor of 3. The three racks are absolutely
> identical, though, hardware-wise, including HDD sizes, and we
> never had any trouble like this before Luminous (we often used
> significant reweighting in the past).
>
> We are way behind on Ceph tunables though:
>
> # ceph osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 0,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "bobtail",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "bobtail",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 0,
> "has_v3_rules": 0,
> "has_v4_buckets": 0,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }
>
> We still have some old clients (trying to get rid of those, so I
> can activate more recent tunables, but it may be a while) ...
>
> Are my tunables at fault? If so, can you recommend a solution
> or a temporary workaround?
>
>
> Cheers (and thanks for helping!),
>
> Oliver
>
>
>
>
> On 06/20/2018 05:01 PM, Paul Emmerich wrote:
>
>> Hi,
>>
>> have a look at "ceph pg dump" to see which ones are stuck in remapped.
>>
>> But my guess here is that you are running a CRUSH rule to distribute
>> across 3 racks
>> and you only have 3 racks in total.
>> CRUSH will sometimes fail to find a mapping in this scenario. There are a
>> few parameters
>> that you can tune in your CRUSH rule to increase the number of retries.
>> For example, the settings set_chooseleaf_tries and set_choose_tries can
>> help, they are
>> set by default for erasure coding rules (where this scenario is more
>> common). Values used
>> for EC are set_chooseleaf_tries = 5 and set_choose_tries = 100.
>> You can configure them by adding them as the first steps of the rule.
>>
>> You can also configure an upmap exception.
>>
>> But in general it is often not the best idea to have only 3 racks for
>> replica = 3 if you want
>> to achieve a good data balance.
>>
>>
>>
>> Paul
>>
>>
>> 2018-06-20 16:50 GMT+02:00 Oliver Schulz :
>>
>> Dear Paul,
>>
>> thanks, here goes (output of "ceph -s", etc.):
>>
>> https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb
>>
>> > Also please run "ceph pg X.YZ query" on one of the PGs not
>> backfilling.
>>
>> Silly question: How do I get a list of the PGs not backfilling?
>>
>>
>>
>> On 06/20/2018 04:00 PM, Paul Emmerich wrote:
>>
>> Can you post the full output of "ceph -s", "ceph health detail,
>> and ceph osd df tree
>> Also please run "ceph pg X.YZ query" on one of the PGs not
>> backfilling.
>>
>>
>> Paul
>>
>> 2018-06-20 15:25 GMT+02:00 Oliver 

Re: [ceph-users] Backfill stops after a while after OSD reweight

2018-06-20 Thread Oliver Schulz

Hi Paul,

ah, right, "ceph pg dump | grep remapped", that's what I was looking
for. I added the output and the result of the pg query at the end of

https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb


> But my guess here is that you are running a CRUSH rule to distribute across 3 
racks
> and you only have 3 racks in total.

Yes - I always assumed that 3 failure domains would be suitable
for a replication factor of 3. The three racks are absolutely
identical, though, hardware-wise, including HDD sizes, and we
never had any trouble like this before Luminous (we often used
significant reweighting in the past).

We are way behind on Ceph tunables though:

# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 0,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 22,
"profile": "bobtail",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "bobtail",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 0,
"has_v3_rules": 0,
"has_v4_buckets": 0,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}

We still have some old clients (trying to get rid of those, so I
can activate more recent tunables, but it may be a while) ...

Are my tunables at fault? If so, can you recommend a solution
or a temporary workaround?


Cheers (and thanks for helping!),

Oliver




On 06/20/2018 05:01 PM, Paul Emmerich wrote:

Hi,

have a look at "ceph pg dump" to see which ones are stuck in remapped.

But my guess here is that you are running a CRUSH rule to distribute across 3 
racks
and you only have 3 racks in total.
CRUSH will sometimes fail to find a mapping in this scenario. There are a few 
parameters
that you can tune in your CRUSH rule to increase the number of retries.
For example, the settings set_chooseleaf_tries and set_choose_tries can help, 
they are
set by default for erasure coding rules (where this scenario is more common). 
Values used
for EC are set_chooseleaf_tries = 5 and set_choose_tries = 100.
You can configure them by adding them as the first steps of the rule.

You can also configure an upmap exception.

But in general it is often not the best idea to have only 3 racks for replica = 
3 if you want
to achieve a good data balance.



Paul


2018-06-20 16:50 GMT+02:00 Oliver Schulz :

Dear Paul,

thanks, here goes (output of "ceph -s", etc.):

https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb 


> Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.

Silly question: How do I get a list of the PGs not backfilling?



On 06/20/2018 04:00 PM, Paul Emmerich wrote:

Can you post the full output of "ceph -s", "ceph health detail, and 
ceph osd df tree
Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.


Paul

2018-06-20 15:25 GMT+02:00 Oliver Schulz :

     Dear all,

     we (somewhat) recently extended our Ceph cluster,
     and updated it to Luminous. By now, the fill level
     on some ODSs is quite high again, so I'd like to
     re-balance via "OSD reweight".

     I'm running into the following problem, however:
     Not matter what I do (reweigt a little, or a lot,
     or only reweight a single OSD by 5%) - after a
     while, backfilling simply stops and lots of objects
     stay misplaced.

     I do have up to 250 PGs per OSD (early sins from
     the first days of the cluster), but I've set
     "mon_max_pg_per_osd = 400" and
     "osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.

     How can I find out why backfill stops? Any advice
     would be very much appreciated.


     Cheers,

     Oliver
     ___
     ceph-users mailing list
     ceph-users@lists.ceph.com
     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Paul Emmerich


Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

[ceph-users] radosgw failover help

2018-06-20 Thread nigel davies
Hay All

Has anyone done, or is anyone working on, a way to do S3 (radosgw) failover?

I am trying to work out a way to have 2 radosgw servers with a VIP, so that
when one server goes down it will fail over to the other.

I am trying this with CTDB, but while testing, the upload can fail and then
carry on, or just hang and time out.

Any advice on this would be appreciated as I am losing my mind.

Thanks Nigdav007
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issues with ceph nautilus version

2018-06-20 Thread Raju Rangoju
Hi,

I recently upgraded my ceph cluster from version 13.0.1 to version 14.0.0 - 
nautilus (dev). After this, I noticed some weird data usage numbers on the 
cluster.
Here are the issues I'm seeing...

  1.  The data usage reported is much more than what is available

usage:   16 EiB used, 164 TiB / 158 TiB avail

Before this upgrade, it used to report correctly:

usage:   1.10T used, 157T / 158T avail

  2.  It reports that all the OSDs/pools are full

Can someone please shed some light? Any help is greatly appreciated.

[root@hadoop1 my-ceph]# ceph --version
ceph version 14.0.0-480-g6c1e8ee (6c1e8ee14f9b25dc96684dbc1f8c8255c47f0bb9) 
nautilus (dev)

[root@hadoop1 my-ceph]# ceph -s
  cluster:
id: ee4660fd-167b-42e6-b27b-126526dab04d
health: HEALTH_ERR
87 full osd(s)
11 pool(s) full

  services:
mon: 3 daemons, quorum hadoop1,hadoop4,hadoop6
mgr: hadoop6(active), standbys: hadoop1, hadoop4
mds: cephfs-1/1/1 up  {0=hadoop3=up:creating}, 2 up:standby
osd: 88 osds: 87 up, 87 in

  data:
pools:   11 pools, 32588 pgs
objects: 0  objects, 0 B
usage:   16 EiB used, 164 TiB / 158 TiB avail
pgs: 32588 active+clean

Thanks in advance
-Raj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS Dojo at CERN

2018-06-20 Thread Dan van der Ster
And BTW, if you can't make it to this event we're in the early days of
planning a dedicated Ceph + OpenStack Days at CERN around May/June
2019.
More news on that later...

-- Dan @ CERN


On Tue, Jun 19, 2018 at 10:23 PM Leonardo Vaz  wrote:
>
> Hey Cephers,
>
> We will join our friends from OpenStack and CentOS projects at CERN in
> Geneva on October 19th for the CentOS Dojo:
>
>https://blog.centos.org/2018/05/cern-dojo-october-19th-2018/
>
> The call for papers is currently open and more details about the event
> are available on the URL above.
>
> Kindest regards,
>
> Leo
>
> --
> Leonardo Vaz
> Ceph Community Manager
> Open Source and Standards Team
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfill stops after a while after OSD reweight

2018-06-20 Thread Paul Emmerich
Hi,

have a look at "ceph pg dump" to see which ones are stuck in remapped.

But my guess here is that you are running a CRUSH rule to distribute across
3 racks
and you only have 3 racks in total.
CRUSH will sometimes fail to find a mapping in this scenario. There are a
few parameters
that you can tune in your CRUSH rule to increase the number of retries.
For example, the settings set_chooseleaf_tries and set_choose_tries can
help, they are
set by default for erasure coding rules (where this scenario is more
common). Values used
for EC are set_chooseleaf_tries = 5 and set_choose_tries = 100.
You can configure them by adding them as the first steps of the rule.
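
For illustration, a replicated rule with those steps added up front might look
roughly like this (rule name, id and the rack failure domain are placeholders
for whatever your existing rule uses):

rule replicated_racks {
        id 1
        type replicated
        min_size 1
        max_size 10
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}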

You can also configure an upmap exception.

But in general it is often not the best idea to have only 3 racks for
replica = 3 if you want
to achieve a good data balance.



Paul


2018-06-20 16:50 GMT+02:00 Oliver Schulz :

> Dear Paul,
>
> thanks, here goes (output of "ceph -s", etc.):
>
> https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb
>
> > Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.
>
> Silly question: How do I get a list of the PGs not backfilling?
>
>
>
> On 06/20/2018 04:00 PM, Paul Emmerich wrote:
>
>> Can you post the full output of "ceph -s", "ceph health detail, and ceph
>> osd df tree
>> Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.
>>
>>
>> Paul
>>
>> 2018-06-20 15:25 GMT+02:00 Oliver Schulz :
>>
>> Dear all,
>>
>> we (somewhat) recently extended our Ceph cluster,
>> and updated it to Luminous. By now, the fill level
>> on some ODSs is quite high again, so I'd like to
>> re-balance via "OSD reweight".
>>
>> I'm running into the following problem, however:
>> Not matter what I do (reweigt a little, or a lot,
>> or only reweight a single OSD by 5%) - after a
>> while, backfilling simply stops and lots of objects
>> stay misplaced.
>>
>> I do have up to 250 PGs per OSD (early sins from
>> the first days of the cluster), but I've set
>> "mon_max_pg_per_osd = 400" and
>> "osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.
>>
>> How can I find out why backfill stops? Any advice
>> would be very much appreciated.
>>
>>
>> Cheers,
>>
>> Oliver
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io 
>> Tel: +49 89 1896585 90
>>
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] Ceph Tech Talk Calendar

2018-06-20 Thread Lenz Grimmer
Hi Leo,

On 06/20/2018 01:47 AM, Leonardo Vaz wrote:

> We created the following etherpad to organize the calendar for the
> future Ceph Tech Talks.
> 
> For the Ceph Tech Talk of June 28th our fellow George Mihaiescu will
> tell us how Ceph is being used on cancer research at OICR (Ontario
> Institute for Cancer Research).
> 
> If you're interested to contribute, please choose one of the available
> dates, add the topic you want to present and your name (or feel free
> to contact me).
> 
>   https://pad.ceph.com/p/ceph-tech-talks-2018

Hmm. IMHO, an Etherpad is somewhat too volatile for this kind of
information, but it of course makes it easier for others to submit
proposals.

If maintaining this on https://ceph.com/ceph-tech-talks/ is too
difficult, would it make sense to use the Wiki for this instead?

In any case, the page at which this information is collected should be
linked from prominent places, e.g. ceph.com/community or
https://tracker.ceph.com/projects/ceph/wiki/Community

Thanks,

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Important] Ceph Developer Monthly of July 2018

2018-06-20 Thread Leonardo Vaz
Hi Cephers,

Due to the July 4th holiday in the US, we are postponing the Ceph Developer
Monthly meeting to July 11th.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-20 Thread Andrei Mikhailovsky
Hi Brad,

Yes, but it doesn't show much:

ceph pg 18.2 query
Error EPERM: problem getting command descriptions from pg.18.2

Cheers



- Original Message -
> From: "Brad Hubbard" 
> To: "andrei" 
> Cc: "ceph-users" 
> Sent: Wednesday, 20 June, 2018 00:02:07
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Can you post the output of a pg query?
> 
> On Tue, Jun 19, 2018 at 11:44 PM, Andrei Mikhailovsky  
> wrote:
>> A quick update on my issue. I have noticed that while I was trying to move
>> the problem object on osds, the file attributes got lost on one of the osds,
>> which is I guess why the error messages showed the no attribute bit.
>>
>> I then copied the attributes metadata to the problematic object and
>> restarted the osds in question. Following a pg repair I got a different
>> error:
>>
>> 2018-06-19 13:51:05.846033 osd.21 osd.21 192.168.168.203:6828/24339 2 :
>> cluster [ERR] 18.2 shard 21: soid 18:45f87722:::.dir.default.80018061.2:head
>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>> alloc_hint [0 0 0])
>> 2018-06-19 13:51:05.846042 osd.21 osd.21 192.168.168.203:6828/24339 3 :
>> cluster [ERR] 18.2 shard 28: soid 18:45f87722:::.dir.default.80018061.2:head
>> omap_digest 0x25e8a1da != omap_digest 0x21c7f871 from auth oi
>> 18:45f87722:::.dir.default.80018061.2:head(106137'603495 osd.21.0:41403910
>> dirty|omap|data_digest|omap_digest s 0 uv 603494 dd  od 21c7f871
>> alloc_hint [0 0 0])
>> 2018-06-19 13:51:05.846046 osd.21 osd.21 192.168.168.203:6828/24339 4 :
>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>> to pick suitable auth object
>> 2018-06-19 13:51:05.846118 osd.21 osd.21 192.168.168.203:6828/24339 5 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>> attr
>> 2018-06-19 13:51:05.846129 osd.21 osd.21 192.168.168.203:6828/24339 6 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>> 'snapset' attr
>> 2018-06-19 13:51:09.810878 osd.21 osd.21 192.168.168.203:6828/24339 7 :
>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>>
>> It mentions that there is an incorrect omap_digest . How do I go about
>> fixing this?
>>
>> Cheers
>>
>> 
>>
>> From: "andrei" 
>> To: "ceph-users" 
>> Sent: Tuesday, 19 June, 2018 11:16:22
>> Subject: [ceph-users] fixing unrepairable inconsistent PG
>>
>> Hello everyone
>>
>> I am having trouble repairing one inconsistent and stubborn PG. I get the
>> following error in ceph.log:
>>
>>
>>
>> 2018-06-19 11:00:00.000225 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 675 : cluster [ERR] overall HEALTH_ERR noout flag(s)
>> set; 4 scrub errors; Possible data damage: 1 pg inconsistent; application
>> not enabled on 4 pool(s)
>> 2018-06-19 11:09:24.586392 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 841 : cluster [ERR] Health check update: Possible
>> data damage: 1 pg inconsistent, 1 pg repair (PG_DAMAGED)
>> 2018-06-19 11:09:27.139504 osd.21 osd.21 192.168.168.203:6828/4003 2 :
>> cluster [ERR] 18.2 soid 18:45f87722:::.dir.default.80018061.2:head: failed
>> to pick suitable object info
>> 2018-06-19 11:09:27.139545 osd.21 osd.21 192.168.168.203:6828/4003 3 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no '_'
>> attr
>> 2018-06-19 11:09:27.139550 osd.21 osd.21 192.168.168.203:6828/4003 4 :
>> cluster [ERR] repair 18.2 18:45f87722:::.dir.default.80018061.2:head no
>> 'snapset' attr
>>
>> 2018-06-19 11:09:35.484402 osd.21 osd.21 192.168.168.203:6828/4003 5 :
>> cluster [ERR] 18.2 repair 4 errors, 0 fixed
>> 2018-06-19 11:09:40.601657 mon.arh-ibstorage1-ib mon.0
>> 192.168.168.201:6789/0 844 : cluster [ERR] Health check update: Possible
>> data damage: 1 pg inconsistent (PG_DAMAGED)
>>
>>
>> I have tried to follow a few instructions on the PG repair, including
>> removal of the 'broken' object .dir.default.80018061.2
>>  from primary osd following by the pg repair. After that didn't work, I've
>> done the same for the secondary osd. Still the same issue.
>>
>> Looking at the actual object on the file system, the file size is 0 for both
>> primary and secondary objects. The md5sum is the same too. The broken PG
>> belongs to the radosgw bucket called .rgw.buckets.index
>>
>> What else can I try to get the thing fixed?
>>
>> Cheers
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 
> --
> Cheers,
> Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Backfill stops after a while after OSD reweight

2018-06-20 Thread Paul Emmerich
Can you post the full output of "ceph -s", "ceph health detail, and ceph
osd df tree
Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.


Paul

2018-06-20 15:25 GMT+02:00 Oliver Schulz :

> Dear all,
>
> we (somewhat) recently extended our Ceph cluster,
> and updated it to Luminous. By now, the fill level
> on some ODSs is quite high again, so I'd like to
> re-balance via "OSD reweight".
>
> I'm running into the following problem, however:
> Not matter what I do (reweigt a little, or a lot,
> or only reweight a single OSD by 5%) - after a
> while, backfilling simply stops and lots of objects
> stay misplaced.
>
> I do have up to 250 PGs per OSD (early sins from
> the first days of the cluster), but I've set
> "mon_max_pg_per_osd = 400" and
> "osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.
>
> How can I find out why backfill stops? Any advice
> would be very much appreciated.
>
>
> Cheers,
>
> Oliver
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EPEL dependency on CENTOS

2018-06-20 Thread Alfredo Deza
On Wed, Jun 20, 2018 at 7:27 AM, Bernhard Dick  wrote:
> Hi,
>
> I'm experimenting with CEPH and have seen that ceph-deploy and ceph-ansible
> have the EPEL repositories as requirement, when installing CEPH on CENTOS
> hosts. Due to the nature of the EPEL repos this might cause trouble (i.e.
> when combining CEPH with oVirt on the same host).
> When using the CEPH repos from the storage-SIG of centos EPEL is not needed,
> so I'm asking whether it is still required to explicitly require
> installation of the EPEL repository when using the official ways of
> installing CEPH?

For vanilla CentOS and download.ceph.com packages, yes, you need EPEL.

If you are using ceph-deploy, you can configure it to use a different
repo (e.g. storage-SIG) with the `--repo-url` flag, or
by configuring the cephdeploy.conf file (see
http://docs.ceph.com/ceph-deploy/docs/conf.html )
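
A quick sketch of the flag-based variant (both URLs are examples/assumptions,
double-check the Storage SIG path and key for your release):

ceph-deploy install \
  --repo-url http://mirror.centos.org/centos/7/storage/x86_64/ceph-luminous/ \
  --gpg-url https://www.centos.org/keys/RPM-GPG-KEY-CentOS-SIG-Storage \
  node1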

That would skip installing EPEL because it is understood that a user
wants to have explicit control on the source of packages.

Unsure for ceph-ansible except for what you've tried (cc'ing the
ceph-ansible list)

> My solution for it was changing the centos_package_dependencies list in the
> ceph-ansible tree by replacing epel-release with
> centos-release-ceph-luminous and setting ceph_origin to distro.
> Would it be an idea to add direct support for the centos SIG packages in
> ceph-ansible or to decide on when to install epel based on the used package
> repository?
>
>   Regards
> Bernhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Backfill stops after a while after OSD reweight

2018-06-20 Thread Oliver Schulz

Dear all,

we (somewhat) recently extended our Ceph cluster,
and updated it to Luminous. By now, the fill level
on some OSDs is quite high again, so I'd like to
re-balance via "OSD reweight".

I'm running into the following problem, however:
No matter what I do (reweight a little, or a lot,
or only reweight a single OSD by 5%) - after a
while, backfilling simply stops and lots of objects
stay misplaced.

I do have up to 250 PGs per OSD (early sins from
the first days of the cluster), but I've set
"mon_max_pg_per_osd = 400" and
"osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.

How can I find out why backfill stops? Any advice
would be very much appreciated.


Cheers,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS: journaler.pq decode error

2018-06-20 Thread Benjeman Meekhof
Thanks for the response.  I was also hoping to be able to debug better
once we got onto Mimic.  We just finished that upgrade yesterday, and
cephfs-journal-tool does find a corruption in the purge queue, though
our MDS continues to start up and the filesystem appears to be
functional as usual.

How can I modify the purge queue to remove damaged sections?  Is there
some way to scan known FS objects and remove any that might now be
orphaned once the damage is removed/repaired?

# cephfs-journal-tool --journal=purge_queue journal inspect

Overall journal integrity: DAMAGED
Corrupt regions:
  0x6819f8-681a55

# cephfs-journal-tool --journal=purge_queue header get

{
"magic": "ceph fs volume v011",
"write_pos": 203357732,
"expire_pos": 6822392,
"trimmed_pos": 4194304,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"pool_id": 64,
"pool_ns": ""
}
}

thanks,
Ben

On Fri, Jun 15, 2018 at 11:54 AM, John Spray  wrote:
> On Fri, Jun 15, 2018 at 2:55 PM, Benjeman Meekhof  wrote:
>> Have seen some posts and issue trackers related to this topic in the
>> past but haven't been able to put it together to resolve the issue I'm
>> having.  All on Luminous 12.2.5 (upgraded over time from past
>> releases).  We are going to upgrade to Mimic near future if that would
>> somehow resolve the issue.
>>
>> Summary:
>>
>> 1.  We have a CephFS data pool which has steadily and slowly grown in
>> size without corresponding writes to the directory placed on it - a
>> plot of usage over a few hours shows a very regular upward rate of
>> increase.   The pool is now 300TB vs 16TB of actual space used in
>> directory.
>>
>> 2.  Reading through some email posts and issue trackers led me to
>> disabling 'standby replay' though we are not and have not ever used
>> snapshots.   Disabling that feature on our 3 MDS stopped the steady
>> climb.  However the pool remains with 300TB of unaccounted for space
>> usage.  http://tracker.ceph.com/issues/19593 and
>> http://tracker.ceph.com/issues/21551
>
> This is pretty strange -- if you were already on 12.2.5 then the
> http://tracker.ceph.com/issues/19593 should have been fixed and
> switching standby replays on/off shouldn't make a difference (unless
> there's some similar bug that crept back into luminous).
>
>> 3.   I've never had any issue starting the MDS or with filesystem
>> functionality but looking through the mds logs I see a single
>> 'journaler.pg(rw) _decode error from assimilate_prefetch' at every
>> startup.  A log snippet with context is below with debug_mds and
>> debug_journaler at 20.
>
> This message suggests that the purge queue has been corrupted, but the
> MDS is ignoring this -- something is wrong with the error handling.
> The MDS should be marked damaged when something like this happens, but
> in this case PurgeQueue is apparently dropping the error on the floor
> after it gets logged by Journaler.  I've opened a ticket+PR for the
> error handling here: http://tracker.ceph.com/issues/24533 (however,
> the loading path in PurgeQueue::_recover *does* have error handling so
> I'm not clear why that isn't happening in your case).
>
> I believe cephfs-journal-tool in mimic was enhanced to be able to
> optionally operate on the purge queue as well as the metadata journal
> (they use the same underlying format), so upgrading to mimic would
> give you better tooling for debugging this.
>
> John
>
>
>> As noted, there is at least one past email thread on the topic but I'm
>> not quite having the same issue as this person and I couldn't glean
>> any information as to what I should do to repair this error and get
>> stale objects purged from this pool (if that is in fact the issue):
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021379.html
>>
>> Any thoughts on troubleshooting steps I could try next?
>>
>> Here is the log snippet:
>>
>> 2018-06-15 09:14:50.746831 7fb47251b700 20 mds.0.journaler.pq(rw)
>> write_buf_throttle get, delta 101
>> 2018-06-15 09:14:50.746835 7fb47251b700 10 mds.0.journaler.pq(rw)
>> append_entry len 81 to 88121773~101
>> 2018-06-15 09:14:50.746838 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> 2018-06-15 09:14:50.746863 7fb47251b700 20 mds.0.journaler.pq(rw)
>> write_buf_throttle get, delta 101
>> 2018-06-15 09:14:50.746864 7fb47251b700 10 mds.0.journaler.pq(rw)
>> append_entry len 81 to 88121874~101
>> 2018-06-15 09:14:50.746867 7fb47251b700 10 mds.0.journaler.pq(rw) _prefetch
>> 2018-06-15 09:14:50.746901 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> _finish_read got 6822392~1566216
>> 2018-06-15 09:14:50.746909 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> _assimilate_prefetch 6822392~1566216
>> 2018-06-15 09:14:50.746911 7fb46fd16700 10 mds.0.journaler.pq(rw)
>> _assimilate_prefetch gap of 4194304 from received_pos 8388608 to first
>> prefetched buffer 12582912
>> 2018-06-15 09:14:50.746913 7fb46fd16700 10 

[ceph-users] Fwd: Planning all flash cluster

2018-06-20 Thread Luis Periquito
adding back in the list :)

-- Forwarded message -
From: Luis Periquito 
Date: Wed, Jun 20, 2018 at 1:54 PM
Subject: Re: [ceph-users] Planning all flash cluster
To: 


On Wed, Jun 20, 2018 at 1:35 PM Nick A  wrote:
>
> Thank you, I was under the impression that 4GB RAM per 1TB was quite 
> generous, or is that not the case with all flash clusters? What's the 
> recommended RAM per OSD currently? Happy to throw more at it for a 
> performance boost. The important thing is that I'd like all nodes to be 
> absolutely identical.
I'm doing 8G per OSD, though I use 1.9T SSDs.

>
> Based on replies so far, it looks like 5 nodes might be a better idea, maybe 
> each with 14 OSD's (960GB SSD's)? Plenty of 16 slot 2U chassis around to make 
> it a no brainer if that's what you'd recommend!
I tend to add more nodes: 1U chassis with 4-8 SSDs to start with, using a
single CPU with a high clock frequency. For IOPS/latency, CPU frequency is
really important.
I have started a cluster that only has 2 SSDs (which I share with the
OS) for data, but has 8 nodes. Those servers can take up to 10 drives.

I'm using the Fujitsu RX1330 (I believe the Dell equivalent would be the
R330) with an Intel E3-1230v6 CPU, 64G of RAM, dual 10G and a PSAS
(passthrough) controller.

>
> The H710 doesn't do JBOD or passthrough, hence looking for an alternative 
> HBA. It would be nice to do the boot drives as hardware RAID 1 though, so a 
> card that can do both at the same time (like the H730 found R630's etc) would 
> be ideal.
>
> Regards,
> Nick
>
> On 20 June 2018 at 13:18, Luis Periquito  wrote:
>>
>> Adding more nodes from the beginning would probably be a good idea.
>>
>> On Wed, Jun 20, 2018 at 12:58 PM Nick A  wrote:
>> >
>> > Hello Everyone,
>> >
>> > We're planning a small cluster on a budget, and I'd like to request any 
>> > feedback or tips.
>> >
>> > 3x Dell R720XD with:
>> > 2x Xeon E5-2680v2 or very similar
>> The CPUs look good and sufficiently fast for IOPS.
>>
>> > 96GB RAM
>> 4GB per OSD looks a bit on the short side. Probably 192G would help.
>>
>> > 2x Samsung SM863 240GB boot/OS drives
>> > 4x Samsung SM863 960GB OSD drives
>> > Dual 40/56Gbit Infiniband using IPoIB.
>> >
>> > 3 replica, MON on OSD nodes, RBD only (no object or CephFS).
>> >
>> > We'll probably add another 2 OSD drives per month per node until full (24 
>> > SSD's per node), at which point, more nodes. We've got a few SM863's in 
>> > production on other system and are seriously impressed with them, so would 
>> > like to use them for Ceph too.
>> >
>> > We're hoping this is going to provide a decent amount of IOPS, 20k would 
>> > be ideal. I'd like to avoid NVMe Journals unless it's going to make a 
>> > truly massive difference. Same with carving up the SSD's, would rather 
>> > not, and just keep it as simple as possible.
>> I agree: those SSDs shouldn't really require a journal device. Not
>> sure about the 20k IOPS specially without any further information.
>> Doing 20k IOPS at 1kB block is totally different at 1MB block...
>> >
>> > Is there anything that obviously stands out as severely unbalanced? The 
>> > R720XD comes with a H710 - instead of putting them in RAID0, I'm thinking 
>> > a different HBA might be a better idea, any recommendations please?
>> Don't know that HBA. Does it support pass through mode or HBA mode?
>> >
>> > Regards,
>> > Nick
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Index rapidly expanding post tunables update (12.2.5)

2018-06-20 Thread Sean Redmond
Hi,

It sounds like the .rgw.bucket.index pool has grown maybe due to some
problem with dynamic bucket resharding.

I wonder if the (stale/old/unused) bucket indexes need to be purged
using something like the below:

radosgw-admin bi purge --bucket= --bucket-id=

Not sure how you would find the old_bucket_id however.
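
(A possible starting point, untested, to spot stale instance ids:

radosgw-admin bucket stats --bucket=<name> | grep '"id"'
radosgw-admin metadata list bucket.instance

Instance ids that no longer match any bucket's current id would be the
candidates.)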

Thanks

[1]
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/object_gateway_guide_for_ubuntu/administration_cli


On Wed, Jun 20, 2018 at 12:34 PM, Tom W  wrote:

> Hi all,
>
>
>
> We have recently upgraded from Jewel (10.2.10) to Luminous (12.2.5) and
> after this we decided to update our tunables configuration to the optimals,
> which were previously at Firefly. During this process, we have noticed the
> OSDs (bluestore) rapidly filling on the RGW index and GC pool. We estimated
> the index to consume around 30G of space and the GC negligible, but they
> are now filling all 4 OSDs per host which contain 2TB SSDs in each.
>
>
>
> Does anyone have any experience with this, or how to determine why the
> sudden growth has been encountered during recovery after the tunables
> update?
>
>
>
> We have disabled resharding activity due to this issue,
> https://tracker.ceph.com/issues/24551 and our gc queue is only a few
> items at present.
>
>
>
> Kind Regards,
>
>
>
> Tom
>
> --
>
> NOTICE AND DISCLAIMER
> This e-mail (including any attachments) is intended for the above-named
> person(s). If you are not the intended recipient, notify the sender
> immediately, delete this email from your system and do not disclose or use
> for any purpose. We may monitor all incoming and outgoing emails in line
> with current legislation. We have taken steps to ensure that this email and
> attachments are free from any virus, but it remains your responsibility to
> ensure that viruses do not adversely affect you
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Luis Periquito
Adding more nodes from the beginning would probably be a good idea.

On Wed, Jun 20, 2018 at 12:58 PM Nick A  wrote:
>
> Hello Everyone,
>
> We're planning a small cluster on a budget, and I'd like to request any 
> feedback or tips.
>
> 3x Dell R720XD with:
> 2x Xeon E5-2680v2 or very similar
The CPUs look good and sufficiently fast for IOPS.

> 96GB RAM
4GB per OSD looks a bit on the short side. Probably 192G would help.

> 2x Samsung SM863 240GB boot/OS drives
> 4x Samsung SM863 960GB OSD drives
> Dual 40/56Gbit Infiniband using IPoIB.
>
> 3 replica, MON on OSD nodes, RBD only (no object or CephFS).
>
> We'll probably add another 2 OSD drives per month per node until full (24 
> SSD's per node), at which point, more nodes. We've got a few SM863's in 
> production on other system and are seriously impressed with them, so would 
> like to use them for Ceph too.
>
> We're hoping this is going to provide a decent amount of IOPS, 20k would be 
> ideal. I'd like to avoid NVMe Journals unless it's going to make a truly 
> massive difference. Same with carving up the SSD's, would rather not, and 
> just keep it as simple as possible.
I agree: those SSDs shouldn't really require a journal device. Not
sure about the 20k IOPS, especially without any further information.
Doing 20k IOPS at a 1kB block size is totally different from doing it at a
1MB block size...
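A quick way to sanity-check that, as a sketch (pool name, runtime and thread
count are placeholders):

# 4 KiB writes vs 4 MiB writes behave very differently
rados bench -p testpool 60 write -b 4096 -t 32
rados bench -p testpool 60 write -b 4194304 -t 32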
>
> Is there anything that obviously stands out as severely unbalanced? The 
> R720XD comes with a H710 - instead of putting them in RAID0, I'm thinking a 
> different HBA might be a better idea, any recommendations please?
Don't know that HBA. Does it support pass through mode or HBA mode?
>
> Regards,
> Nick
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Paul Emmerich
Another great thing about lots of small servers vs. few big servers is that
you can use erasure coding.
You can save a lot of money by using erasure coding, but performance will
have to be evaluated
for your use case.

I'm working with several clusters that are 8-12 servers with 6-10 SSDs each
running erasure coding
for VMs with RBD. They perform surprisingly well: ~6-10k IOPS with ~30% cpu
load and ~30%
disk IO load.

But that requires at least 7 servers for a reasonable setup and some good
benchmarking to evaluate
it for your scenario. Especially the tail latencies can be prohibitive
sometimes.

Paul

2018-06-20 14:09 GMT+02:00 Wido den Hollander :

>
>
> On 06/20/2018 02:00 PM, Robert Sander wrote:
> > On 20.06.2018 13:58, Nick A wrote:
> >
> >> We'll probably add another 2 OSD drives per month per node until full
> >> (24 SSD's per node), at which point, more nodes.
> >
> > I would add more nodes earlier to achieve better overall performance.
>
> Exactly. Not only performance, but also failure domain.
>
> In a smaller setup I would always choose a 1U node with 8 ~ 10 SSDs per
> node.
>
> Wido
>
> >
> > Regards
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Blair Bethwaite
This is true, but it misses the point that the OP is talking about old
hardware already - you're not going to save much money by removing a
second-hand CPU from a system.

On Wed, 20 Jun 2018 at 22:10, Wido den Hollander  wrote:

>
>
> On 06/20/2018 02:00 PM, Robert Sander wrote:
> > On 20.06.2018 13:58, Nick A wrote:
> >
> >> We'll probably add another 2 OSD drives per month per node until full
> >> (24 SSD's per node), at which point, more nodes.
> >
> > I would add more nodes earlier to achieve better overall performance.
>
> Exactly. Not only performance, but also failure domain.
>
> In a smaller setup I would always choose a 1U node with 8 ~ 10 SSDs per
> node.
>
> Wido
>
> >
> > Regards
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Wido den Hollander


On 06/20/2018 02:00 PM, Robert Sander wrote:
> On 20.06.2018 13:58, Nick A wrote:
> 
>> We'll probably add another 2 OSD drives per month per node until full
>> (24 SSD's per node), at which point, more nodes.
> 
> I would add more nodes earlier to achieve better overall performance.

Exactly. Not only performance, but also failure domain.

In a smaller setup I would always choose a 1U node with 8 ~ 10 SSDs per
node.

Wido

> 
> Regards
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Paul Emmerich
* More small servers give better performance than a few big servers - maybe
twice the number of servers with half the disks, CPUs and RAM each
* 2x 10 Gbit is usually enough, especially with more servers; the network will
rarely be the bottleneck (unless you have extreme bandwidth requirements)
* maybe save money by using normal Ethernet unless you already have IB
infrastructure around
* you might need to reduce the BlueStore cache size a little (the default is
3GB for SSDs) since you are running with 4GB of RAM per OSD - which is fine,
it just needs a bit of tuning (see the sketch after this list)
* the SM863a is a great disk, good choice. NVMe DB disks are not needed here
* RAID controllers are evil in most cases, configure them as JBOD
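
A minimal ceph.conf sketch for that cache tweak (the option name is the
Luminous/Mimic one; the 2 GiB value is only an illustration, not a tuned
recommendation):

[osd]
bluestore_cache_size_ssd = 2147483648   # ~2 GiB per OSD instead of the 3 GiB default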



Paul

2018-06-20 13:58 GMT+02:00 Nick A :

> Hello Everyone,
>
> We're planning a small cluster on a budget, and I'd like to request any
> feedback or tips.
>
> 3x Dell R720XD with:
> 2x Xeon E5-2680v2 or very similar
> 96GB RAM
> 2x Samsung SM863 240GB boot/OS drives
> 4x Samsung SM863 960GB OSD drives
> Dual 40/56Gbit Infiniband using IPoIB.
>
> 3 replica, MON on OSD nodes, RBD only (no object or CephFS).
>
> We'll probably add another 2 OSD drives per month per node until full (24
> SSD's per node), at which point, more nodes. We've got a few SM863's in
> production on other system and are seriously impressed with them, so would
> like to use them for Ceph too.
>
> We're hoping this is going to provide a decent amount of IOPS, 20k would
> be ideal. I'd like to avoid NVMe Journals unless it's going to make a truly
> massive difference. Same with carving up the SSD's, would rather not, and
> just keep it as simple as possible.
>
> Is there anything that obviously stands out as severely unbalanced? The
> R720XD comes with a H710 - instead of putting them in RAID0, I'm thinking a
> different HBA might be a better idea, any recommendations please?
>
> Regards,
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Planning all flash cluster

2018-06-20 Thread Robert Sander
On 20.06.2018 13:58, Nick A wrote:

> We'll probably add another 2 OSD drives per month per node until full
> (24 SSD's per node), at which point, more nodes.

I would add more nodes earlier to achieve better overall performance.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Planning all flash cluster

2018-06-20 Thread Nick A
Hello Everyone,

We're planning a small cluster on a budget, and I'd like to request any
feedback or tips.

3x Dell R720XD with:
2x Xeon E5-2680v2 or very similar
96GB RAM
2x Samsung SM863 240GB boot/OS drives
4x Samsung SM863 960GB OSD drives
Dual 40/56Gbit Infiniband using IPoIB.

3 replica, MON on OSD nodes, RBD only (no object or CephFS).

We'll probably add another 2 OSD drives per month per node until full (24
SSDs per node), at which point we'll add more nodes. We've got a few SM863s in
production on other systems and are seriously impressed with them, so we would
like to use them for Ceph too.

We're hoping this is going to provide a decent amount of IOPS; 20k would be
ideal. I'd like to avoid NVMe journals unless they make a truly massive
difference. The same goes for carving up the SSDs - I'd rather not, and would
just keep it as simple as possible.
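
For context, the sort of fio run we'd use to check the 20k IOPS figure against
a test RBD image would be something like this (pool and image names are
placeholders):

fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=benchimage \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --runtime=300 --time_based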

Is there anything that obviously stands out as severely unbalanced? The
R720XD comes with an H710 - instead of putting the drives in RAID0, I'm
thinking a different HBA might be a better idea; any recommendations, please?

Regards,
Nick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW Index rapidly expanding post tunables update (12.2.5)

2018-06-20 Thread Tom W
Hi all,

We have recently upgraded from Jewel (10.2.10) to Luminous (12.2.5), and after
this we decided to update our tunables configuration to optimal, having
previously been on the Firefly profile. During this process we noticed the
(BlueStore) OSDs backing the RGW index and GC pools filling up rapidly. We
estimated the index would consume around 30G of space and the GC pool a
negligible amount, but they are now filling all 4 OSDs per host, each of which
is a 2TB SSD.

Does anyone have any experience with this, or with how to determine why this
sudden growth occurred during the recovery triggered by the tunables update?

We have disabled resharding activity due to this issue
(https://tracker.ceph.com/issues/24551), and our GC queue only contains a few
items at present.
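
(For reference, the resharding and GC state mentioned above can be inspected
with roughly the following; output omitted here:)

radosgw-admin reshard list            # pending resharding jobs
radosgw-admin gc list --include-all   # objects queued for garbage collection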

Kind Regards,

Tom



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EPEL dependency on CENTOS

2018-06-20 Thread Bernhard Dick

Hi,

I'm experimenting with Ceph and have seen that ceph-deploy and ceph-ansible
require the EPEL repositories when installing Ceph on CentOS hosts. Due to the
nature of the EPEL repos this might cause trouble (e.g. when combining Ceph
with oVirt on the same host). When using the Ceph repos from the CentOS
Storage SIG, EPEL is not needed, so I'm asking whether the official
installation methods really need to pull in the EPEL repository.
My solution was to change the centos_package_dependencies list in the
ceph-ansible tree, replacing epel-release with centos-release-ceph-luminous,
and to set ceph_origin to distro.
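
Sketched out, the change boils down to something like this (the location of
the dependency list varies between ceph-ansible releases):

yum install centos-release-ceph-luminous   # pulls Ceph from the CentOS Storage SIG

# group_vars/all.yml
ceph_origin: distro
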
Would it be an idea to add direct support for the CentOS SIG packages to
ceph-ansible, or to decide whether to install EPEL based on the package
repository in use?


  Regards
Bernhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-20 Thread Brian :
Hi Wladimir,

A combination of a fairly low clock speed, erasure coding, a single node
and SATA spinners is probably not going to lead to a great evaluation. Some
of the experts will chime in here with answers to your specific questions,
I'm sure, but this test really isn't ever going to give great results.

Brian

On Wed, Jun 20, 2018 at 8:28 AM, Wladimir Mutel  wrote:
> Dear all,
>
> I set up a minimal 1-node Ceph cluster to evaluate its performance. We
> tried to save as much as possible on the hardware, so now the box has Asus
> P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 8x3TB
> HDDs (WD30EFRX) connected to on-board SATA ports. Also we are trying to save
> on storage redundancy, so for most of our RBD images we use erasure-coded
> data-pool (default profile, jerasure 2+1) instead of 3x replication. I
> started with Luminous/Xenial 12.2.5 setup which initialized my OSDs as
> Bluestore during deploy, then updated it to Mimic/Bionic 13.2.0. Base OS is
> Ubuntu 18.04 with kernel updated to 4.17.2 from Ubuntu mainline PPA.
>
> With this setup, I created a number of RBD images to test iSCSI, rbd-nbd
> and QEMU+librbd performance (running QEMU VMs on the same box). And that
> worked moderately well as far as data volume transferred within one session
> was limited. The fastest transfers I had with 'rbd import' which pulled an
> ISO image file at up to 25 MBytes/sec from the remote CIFS share over
> Gigabit Ethernet and stored it into EC data-pool. Windows 2008 R2 & 2016
> setup, update installation, Win 2008 upgrade to 2012 and to 2016 within QEMU
> VM also went through tolerably well. I found that cache=writeback gives the
> best performance with librbd, unlike cache=unsafe which gave the best
> performance with VMs on plain local SATA drives. Also I have a subjective
> feeling (not confirmed by exact measurements) that providing a huge
> libRBD cache (like, cache size = 1GB, max dirty = 7/8GB, max dirty age = 60)
> improved Windows VM performance on bursty writes (like, during Windows
> update installations) as well as on reboots (due to cached reads).
>
> Now, what discouraged me, was my next attempt to clone an NTFS partition
> of ~2TB from a physical drive (via USB3-SATA3 convertor) to a partition on
> an RBD image. I tried to map RBD image with rbd-nbd either locally or
> remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone
> was about 8 MBytes/sec. Which means that it could spend up to 3 days copying
> these ~2TB of NTFS data. I thought about running
> ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite
> a part of existing RBD image starting from certain offset, so I decided this
> was not a solution in my situation. Now I am thinking about taking out one
> of OSDs and using it as a 'bcache' for this operation, but I am not sure how
> good is bcache performance with cache on rotating HDD. I know that keeping
> OSD logs and RocksDB on the same HDD creates a seeky workload which hurts
> overall transfer performance.
>
> Also I am thinking about a number of next-close possibilities, and I
> would like to hear your opinions on the benefits and drawbacks of each of
> them.
>
> 1. Would iSCSI access to that RBD image improve my performance (compared
> to rbd-nbd) ? I did not check that yet, but I noticed that Windows
> transferred about 2.5 MBytes/sec while formatting NTFS volume on this RBD
> attached to it by iSCSI. So, for seeky/sparse workloads like NTFS formatting
> the performance was not great.
>
> 2. Would it help to run ntfsclone in Linux VM, with RBD image accessed
> through QEMU+librbd ? (also going to measure that myself)
>
> 3. Is there any performance benefits in using Ceph cache-tier pools with
> my setup ? I hear now use of this technique is advised against, no?
>
> 4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU,
> 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in
> perfectly working condition) which can be stuffed with up to 6 SATA HDDs and
> added to this Ceph cluster, so far with only Gigabit network interconnect.
> Like, move 4 OSDs out of first box into it, to have 2 boxes with 4 HDDs
> each. Is this going to improve Ceph performance with the setup described
> above ?
>
> 5. I hear that RAID controllers like Adaptec 5805, LSI 2108 provide
> better performance with SATA HDDs exported as JBODs than onboard SATA AHCI
> controllers due to more aggressive caching and reordering requests. Is this
> true ?
>
> 6. On the local market we can buy Kingston KC1000/960GB NVMe drive for
> moderately reasonable price. Its specification has rewrite limit of 1 PB and
> 0.58 DWPD (drive rewrite per day). Is there any counterindications against
> using it in production Ceph setup (i.e., too low rewrite limit, look for
> 8+PB) ? What is the difference between using it as a 'bcache' os as
> specifically-designed OSD log+rocksdb storage ? Can it 

[ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-20 Thread Wladimir Mutel

Dear all,

I set up a minimal 1-node Ceph cluster to evaluate its performance. 
We tried to save as much as possible on the hardware, so now the box has 
Asus P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 
8x3TB HDDs (WD30EFRX) connected to on-board SATA ports. Also we are 
trying to save on storage redundancy, so for most of our RBD images we 
use erasure-coded data-pool (default profile, jerasure 2+1) instead of 
3x replication. I started with Luminous/Xenial 12.2.5 setup which 
initialized my OSDs as Bluestore during deploy, then updated it to 
Mimic/Bionic 13.2.0. Base OS is Ubuntu 18.04 with kernel updated to 
4.17.2 from Ubuntu mainline PPA.


With this setup, I created a number of RBD images to test iSCSI, 
rbd-nbd and QEMU+librbd performance (running QEMU VMs on the same box). 
And that worked moderately well as long as the data volume transferred within
one session was limited. The fastest transfers I had were with 'rbd import',
which pulled an ISO image file at up to 25 MBytes/sec from the remote 
CIFS share over Gigabit Ethernet and stored it into EC data-pool. 
Windows 2008 R2 & 2016 setup, update installation, Win 2008 upgrade to 
2012 and to 2016 within QEMU VM also went through tolerably well. I 
found that cache=writeback gives the best performance with librbd, 
unlike cache=unsafe which gave the best performance with VMs on plain 
local SATA drives. Also I have a subjective feeling (not confirmed by 
exact measurements) that providing a huge libRBD cache (like, 
cache size = 1GB, max dirty = 7/8GB, max dirty age = 60) improved 
Windows VM performance on bursty writes (like, during Windows update 
installations) as well as on reboots (due to cached reads).
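
(For reference, the libRBD cache settings I mean look roughly like this under
[client] in ceph.conf; the values are illustrative, not tuned recommendations:)

[client]
rbd cache = true
rbd cache size = 1073741824              # 1 GiB
rbd cache max dirty = 939524096          # ~7/8 of the cache
rbd cache max dirty age = 60
rbd cache writethrough until flush = true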


Now, what discouraged me was my next attempt to clone an NTFS partition of
~2TB from a physical drive (via a USB3-SATA3 convertor) to a partition on an
RBD image. I tried to map the RBD image with rbd-nbd, either locally or
remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone
was about 8 MBytes/sec, which means it could take up to 3 days to copy these
~2TB of NTFS data. I thought about running
ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite
a part of an existing RBD image starting from a certain offset, so I decided
this was not a solution in my situation. Now I am thinking about taking out
one of the OSDs and using it as a 'bcache' for this operation, but I am not
sure how good bcache performance is with its cache on a rotating HDD. I know
that keeping OSD logs and RocksDB on the same HDD creates a seeky workload
which hurts overall transfer performance.
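
(To be concrete, the attempt looked roughly like this; pool and image names
are placeholders, and /dev/nbd0p1 assumes the image already carries a
partition table:)

rbd-nbd map rbdpool/ntfs-image              # shows up as e.g. /dev/nbd0
ntfsclone --overwrite /dev/nbd0p1 /dev/sdX1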


Also, I am thinking about a number of near-term possibilities, and I would
like to hear your opinions on the benefits and drawbacks of each of them.


1. Would iSCSI access to that RBD image improve my performance (compared to
rbd-nbd)? I have not checked that yet, but I noticed that Windows transferred
about 2.5 MBytes/sec while formatting an NTFS volume on this RBD attached to
it via iSCSI. So, for seeky/sparse workloads like NTFS formatting the
performance was not great.


2. Would it help to run ntfsclone in a Linux VM, with the RBD image accessed
through QEMU+librbd? (I am also going to measure that myself.)


3. Are there any performance benefits to using Ceph cache-tier pools with my
setup? I hear the use of this technique is now advised against, no?


4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU,
32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in
perfectly working condition) which can be stuffed with up to 6 SATA HDDs and
added to this Ceph cluster, so far with only a Gigabit network interconnect.
For example, move 4 OSDs out of the first box into it, to have 2 boxes with
4 HDDs each. Is this going to improve Ceph performance with the setup
described above?


5. I hear that RAID controllers like the Adaptec 5805 or LSI 2108 provide
better performance with SATA HDDs exported as JBODs than onboard SATA AHCI
controllers do, due to more aggressive caching and request reordering. Is
this true?


6. On the local market we can buy a Kingston KC1000/960GB NVMe drive for a
moderately reasonable price. Its specification has a rewrite limit of 1 PB
and 0.58 DWPD (drive writes per day). Are there any contraindications
against using it in a production Ceph setup (i.e., is the rewrite limit too
low - should I look for 8+ PB)? What is the difference between using it as a
'bcache' or as specifically-designed OSD log+RocksDB storage? Can it be used
as a single shared partition for all OSD daemons, or will it require
splitting into 8 separate partitions?
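
(If it were used as dedicated BlueStore DB storage, my understanding is that
each OSD gets its own DB partition or LV, created roughly like this; the
device names are hypothetical:)

ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/nvme0n1p1
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2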


Thank you in advance for your replies.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] separate monitoring node

2018-06-20 Thread Konstantin Shalygin

Hi,

at the moment we use Icinga2, check_ceph* and Telegraf with the Ceph plugin.
I'm asking what I need in order to have a separate host that knows everything
about the Ceph cluster health. The reason is that each OSD node has mostly
the exact same data, which is transmitted into our database (like InfluxDB or
MySQL) and wastes space. Also, if something is going on, we get alerts for
each OSD.

So my idea is to have a separate VM (on an external host) and to use only this
host for monitoring the global cluster state and measurements. Is it correct
that I only need to run mon and mgr as services there? Or should I do
monitoring in a different way?

cu denny



I use these together:

1. mgr/prometheus module for your Grafana;

2. https://github.com/Crapworks/ceph-dash + 
https://github.com/Crapworks/check_ceph_dash for monitoring cluster events;



The first does not need cephx; the second works with python-rados and
connects to the cluster directly.
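
(Enabling the first one is a single command; the exporter then listens on
TCP port 9283 by default, which Prometheus or Telegraf can scrape:)

ceph mgr module enable prometheus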





k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com