Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread David Turner
I would recommend sticking with the weight of 9.09560 for the osds, as that
is the TiB size of the osds that ceph defaults to, as opposed to the TB size
of the osds. New osds will have their weights based on the TiB value. What
does your `ceph osd df` output look like? Hopefully very healthy.
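
For reference, a rough sketch of the TB-to-TiB arithmetic behind a weight
like 9.09560 (assuming a nominal 10 TB drive; the exact figure depends on
the usable capacity ceph detects when the OSD is created):

$ python -c 'print(10 * 10**12 / 2.0**40)'   # nominal 10 TB expressed in TiB, ~9.0949
$ ceph osd df                                # per-OSD crush weight, utilization and PG counts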

On Tue, Jul 18, 2017, 11:16 PM Roger Brown  wrote:

> Resolution confirmed!
>
> $ ceph -s
>   cluster:
> id: eea7b78c-b138-40fc-9f3e-3d77afb770f0
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum desktop,mon1,nuc2
> mgr: desktop(active), standbys: mon1
> osd: 3 osds: 3 up, 3 in
>
>   data:
> pools:   19 pools, 372 pgs
> objects: 54243 objects, 71722 MB
> usage:   129 GB used, 27812 GB / 27941 GB avail
> pgs: 372 active+clean
>
>
> On Tue, Jul 18, 2017 at 8:47 PM Roger Brown  wrote:
>
>> Ah, that was the problem!
>>
>> So I edited the crushmap (
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a
>> weight of 10.000 for all three 10TB OSD hosts. The instant result was all
>> those pgs with only 2 OSDs were replaced with 3 OSDs while the cluster
>> started rebalancing the data. I trust it will complete with time and I'll
>> be good to go!
>>
>> New OSD tree:
>> $ ceph osd tree
>> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 30.0 root default
>> -5 10.0 host osd1
>>  3 10.0 osd.3  up  1.0  1.0
>> -6 10.0 host osd2
>>  4 10.0 osd.4  up  1.0  1.0
>> -2 10.0 host osd3
>>  0 10.0 osd.0  up  1.0  1.0
>>
>> Kudos to Brad Hubbard for steering me in the right direction!
>>
>>
>> On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard  wrote:
>>
>>> ID WEIGHT   TYPE NAME
>>> -5  1.0 host osd1
>>> -6  9.09560 host osd2
>>> -2  9.09560 host osd3
>>>
>>> The weight allocated to host "osd1" should presumably be the same as
>>> the other two hosts?
>>>
>>> Dump your crushmap and take a good look at it, specifically the
>>> weighting of "osd1".
>>>
>>>
>>> On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
>>> wrote:
>>> > I also tried ceph pg query, but it gave no helpful recommendations for
>>> any
>>> > of the stuck pgs.
>>> >
>>> >
>>> > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
>>> wrote:
>>> >>
>>> >> Problem:
>>> >> I have some pgs with only two OSDs instead of 3 like all the other pgs
>>> >> have. This is causing active+undersized+degraded status.
>>> >>
>>> >> History:
>>> >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a
>>> 1TB
>>> >> drive.
>>> >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
>>> >> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
>>> >> wait, stop, remove, del osd, rm).
>>> >> 4. The last OSD to be removed would never return to active+clean after
>>> >> reweight 0. It returned undersized instead, but I went on with removal
>>> >> anyway, leaving me stuck with 5 undersized pgs.
>>> >>
>>> >> Things tried that didn't help:
>>> >> * give it time to go away on its own
>>> >> * Replace replicated default.rgw.buckets.data pool with erasure-code
>>> 2+1
>>> >> version.
>>> >> * ceph osd lost 1 (and 2)
>>> >> * ceph pg repair (pgs from dump_stuck)
>>> >> * googled 'ceph pg undersized' and similar searches for help.
>>> >>
>>> >> Current status:
>>> >> $ ceph osd tree
>>> >> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> >> -1 19.19119 root default
>>> >> -5  1.0 host osd1
>>> >>  3  1.0 osd.3  up  1.0  1.0
>>> >> -6  9.09560 host osd2
>>> >>  4  9.09560 osd.4  up  1.0  1.0
>>> >> -2  9.09560 host osd3
>>> >>  0  9.09560 osd.0  up  1.0  1.0
>>> >> $ ceph pg dump_stuck
>>> >> ok
>>> >> PG_STAT STATE  UPUP_PRIMARY ACTING
>>> ACTING_PRIMARY
>>> >> 88.3active+undersized+degraded [4,0]  4  [4,0]
>>>   4
>>> >> 97.3active+undersized+degraded [4,0]  4  [4,0]
>>>   4
>>> >> 85.6active+undersized+degraded [4,0]  4  [4,0]
>>>   4
>>> >> 87.5active+undersized+degraded [0,4]  0  [0,4]
>>>   0
>>> >> 70.0active+undersized+degraded [0,4]  0  [0,4]
>>>   0
>>> >> $ ceph osd pool ls detail
>>> >> pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>>> >> object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags
>>> hashpspool
>>> >> stripe_width 0
>>> >> pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>>> >> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576
>>> owner
>>> >> 18446744073709551615 flags hashpspool stripe_width 0
>>> >> pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule
>>> 0
>>> >> object_hash rjenkins pg_num 8 pgp_num 8 

[ceph-users] Re: How's cephfs going?

2017-07-18 Thread 许雪寒
Is there anyone else willing to share some usage information about cephfs?
Could the developers tell us whether cephfs is a major effort in the overall ceph
development?

From: 许雪寒
Sent: July 17, 2017 11:00
To: ceph-users@lists.ceph.com
Subject: How's cephfs going?

Hi, everyone.

We intend to use the Jewel version of cephfs; however, we don't know its status. Is 
it production ready in Jewel? Does it still have lots of bugs? Is it a major 
effort of the current ceph development? And who is using cephfs now?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving OSD node from root bucket to defined 'rack' bucket

2017-07-18 Thread David Turner
I would still always recommend having at least n+1 failure domains in
any production cluster, where n is your replica size.

On Tue, Jul 18, 2017, 11:20 PM David Turner  wrote:

> You do not need to empty the host before moving it in the crush map.  It
> will just cause data movement because you are removing an item under root
> and changing the crush weight of the rack.  There is no way I am aware of
> to really ease into this data movement other than to face it head on and
> utilize osd_max_backfills to control disk io in your cluster.
>
> Are you changing your failure domain to rack from host after this is
> done? Changing that in the crush map will cause everything to peer at once
> and then instigate a lot of data movement. You can do both moving the hosts
> into their racks and change the failure domain in the same update to only
> move data once. You would do that by downloading the crush map, modifying
> it, and then uploading it back into the cluster. It would be smart to test
> this on a test cluster. You could even do it on a 3 node cluster by
> changing each node to its own rack and setting the failure domain to rack.
>
> On Tue, Jul 18, 2017, 7:06 PM Mike Cave  wrote:
>
>> Greetings,
>>
>>
>>
>> I’m trying to figure out the best way to move our hosts from the
>> root=default bucket into their rack buckets.
>>
>>
>>
>> Our crush map has the notion of three racks which will hold all of our
>> osd nodes.
>>
>>
>>
>> As we have added new nodes, we have assigned them to their correct rack
>> location in the map. However, when the cluster was first conceived, the
>> majority of the nodes were left in the default bucket.
>>
>>
>>
>> Now I would like to move them into their correct rack buckets.
>>
>>
>>
>> I have a feeling I know the answer to this question, but I thought I’d
>> ask and hopefully be pleasantly surprised.
>>
>>
>>
>> Can I move a host from the root bucket into the correct rack without
>> draining it and then refilling it or do I need to reweight the host to 0,
>> move the host to the correct bucket, and then reweight it back to its
>> correct value?
>>
>>
>>
>> Any insights here will be appreciated.
>>
>>
>>
>> Thank you for your time,
>>
>> Mike Cave
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving OSD node from root bucket to defined 'rack' bucket

2017-07-18 Thread David Turner
You do not need to empty the host before moving it in the crush map.  It
will just cause data movement because you are removing an item under root
and changing the crush weight of the rack.  There is no way I am aware of
to really ease into this data movement other than to face it head on and
utilize osd_max_backfills to control disk io in your cluster.

Are you changing your failure domain to rack from host after this is
done? Changing that in the crush map will cause everything to peer at once
and then instigate a lot of data movement. You can do both moving the hosts
into their racks and change the failure domain in the same update to only
move data once. You would do that by downloading the crush map, modifying
it, and then uploading it back into the cluster. It would be smart to test
this on a test cluster. You could even do it on a 3 node cluster by
changing each node to its own rack and setting the failure domain to rack.
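
If it helps, a rough sketch of that download/modify/upload cycle (file names
here are placeholders, and the rule edit is only an illustration of the
failure-domain change):

$ ceph osd getcrushmap -o crush.bin
$ crushtool -d crush.bin -o crush.txt
# edit crush.txt: move the host entries under their rack buckets and, if also
# changing the failure domain, switch "step chooseleaf firstn 0 type host"
# to "... type rack" in the relevant rule
$ crushtool -c crush.txt -o crush.new
$ ceph osd setcrushmap -i crush.new
$ ceph tell osd.* injectargs '--osd-max-backfills 1'   # throttle the resulting backfill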

On Tue, Jul 18, 2017, 7:06 PM Mike Cave  wrote:

> Greetings,
>
>
>
> I’m trying to figure out the best way to move our hosts from the
> root=default bucket into their rack buckets.
>
>
>
> Our crush map has the notion of three racks which will hold all of our osd
> nodes.
>
>
>
> As we have added new nodes, we have assigned them to their correct rack
> location in the map. However, when the cluster was first conceived, the
> majority of the nodes were left in the default bucket.
>
>
>
> Now I would like to move them into their correct rack buckets.
>
>
>
> I have a feeling I know the answer to this question, but I thought I’d ask
> and hopefully be pleasantly surprised.
>
>
>
> Can I move a host from the root bucket into the correct rack without
> draining it and then refilling it or do I need to reweight the host to 0,
> move the host to the correct bucket, and then reweight it back to its
> correct value?
>
>
>
> Any insights here will be appreciated.
>
>
>
> Thank you for your time,
>
> Mike Cave
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread Roger Brown
Resolution confirmed!

$ ceph -s
  cluster:
id: eea7b78c-b138-40fc-9f3e-3d77afb770f0
health: HEALTH_OK

  services:
mon: 3 daemons, quorum desktop,mon1,nuc2
mgr: desktop(active), standbys: mon1
osd: 3 osds: 3 up, 3 in

  data:
pools:   19 pools, 372 pgs
objects: 54243 objects, 71722 MB
usage:   129 GB used, 27812 GB / 27941 GB avail
pgs: 372 active+clean


On Tue, Jul 18, 2017 at 8:47 PM Roger Brown  wrote:

> Ah, that was the problem!
>
> So I edited the crushmap (
> http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a
> weight of 10.000 for all three 10TB OSD hosts. The instant result was all
> those pgs with only 2 OSDs were replaced with 3 OSDs while the cluster
> started rebalancing the data. I trust it will complete with time and I'll
> be good to go!
>
> New OSD tree:
> $ ceph osd tree
> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 30.0 root default
> -5 10.0 host osd1
>  3 10.0 osd.3  up  1.0  1.0
> -6 10.0 host osd2
>  4 10.0 osd.4  up  1.0  1.0
> -2 10.0 host osd3
>  0 10.0 osd.0  up  1.0  1.0
>
> Kudos to Brad Hubbard for steering me in the right direction!
>
>
> On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard  wrote:
>
>> ID WEIGHT   TYPE NAME
>> -5  1.0 host osd1
>> -6  9.09560 host osd2
>> -2  9.09560 host osd3
>>
>> The weight allocated to host "osd1" should presumably be the same as
>> the other two hosts?
>>
>> Dump your crushmap and take a good look at it, specifically the
>> weighting of "osd1".
>>
>>
>> On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
>> wrote:
>> > I also tried ceph pg query, but it gave no helpful recommendations for
>> any
>> > of the stuck pgs.
>> >
>> >
>> > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
>> wrote:
>> >>
>> >> Problem:
>> >> I have some pgs with only two OSDs instead of 3 like all the other pgs
>> >> have. This is causing active+undersized+degraded status.
>> >>
>> >> History:
>> >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a
>> 1TB
>> >> drive.
>> >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
>> >> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
>> >> wait, stop, remove, del osd, rm).
>> >> 4. The last OSD to be removed would never return to active+clean after
>> >> reweight 0. It returned undersized instead, but I went on with removal
>> >> anyway, leaving me stuck with 5 undersized pgs.
>> >>
>> >> Things tried that didn't help:
>> >> * give it time to go away on its own
>> >> * Replace replicated default.rgw.buckets.data pool with erasure-code
>> 2+1
>> >> version.
>> >> * ceph osd lost 1 (and 2)
>> >> * ceph pg repair (pgs from dump_stuck)
>> >> * googled 'ceph pg undersized' and similar searches for help.
>> >>
>> >> Current status:
>> >> $ ceph osd tree
>> >> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> >> -1 19.19119 root default
>> >> -5  1.0 host osd1
>> >>  3  1.0 osd.3  up  1.0  1.0
>> >> -6  9.09560 host osd2
>> >>  4  9.09560 osd.4  up  1.0  1.0
>> >> -2  9.09560 host osd3
>> >>  0  9.09560 osd.0  up  1.0  1.0
>> >> $ ceph pg dump_stuck
>> >> ok
>> >> PG_STAT STATE  UPUP_PRIMARY ACTING
>> ACTING_PRIMARY
>> >> 88.3active+undersized+degraded [4,0]  4  [4,0]
>>   4
>> >> 97.3active+undersized+degraded [4,0]  4  [4,0]
>>   4
>> >> 85.6active+undersized+degraded [4,0]  4  [4,0]
>>   4
>> >> 87.5active+undersized+degraded [0,4]  0  [0,4]
>>   0
>> >> 70.0active+undersized+degraded [0,4]  0  [0,4]
>>   0
>> >> $ ceph osd pool ls detail
>> >> pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>> >> object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags
>> hashpspool
>> >> stripe_width 0
>> >> pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> >> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576
>> owner
>> >> 18446744073709551615 flags hashpspool stripe_width 0
>> >> pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 652 flags
>> hashpspool
>> >> stripe_width 0
>> >> pool 86 'default.rgw.data.root' replicated size 3 min_size 2
>> crush_rule 0
>> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 653 flags
>> hashpspool
>> >> stripe_width 0
>> >> pool 87 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 654 flags
>> hashpspool
>> >> stripe_width 0
>> >> pool 88 'default.rgw.lc' replicated size 3 min_size 2 crush_rule 0
>> >> object_hash rjenkins pg_num 8 pgp_num 8 

Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread Roger Brown
Ah, that was the problem!

So I edited the crushmap (
http://docs.ceph.com/docs/master/rados/operations/crush-map/) with a weight
of 10.000 for all three 10TB OSD hosts. The instant result was all those
pgs with only 2 OSDs were replaced with 3 OSDs while the cluster started
rebalancing the data. I trust it will complete with time and I'll be good
to go!
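
For anyone else hitting this, an equivalent fix without a full crushmap edit
would be something like the following sketch (osd.3 and the 9.09560 figure
come from the earlier tree quoted below):

$ ceph osd crush reweight osd.3 9.09560   # bucket weights roll up, so host osd1 follows
$ ceph osd tree                           # confirm the host weights now match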

New OSD tree:
$ ceph osd tree
ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 30.0 root default
-5 10.0 host osd1
 3 10.0 osd.3  up  1.0  1.0
-6 10.0 host osd2
 4 10.0 osd.4  up  1.0  1.0
-2 10.0 host osd3
 0 10.0 osd.0  up  1.0  1.0

Kudos to Brad Hubbard for steering me in the right direction!


On Tue, Jul 18, 2017 at 8:27 PM Brad Hubbard  wrote:

> ID WEIGHT   TYPE NAME
> -5  1.0 host osd1
> -6  9.09560 host osd2
> -2  9.09560 host osd3
>
> The weight allocated to host "osd1" should presumably be the same as
> the other two hosts?
>
> Dump your crushmap and take a good look at it, specifically the
> weighting of "osd1".
>
>
> On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown 
> wrote:
> > I also tried ceph pg query, but it gave no helpful recommendations for
> any
> > of the stuck pgs.
> >
> >
> > On Tue, Jul 18, 2017 at 7:45 PM Roger Brown 
> wrote:
> >>
> >> Problem:
> >> I have some pgs with only two OSDs instead of 3 like all the other pgs
> >> have. This is causing active+undersized+degraded status.
> >>
> >> History:
> >> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a
> 1TB
> >> drive.
> >> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
> >> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
> >> wait, stop, remove, del osd, rm).
> >> 4. The last OSD to be removed would never return to active+clean after
> >> reweight 0. It returned undersized instead, but I went on with removal
> >> anyway, leaving me stuck with 5 undersized pgs.
> >>
> >> Things tried that didn't help:
> >> * give it time to go away on its own
> >> * Replace replicated default.rgw.buckets.data pool with erasure-code 2+1
> >> version.
> >> * ceph osd lost 1 (and 2)
> >> * ceph pg repair (pgs from dump_stuck)
> >> * googled 'ceph pg undersized' and similar searches for help.
> >>
> >> Current status:
> >> $ ceph osd tree
> >> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >> -1 19.19119 root default
> >> -5  1.0 host osd1
> >>  3  1.0 osd.3  up  1.0  1.0
> >> -6  9.09560 host osd2
> >>  4  9.09560 osd.4  up  1.0  1.0
> >> -2  9.09560 host osd3
> >>  0  9.09560 osd.0  up  1.0  1.0
> >> $ ceph pg dump_stuck
> >> ok
> >> PG_STAT STATE  UPUP_PRIMARY ACTING
> ACTING_PRIMARY
> >> 88.3active+undersized+degraded [4,0]  4  [4,0]
> 4
> >> 97.3active+undersized+degraded [4,0]  4  [4,0]
> 4
> >> 85.6active+undersized+degraded [4,0]  4  [4,0]
> 4
> >> 87.5active+undersized+degraded [0,4]  0  [0,4]
> 0
> >> 70.0active+undersized+degraded [0,4]  0  [0,4]
> 0
> >> $ ceph osd pool ls detail
> >> pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
> >> object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags hashpspool
> >> stripe_width 0
> >> pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
> >> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576
> owner
> >> 18446744073709551615 flags hashpspool stripe_width 0
> >> pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 652 flags hashpspool
> >> stripe_width 0
> >> pool 86 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule
> 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 653 flags hashpspool
> >> stripe_width 0
> >> pool 87 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 654 flags hashpspool
> >> stripe_width 0
> >> pool 88 'default.rgw.lc' replicated size 3 min_size 2 crush_rule 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 600 flags hashpspool
> >> stripe_width 0
> >> pool 89 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 655 flags hashpspool
> >> stripe_width 0
> >> pool 90 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule
> 0
> >> object_hash rjenkins pg_num 8 pgp_num 8 last_change 662 flags hashpspool
> >> stripe_width 0
> >> pool 91 'default.rgw.users.email' replicated size 3 min_size 2
> crush_rule
> >> 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 660 flags
> hashpspool
> >> stripe_width 0
> >> pool 92 'default.rgw.users.keys' replicated 

Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread Brad Hubbard
ID WEIGHT   TYPE NAME
-5  1.0 host osd1
-6  9.09560 host osd2
-2  9.09560 host osd3

The weight allocated to host "osd1" should presumably be the same as
the other two hosts?

Dump your crushmap and take a good look at it, specifically the
weighting of "osd1".
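
Something like this should do it (the output file names are arbitrary):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
$ grep -A 6 'host osd1' crushmap.txt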


On Wed, Jul 19, 2017 at 11:48 AM, Roger Brown  wrote:
> I also tried ceph pg query, but it gave no helpful recommendations for any
> of the stuck pgs.
>
>
> On Tue, Jul 18, 2017 at 7:45 PM Roger Brown  wrote:
>>
>> Problem:
>> I have some pgs with only two OSDs instead of 3 like all the other pgs
>> have. This is causing active+undersized+degraded status.
>>
>> History:
>> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a 1TB
>> drive.
>> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
>> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
>> wait, stop, remove, del osd, rm).
>> 4. The last OSD to be removed would never return to active+clean after
>> reweight 0. It returned undersized instead, but I went on with removal
>> anyway, leaving me stuck with 5 undersized pgs.
>>
>> Things tried that didn't help:
>> * give it time to go away on its own
>> * Replace replicated default.rgw.buckets.data pool with erasure-code 2+1
>> version.
>> * ceph osd lost 1 (and 2)
>> * ceph pg repair (pgs from dump_stuck)
>> * googled 'ceph pg undersized' and similar searches for help.
>>
>> Current status:
>> $ ceph osd tree
>> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 19.19119 root default
>> -5  1.0 host osd1
>>  3  1.0 osd.3  up  1.0  1.0
>> -6  9.09560 host osd2
>>  4  9.09560 osd.4  up  1.0  1.0
>> -2  9.09560 host osd3
>>  0  9.09560 osd.0  up  1.0  1.0
>> $ ceph pg dump_stuck
>> ok
>> PG_STAT STATE  UPUP_PRIMARY ACTING ACTING_PRIMARY
>> 88.3active+undersized+degraded [4,0]  4  [4,0]  4
>> 97.3active+undersized+degraded [4,0]  4  [4,0]  4
>> 85.6active+undersized+degraded [4,0]  4  [4,0]  4
>> 87.5active+undersized+degraded [0,4]  0  [0,4]  0
>> 70.0active+undersized+degraded [0,4]  0  [0,4]  0
>> $ ceph osd pool ls detail
>> pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags hashpspool
>> stripe_width 0
>> pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576 owner
>> 18446744073709551615 flags hashpspool stripe_width 0
>> pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 652 flags hashpspool
>> stripe_width 0
>> pool 86 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 653 flags hashpspool
>> stripe_width 0
>> pool 87 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 654 flags hashpspool
>> stripe_width 0
>> pool 88 'default.rgw.lc' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 600 flags hashpspool
>> stripe_width 0
>> pool 89 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 655 flags hashpspool
>> stripe_width 0
>> pool 90 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 662 flags hashpspool
>> stripe_width 0
>> pool 91 'default.rgw.users.email' replicated size 3 min_size 2 crush_rule
>> 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 660 flags hashpspool
>> stripe_width 0
>> pool 92 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 8 pgp_num 8 last_change 659 flags hashpspool
>> stripe_width 0
>> pool 93 'default.rgw.buckets.index' replicated size 3 min_size 2
>> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 664 flags
>> hashpspool stripe_width 0
>> pool 95 'default.rgw.intent-log' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 4 pgp_num 4 last_change 656 flags hashpspool
>> stripe_width 0
>> pool 96 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 4 pgp_num 4 last_change 657 flags hashpspool
>> stripe_width 0
>> pool 97 'default.rgw.usage' replicated size 3 min_size 2 crush_rule 0
>> object_hash rjenkins pg_num 4 pgp_num 4 last_change 658 flags hashpspool
>> stripe_width 0
>> pool 98 'default.rgw.users.swift' replicated size 3 min_size 2 crush_rule
>> 0 object_hash rjenkins pg_num 4 pgp_num 4 last_change 661 flags hashpspool
>> stripe_width 0
>> pool 99 

Re: [ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread Roger Brown
I also tried ceph pg query, but it gave no helpful recommendations for any
of the stuck pgs.
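
For the record, the form I used was along these lines, with 88.3 being one of
the stuck pgs from dump_stuck:

$ ceph pg 88.3 query | less   # the "recovery_state" section is usually the interesting part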


On Tue, Jul 18, 2017 at 7:45 PM Roger Brown  wrote:

> Problem:
> I have some pgs with only two OSDs instead of 3 like all the other pgs
> have. This is causing active+undersized+degraded status.
>
> History:
> 1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a 1TB
> drive.
> 2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
> 3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
> wait, stop, remove, del osd, rm).
> 4. The last OSD to be removed would never return to active+clean after
> reweight 0. It returned undersized instead, but I went on with removal
> anyway, leaving me stuck with 5 undersized pgs.
>
> Things tried that didn't help:
> * give it time to go away on its own
> * Replace replicated default.rgw.buckets.data pool with erasure-code 2+1
> version.
> * ceph osd lost 1 (and 2)
> * ceph pg repair (pgs from dump_stuck)
> * googled 'ceph pg undersized' and similar searches for help.
>
> Current status:
> $ ceph osd tree
> ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 19.19119 root default
> -5  1.0 host osd1
>  3  1.0 osd.3  up  1.0  1.0
> -6  9.09560 host osd2
>  4  9.09560 osd.4  up  1.0  1.0
> -2  9.09560 host osd3
>  0  9.09560 osd.0  up  1.0  1.0
> $ ceph pg dump_stuck
> ok
> PG_STAT STATE  UPUP_PRIMARY ACTING ACTING_PRIMARY
> 88.3active+undersized+degraded [4,0]  4  [4,0]  4
> 97.3active+undersized+degraded [4,0]  4  [4,0]  4
> 85.6active+undersized+degraded [4,0]  4  [4,0]  4
> 87.5active+undersized+degraded [0,4]  0  [0,4]  0
> 70.0active+undersized+degraded [0,4]  0  [0,4]  0
> $ ceph osd pool ls detail
> pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags hashpspool
> stripe_width 0
> pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576 owner
> 18446744073709551615 flags hashpspool stripe_width 0
> pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 652 flags hashpspool
> stripe_width 0
> pool 86 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 653 flags hashpspool
> stripe_width 0
> pool 87 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 654 flags hashpspool
> stripe_width 0
> pool 88 'default.rgw.lc' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 600 flags hashpspool
> stripe_width 0
> pool 89 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 655 flags hashpspool
> stripe_width 0
> pool 90 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 662 flags hashpspool
> stripe_width 0
> pool 91 'default.rgw.users.email' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 660 flags hashpspool
> stripe_width 0
> pool 92 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 659 flags hashpspool
> stripe_width 0
> pool 93 'default.rgw.buckets.index' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 664 flags
> hashpspool stripe_width 0
> pool 95 'default.rgw.intent-log' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 656 flags hashpspool
> stripe_width 0
> pool 96 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 657 flags hashpspool
> stripe_width 0
> pool 97 'default.rgw.usage' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 4 pgp_num 4 last_change 658 flags hashpspool
> stripe_width 0
> pool 98 'default.rgw.users.swift' replicated size 3 min_size 2 crush_rule
> 0 object_hash rjenkins pg_num 4 pgp_num 4 last_change 661 flags hashpspool
> stripe_width 0
> pool 99 'default.rgw.buckets.extra' replicated size 3 min_size 2
> crush_rule 0 object_hash rjenkins pg_num 4 pgp_num 4 last_change 663 flags
> hashpspool stripe_width 0
> pool 100 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 4 pgp_num 4 last_change 651 flags hashpspool stripe_width 0
> pool 101 'default.rgw.reshard' replicated size 3 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 8 pgp_num 8 last_change 1529 owner
> 

[ceph-users] undersized pgs after removing smaller OSDs

2017-07-18 Thread Roger Brown
Problem:
I have some pgs with only two OSDs instead of 3 like all the other pgs
have. This is causing active+undersized+degraded status.

History:
1. I started with 3 hosts, each with 1 OSD process (min_size 2) for a 1TB
drive.
2. Added 3 more hosts, each with 1 OSD process for a 10TB drive.
3. Removed the original 3 1TB OSD hosts from the osd tree (reweight 0,
wait, stop, remove, del osd, rm; see the command sketch after this list).
4. The last OSD to be removed would never return to active+clean after
reweight 0. It returned undersized instead, but I went on with removal
anyway, leaving me stuck with 5 undersized pgs.
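
The removal in step 3 expands to roughly these commands (osd.1 is just an
example id; I repeated the sequence for each of the three old OSDs):

$ ceph osd crush reweight osd.1 0     # then wait for the rebalance to finish
$ ceph osd out 1
$ systemctl stop ceph-osd@1
$ ceph osd crush remove osd.1
$ ceph auth del osd.1
$ ceph osd rm 1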

Things tried that didn't help:
* give it time to go away on its own
* Replace replicated default.rgw.buckets.data pool with erasure-code 2+1
version.
* ceph osd lost 1 (and 2)
* ceph pg repair (pgs from dump_stuck)
* googled 'ceph pg undersized' and similar searches for help.

Current status:
$ ceph osd tree
ID WEIGHT   TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 19.19119 root default
-5  1.0 host osd1
 3  1.0 osd.3  up  1.0  1.0
-6  9.09560 host osd2
 4  9.09560 osd.4  up  1.0  1.0
-2  9.09560 host osd3
 0  9.09560 osd.0  up  1.0  1.0
$ ceph pg dump_stuck
ok
PG_STAT STATE  UPUP_PRIMARY ACTING ACTING_PRIMARY
88.3active+undersized+degraded [4,0]  4  [4,0]  4
97.3active+undersized+degraded [4,0]  4  [4,0]  4
85.6active+undersized+degraded [4,0]  4  [4,0]  4
87.5active+undersized+degraded [0,4]  0  [0,4]  0
70.0active+undersized+degraded [0,4]  0  [0,4]  0
$ ceph osd pool ls detail
pool 70 'default.rgw.rgw.gc' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 548 flags hashpspool
stripe_width 0
pool 83 'default.rgw.buckets.non-ec' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 576 owner
18446744073709551615 flags hashpspool stripe_width 0
pool 85 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 652 flags hashpspool
stripe_width 0
pool 86 'default.rgw.data.root' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 653 flags hashpspool
stripe_width 0
pool 87 'default.rgw.gc' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 654 flags hashpspool
stripe_width 0
pool 88 'default.rgw.lc' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 600 flags hashpspool
stripe_width 0
pool 89 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 655 flags hashpspool
stripe_width 0
pool 90 'default.rgw.users.uid' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 662 flags hashpspool
stripe_width 0
pool 91 'default.rgw.users.email' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 660 flags hashpspool
stripe_width 0
pool 92 'default.rgw.users.keys' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 659 flags hashpspool
stripe_width 0
pool 93 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 664 flags hashpspool
stripe_width 0
pool 95 'default.rgw.intent-log' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 656 flags hashpspool
stripe_width 0
pool 96 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 657 flags hashpspool
stripe_width 0
pool 97 'default.rgw.usage' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 658 flags hashpspool
stripe_width 0
pool 98 'default.rgw.users.swift' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 4 pgp_num 4 last_change 661 flags hashpspool
stripe_width 0
pool 99 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_rule
0 object_hash rjenkins pg_num 4 pgp_num 4 last_change 663 flags hashpspool
stripe_width 0
pool 100 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash
rjenkins pg_num 4 pgp_num 4 last_change 651 flags hashpspool stripe_width 0
pool 101 'default.rgw.reshard' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 8 pgp_num 8 last_change 1529 owner
18446744073709551615 flags hashpspool stripe_width 0
pool 103 'default.rgw.buckets.data' erasure size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 256 pgp_num 256 last_change 2106 flags
hashpspool stripe_width 8192

I'll keep on googling, but I'm open to advice!

Thank you,

Roger
___
ceph-users mailing list

[ceph-users] Moving OSD node from root bucket to defined 'rack' bucket

2017-07-18 Thread Mike Cave
Greetings,

I’m trying to figure out the best way to move our hosts from the root=default 
bucket into their rack buckets.

Our crush map has the notion of three racks which will hold all of our osd 
nodes.

As we have added new nodes, we have assigned them to their correct rack 
location in the map. However, when the cluster was first conceived, the 
majority of the nodes were left in the default bucket.

Now I would like to move them into their correct rack buckets.

I have a feeling I know the answer to this question, but I thought I’d ask and 
hopefully be pleasantly surprised.

Can I move a host from the root bucket into the correct rack without draining 
it and then refilling it or do I need to reweight the host to 0, move the host 
to the correct bucket, and then reweight it back to its correct value?

Any insights here will be appreciated.

Thank you for your time,
Mike Cave
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Long OSD restart after upgrade to 10.2.9

2017-07-18 Thread Josh Durgin

On 07/17/2017 10:04 PM, Anton Dmitriev wrote:
My cluster stores more than 1.5 billion objects in RGW; I don't use cephfs.
The bucket index pool is stored on a separate SSD placement, but compaction
occurs on all OSDs, including those that don't contain bucket indexes.
After restarting every OSD 5 times, nothing changed; each of them compacts
again and again.


As an example, here is the omap dir size on one of the OSDs that doesn't
contain bucket indexes:


root@storage01:/var/lib/ceph/osd/ceph-0/current/omap$ ls -l | wc -l
1455
root@storage01:/var/lib/ceph/osd/ceph-0/current/omap$ du -hd1
2,8G

Not so big at first look.


That is smaller than I'd expect to get long delays from compaction.
Could you provide more details on your environment - distro version and 
leveldb version? Has leveldb been updated recently? Perhaps there's some 
commonality between setups hitting this.
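
If it helps narrow it down, something along these lines gathers those details
(adjust the package query for your distro):

$ cat /etc/os-release
$ rpm -q leveldb          # or: dpkg -l | grep leveldb
$ ceph tell osd.0 version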


There weren't any changes in the way ceph was using leveldb from 10.2.7
to 10.2.9 that I could find.

Josh


On 17.07.2017 22:03, Josh Durgin wrote:

Both of you are seeing leveldb perform compaction when the osd starts
up. This can take a while for large amounts of omap data (created by
things like cephfs directory metadata or rgw bucket indexes).

The 'leveldb_compact_on_mount' option wasn't changed in 10.2.9, but 
leveldb will compact automatically if there is enough work to do.


Does restarting an OSD affected by this with 10.2.9 again after it's
completed compaction still have these symptoms?

Josh

On 07/17/2017 05:57 AM, Lincoln Bryant wrote:

Hi Anton,

We observe something similar on our OSDs going from 10.2.7 to 10.2.9 
(see thread "some OSDs stuck down after 10.2.7 -> 10.2.9 update"). 
Some of our OSDs are not working at all on 10.2.9 or die with suicide 
timeouts. Those that come up/in take a very long time to boot up. 
Seems to not affect every OSD in our case though.


--Lincoln

On 7/17/2017 1:29 AM, Anton Dmitriev wrote:
During start it consumes ~90% CPU; strace shows that the OSD process is 
doing something with LevelDB.

Compact is disabled:
r...@storage07.main01.ceph.apps.prod.int.grcc:~$ cat 
/etc/ceph/ceph.conf | grep compact

#leveldb_compact_on_mount = true

But with debug_leveldb=20 I see that compaction is running, but why?

2017-07-17 09:27:37.394008 7f4ed2293700  1 leveldb: Compacting 1@1 + 
12@2 files
2017-07-17 09:27:37.593890 7f4ed2293700  1 leveldb: Generated table 
#76778: 277817 keys, 2125970 bytes
2017-07-17 09:27:37.718954 7f4ed2293700  1 leveldb: Generated table 
#76779: 221451 keys, 2124338 bytes
2017-07-17 09:27:37.777362 7f4ed2293700  1 leveldb: Generated table 
#76780: 63755 keys, 809913 bytes
2017-07-17 09:27:37.919094 7f4ed2293700  1 leveldb: Generated table 
#76781: 231475 keys, 2026376 bytes
2017-07-17 09:27:38.035906 7f4ed2293700  1 leveldb: Generated table 
#76782: 190956 keys, 1573332 bytes
2017-07-17 09:27:38.127597 7f4ed2293700  1 leveldb: Generated table 
#76783: 148675 keys, 1260956 bytes
2017-07-17 09:27:38.286183 7f4ed2293700  1 leveldb: Generated table 
#76784: 294105 keys, 2123438 bytes
2017-07-17 09:27:38.469562 7f4ed2293700  1 leveldb: Generated table 
#76785: 299617 keys, 2124267 bytes
2017-07-17 09:27:38.619666 7f4ed2293700  1 leveldb: Generated table 
#76786: 277305 keys, 2124936 bytes
2017-07-17 09:27:38.711423 7f4ed2293700  1 leveldb: Generated table 
#76787: 110536 keys, 951545 bytes
2017-07-17 09:27:38.869917 7f4ed2293700  1 leveldb: Generated table 
#76788: 296199 keys, 2123506 bytes
2017-07-17 09:27:39.028395 7f4ed2293700  1 leveldb: Generated table 
#76789: 248634 keys, 2096715 bytes
2017-07-17 09:27:39.028414 7f4ed2293700  1 leveldb: Compacted 1@1 + 
12@2 files => 21465292 bytes
2017-07-17 09:27:39.053288 7f4ed2293700  1 leveldb: compacted to: 
files[ 0 0 48 549 948 0 0 ]
2017-07-17 09:27:39.054014 7f4ed2293700  1 leveldb: Delete type=2 
#76741


Strace:

open("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", O_RDONLY) 
= 18
stat("/var/lib/ceph/osd/ceph-195/current/omap/043788.ldb", 
{st_mode=S_IFREG|0644, st_size=2154394, ...}) = 0

mmap(NULL, 2154394, PROT_READ, MAP_SHARED, 18, 0) = 0x7f96a67a
close(18)   = 0
brk(0x55d15664) = 0x55d15664
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [PIPE], 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], 

Re: [ceph-users] updating the documentation

2017-07-18 Thread John Spray
On Tue, Jul 18, 2017 at 9:03 PM, Gregory Farnum  wrote:
> On Tue, Jul 18, 2017 at 6:51 AM, John Spray  wrote:
>> On Wed, Jul 12, 2017 at 8:28 PM, Sage Weil  wrote:
>>> On Wed, 12 Jul 2017, Patrick Donnelly wrote:
 On Wed, Jul 12, 2017 at 11:29 AM, Sage Weil  wrote:
 > In the meantime, we can also avoid making the problem worse by requiring
 > that all pull requests include any relevant documentation updates.  This
 > means (1) helping educate contributors that doc updates are needed, (2)
 > helping maintainers and reviewers remember that doc updates are part of
 > the merge criteria (it will likely take a bit of time before this is
 > second nature), and (3) generally inducing developers to become aware of
 > the documentation that exists so that they know what needs to be updated
 > when they make a change.

 There was a joke to add a bot which automatically fails PRs for no
 documentation but I think there is a way to make that work in a
 reasonable way. Perhaps the bot could simply comment on all PRs
 touching src/ that documentation is required and where to look, and
 then fails a doc check. A developer must comment on the PR to say it
 passes documentation requirements before the bot changes the check to
 pass.

 This addresses all three points in an automatic way.
>>>
>>> This is a great idea.  Greg brought up the idea of a bot but we
>>> didn't think of a "docs ok"-type comment to make it happy.
>>>
>>> Anybody interested in coding it up?
>>>
>>> Piotr makes a good point about config_opts.h, although that problem is
>>> about to go away (or at least change) with John's config update:
>>>
>>> https://github.com/ceph/ceph/pull/16211
>>>
>>> (Config options will be documented in the code where the schema is
>>> defined, and docs.ceph.com .rst will eventually be auto-generated from
>>> that.)
>>
>>
>> Separate to the discussion of bots, here's a proposed change to the
>> SubmittingPatches.rst to formalize the expectation that submitters
>> make doc changes in their PRs.
>
> https://github.com/ceph/ceph/pull/16394 was meant to go here, I think. :)

Correct!  I like to throw the mistakes out there every once in a while
to check someone is paying attention :-)

John

> -Greg
>
>>
>> The twist here is that in addition to requiring submitters to make
>> changes, there is a responsibility on the component tech leads to
>> ensure there is a proper place for doc changes to go.  That means that
>> if someone comes with a change to a completely undocumented area of
>> functionality, then it is not the submitter's responsibility to create
>> the whole page just to note their small change (although it would
>> obviously be awesome if they did).
>>
>> Cheers,
>> John
>>
>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating 12.1.0 -> 12.1.1

2017-07-18 Thread Gregory Farnum
Yeah, some of the message formats changed (incompatibly) during
development. If you update all your nodes it should go away; that one I
think is just ephemeral state.
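
Once everything is updated, something like this should confirm all daemons are
on the same build (`ceph versions` was added during the luminous cycle, so it
should be available on 12.1.x):

$ ceph versions               # summary of running daemon versions
$ ceph daemon mon.a version   # per-daemon check via the admin socket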

On Tue, Jul 18, 2017 at 3:09 AM Marc Roos  wrote:

>
> I just updated packages on one CentOS7 node and am getting these errors:
>
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1
> WARNING: the following dangerous and experimental features are enabled:
> bluestore
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1
> mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev
> version?
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1
> mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev
> version?
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1
> mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous
> dev version?
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1
> mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous
> dev version?
> Jul 18 12:03:34 c01 ceph-mon:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
> H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
> 2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
> 'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700
> time 2017-07-18 12:03:34.870230
> Jul 18 12:03:34 c01 ceph-mon:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
> H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
> 2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED
> assert(msg)
> Jul 18 12:03:34 c01 ceph-mon:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
> H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
> 2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
> 'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700
> time 2017-07-18 12:03:34.870230
> Jul 18 12:03:34 c01 ceph-mon:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
> H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
> 2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED
> assert(msg)
> Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 18 12:03:34 c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*,
> char const*, int, char const*)+0x110) [0x7f4fa21f4310]
> Jul 18 12:03:34 c01 ceph-mon: 2:
> (Monitor::handle_forward(boost::intrusive_ptr)+0xd70)
> [0x7f4fa1fddcd0]
> Jul 18 12:03:34 c01 ceph-mon: 3:
> (Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d)
> [0x7f4fa1fdb29d]
> Jul 18 12:03:34 c01 ceph-mon: 4: (Monitor::_ms_dispatch(Message*)+0x7de)
> [0x7f4fa1fdc06e]
> Jul 18 12:03:34 c01 ceph-mon: 5: (Monitor::ms_dispatch(Message*)+0x23)
> [0x7f4fa2004303]
> Jul 18 12:03:34 c01 ceph-mon: 6: (DispatchQueue::entry()+0x792)
> [0x7f4fa242c812]
> Jul 18 12:03:34 c01 ceph-mon: 7:
> (DispatchQueue::DispatchThread::entry()+0xd) [0x7f4fa229a3cd]
> Jul 18 12:03:34 c01 ceph-mon: 8: (()+0x7dc5) [0x7f4fa0fbedc5]
> Jul 18 12:03:34 c01 ceph-mon: 9: (clone()+0x6d) [0x7f4f9e34a76d]
> Jul 18 12:03:34 c01 ceph-mon: NOTE: a copy of the executable, or
> `objdump -rdS ` is needed to interpret this.
> Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.872654 7f4f977d9700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
> H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
> 2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
> 'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700
> time 2017-07-18 12:03:34.870230
> Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1
> (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
> Jul 18 12:03:34 c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*,
> char const*, int, char const*)+0x110) [0x7f4fa21f4310]
> Jul 18 12:03:34 c01 ceph-mon: 2:
> (Monitor::handle_forward(boost::intrusive_ptr)+0xd70)
> 

Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Gencer W . Genç
>> Not for 10GbE, but for public vs cluster network, for example:

Applied. Thanks!

>> Then I'm not sure what to expect... probably poor performance with sync 
>> writes on filestore, and not sure what would happen with
>> bluestore...
>> probably much better than filestore though if you use a large block size.

At the moment it looks good, but can you explain a bit more about block size?
(or a reference page would also work)
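
For instance, is the idea to compare runs at different op sizes with something
like this? (the pool name is just a placeholder, and rados bench writes test
objects, so a scratch pool is safest)

$ rados bench -p scratch 30 write -b 4194304 -t 16   # 4 MiB writes
$ rados bench -p scratch 30 write -b 4096 -t 16      # 4 KiB writes for comparison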

Gencer.

-Original Message-
From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de] 
Sent: Tuesday, July 18, 2017 5:59 PM
To: Gencer W. Genç 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Yet another performance tuning for CephFS

On 07/18/17 14:10, Gencer W. Genç wrote:
>>> Are you sure? Your config didn't show this.
> Yes. I have a dedicated 10GbE network between ceph nodes. Each ceph node has 
> a separate network with a 10GbE network card and speed. Do I have to set 
> anything in the config for 10GbE?
Not for 10GbE, but for public vs cluster network, for example:

> public network = 10.10.10.0/24
> cluster network = 10.10.11.0/24

Mainly this is for replication performance.

And using jumbo frames (high MTU, like 9000, on hosts and higher on
switches) also increases performance a bit (especially on slow CPUs in theory). 
That's also not in the ceph.conf.
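
A minimal sketch of the jumbo-frame side (eth0 and the address are
placeholders; the switch ports must allow the larger MTU as well):

$ ip link set dev eth0 mtu 9000
$ ping -M do -s 8972 10.10.11.2   # 8972 = 9000 minus 28 bytes of IP/ICMP header, verifies the path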

>>> What kind of devices are they? did you do the journal test?
> They are neither NVMe nor SSDs. Each node has 10x 3TB SATA hard disk 
> drives (HDD).
Then I'm not sure what to expect... probably poor performance with sync writes 
on filestore, and not sure what would happen with bluestore...
probably much better than filestore though if you use a large block size.
>
>
> -Gencer.
>
>
> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de]
> Sent: Tuesday, July 18, 2017 2:47 PM
> To: gen...@gencgiyen.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Yet another performance tuning for CephFS
>
> On 07/17/17 22:49, gen...@gencgiyen.com wrote:
>> I have a separate 10GbE network for ceph and another for public.
>>
> Are you sure? Your config didn't show this.
>
>> No they are not NVMe, unfortunately.
>>
> What kind of devices are they? did you do the journal test?
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-s
> sd-is-suitable-as-a-journal-device/
>
> Unlike most tests, with ceph journals, you can't look at the load on the 
> device and decide it's not the bottleneck; you have to test it another way. I 
> had some micron SSDs I tested which performed poorly, and that test showed 
> them performing poorly too. But from other benchmarks, and disk load during 
> journal tests, they looked ok, which was misleading.
>> Do you know any test command that I can try to see if this is the max
>> read speed from rsync?
> I don't know how you can improve your rsync test.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs metadata damage and scrub error

2017-07-18 Thread Mazzystr
Any update to this?  I also have the same problem

# for i in $(cat pg_dump | grep 'stale+active+clean' | awk {'print $1'});
do echo -n "$i: "; rados list-inconsistent-obj $i; echo; done
107.ff: {"epoch":10762,"inconsistents":[]}
.
and so on for the 49 pgs that I think I had a problem with

# ceph tell mds.ceph damage ls | python -m "json.tool"
2017-07-18 16:28:08.657673 7f766e629700  0 client.1923797 ms_handle_reset
on 192.168.1.10:6800/1268574779
2017-07-18 16:28:08.665693 7f76577fe700  0 client.1923798 ms_handle_reset
on 192.168.1.10:6800/1268574779
[
{
"damage_type": "dir_frag",
"frag": "*",
"id": 4153356868,
"ino": 1099511661266
}
]

# cat dirs_ceph_inodes | grep 1099511661266
1099511661266 drwxr-xr-x 1 chris chris 1 May  4  2017 /mnt/ceph/2017/05/04/

# rm -rf /mnt/ceph/2017/05/04/
rm: cannot remove ‘/mnt/ceph/2017/05/04/’: Directory not empty

# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)

All osd's are on 11.2.0


I am A-OK with losing the directory and its contents... I just need to get to
happy pastures
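
For reference, the admin-socket commands I gather from this thread, as I
understand them (the path is relative to the cephfs root, the id comes from my
damage ls output above; treat this as a sketch, not tested advice):

$ ceph daemon mds.ceph damage ls
$ ceph daemon mds.ceph scrub_path /2017/05/04 recursive repair
$ ceph daemon mds.ceph damage rm 4153356868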

Thanks,
/Chris Callegari



On Tue, May 30, 2017 at 7:42 AM, James Eckersall 
wrote:

> Further to this, we managed to repair the inconsistent PG by comparing the
> object digests and removing the one that didn't match (3 of 4 replicas had
> the same digest, 1 didn't) and then issuing a pg repair and scrub.
> This has removed the inconsistent flag on the PG, however, we are still
> seeing the mds report damage.
>
> We tried removing the damage from the mds with damage rm, then ran a
> recursive stat across the problem directory, but the damage re-appeared.
> Tried doing a scrub_path, but the command returned code -2 and the mds log
> shows that the scrub started and finished less than 1ms later.
>
> Any further help is greatly appreciated.
>
> On 17 May 2017 at 10:58, James Eckersall 
> wrote:
>
>> An update to this.  The cluster has been upgraded to Kraken, but I've
>> still got the same PG reporting inconsistent and the same error message
>> about mds metadata damaged.
>> Can anyone offer any further advice please?
>> If you need output from the ceph-osdomap-tool, could you please explain
>> how to use it?  I haven't been able to find any docs that explain.
>>
>> Thanks
>> J
>>
>> On 3 May 2017 at 14:35, James Eckersall 
>> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the reply, it's appreciated.
>>> We're going to upgrade the cluster to Kraken and see if that fixes the
>>> metadata issue.
>>>
>>> J
>>>
>>> On 2 May 2017 at 17:00, David Zafman  wrote:
>>>

 James,

 You have an omap corruption.  It is likely caused by a bug which
 has already been identified.  A fix for that problem is available but it is
 still pending backport for the next Jewel point release.  All 4 of your
 replicas have different "omap_digest" values.

 Instead of the xattrs the ceph-osdomap-tool --command
 dump-objects-with-keys output from OSDs 3, 10, 11, 23 would be interesting
 to compare.

 ***WARNING*** Please backup your data before doing any repair attempts.

 If you can upgrade to Kraken v11.2.0, it will auto repair the omaps on
 ceph-osd start up.  It will likely still require a ceph pg repair to make
 the 4 replicas consistent with each other.  The final result may be the
 reappearance of removed MDS files in the directory.

 If you can recover the data, you could remove the directory entirely
 and rebuild it.  The original bug was triggered during omap deletion
 typically in a large directory which corresponds to an individual unlink in
 cephfs.

 If you can build a branch in github to get the newer ceph-osdomap-tool
 you could try to use it to repair the omaps.

 David


 On 5/2/17 5:05 AM, James Eckersall wrote:

 Hi,

 I'm having some issues with a ceph cluster.  It's an 8 node cluster running
 Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
 This cluster provides RBDs and a CephFS filesystem to a number of clients.

 ceph health detail is showing the following errors:

 pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
 1 scrub errors
 mds0: Metadata damage detected


 The pg 2.9 is in the cephfs_metadata pool (id 2).

 I've looked at the OSD logs for OSD 3, which is the primary for this PG,
 but the only thing that appears relating to this PG is the following:

 log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors

 After initiating a ceph pg repair 2.9, I see the following in the primary
 OSD log:

 log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
 log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors


 I found the below command in a previous ceph-users post.  Running this
 

Re: [ceph-users] updating the documentation

2017-07-18 Thread Gregory Farnum
On Tue, Jul 18, 2017 at 6:51 AM, John Spray  wrote:
> On Wed, Jul 12, 2017 at 8:28 PM, Sage Weil  wrote:
>> On Wed, 12 Jul 2017, Patrick Donnelly wrote:
>>> On Wed, Jul 12, 2017 at 11:29 AM, Sage Weil  wrote:
>>> > In the meantime, we can also avoid making the problem worse by requiring
>>> > that all pull requests include any relevant documentation updates.  This
>>> > means (1) helping educate contributors that doc updates are needed, (2)
>>> > helping maintainers and reviewers remember that doc updates are part of
>>> > the merge criteria (it will likely take a bit of time before this is
>>> > second nature), and (3) generally inducing developers to become aware of
>>> > the documentation that exists so that they know what needs to be updated
>>> > when they make a change.
>>>
>>> There was a joke to add a bot which automatically fails PRs for no
>>> documentation but I think there is a way to make that work in a
>>> reasonable way. Perhaps the bot could simply comment on all PRs
>>> touching src/ that documentation is required and where to look, and
>>> then fail a doc check. A developer must comment on the PR to say it
>>> passes documentation requirements before the bot changes the check to
>>> pass.
>>>
>>> This addresses all three points in an automatic way.
>>
>> This is a great idea.  Greg brought up the idea of a bot but we
>> didn't think of a "docs ok"-type comment to make it happy.
>>
>> Anybody interested in coding it up?
>>
>> Piotr makes a good point about config_opts.h, although that problem is
>> about to go away (or at least change) with John's config update:
>>
>> https://github.com/ceph/ceph/pull/16211
>>
>> (Config options will be documented in the code where the schema is
>> defined, and docs.ceph.com .rst will eventually be auto-generated from
>> that.)
>
>
> Separate to the discussion of bots, here's a proposed change to the
> SubmittingPatches.rst to formalize the expectation that submitters
> make doc changes in their PRs.

https://github.com/ceph/ceph/pull/16394 was meant to go here, I think. :)
-Greg

>
> The twist here is that in addition to requiring submitters to make
> changes, there is a responsibility on the component tech leads to
> ensure there is a proper place for doc changes to go.  That means that
> if someone comes with a change to a completely undocumented area of
> functionality, then it is not the submitter's responsibility to create
> the whole page just to note their small change (although it would
> obviously be awesome if they did).
>
> Cheers,
> John
>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Updating 12.1.0 -> 12.1.1 mon / osd wont start

2017-07-18 Thread Marc Roos
 

I just updated packages on one CentOS7 node and am getting these errors. 
Does anybody have an idea how to resolve this?


Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1 
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev 
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1 
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev 
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1 
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous 
dev version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1 
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous 
dev version?
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230 Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED
assert(msg)
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230 Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED
assert(msg)
Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc) Jul 18 12:03:34 
c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, char const*, 
int, char const*)+0x110) [0x7f4fa21f4310] Jul 18 12:03:34 c01 ceph-mon: 
2: 
(Monitor::handle_forward(boost::intrusive_ptr)+0xd70)
[0x7f4fa1fddcd0]
Jul 18 12:03:34 c01 ceph-mon: 3: 
(Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d)
[0x7f4fa1fdb29d]
Jul 18 12:03:34 c01 ceph-mon: 4: (Monitor::_ms_dispatch(Message*)+0x7de)
[0x7f4fa1fdc06e]
Jul 18 12:03:34 c01 ceph-mon: 5: (Monitor::ms_dispatch(Message*)+0x23)
[0x7f4fa2004303]
Jul 18 12:03:34 c01 ceph-mon: 6: (DispatchQueue::entry()+0x792) 
[0x7f4fa242c812] Jul 18 12:03:34 c01 ceph-mon: 7: 
(DispatchQueue::DispatchThread::entry()+0xd) [0x7f4fa229a3cd] Jul 18 
12:03:34 c01 ceph-mon: 8: (()+0x7dc5) [0x7f4fa0fbedc5] Jul 18 12:03:34 
c01 ceph-mon: 9: (clone()+0x6d) [0x7f4f9e34a76d] Jul 18 12:03:34 c01 
ceph-mon: NOTE: a copy of the executable, or `objdump -rdS ` 
is needed to interpret this.
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.872654 7f4f977d9700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230 Jul 18 12:03:34 c01 ceph-mon: ceph 
version 12.1.1
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc) Jul 18 12:03:34 
c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, char const*, 
int, char const*)+0x110) [0x7f4fa21f4310] Jul 18 12:03:34 c01 ceph-mon: 
2: 
(Monitor::handle_forward(boost::intrusive_ptr)+0xd70)
[0x7f4fa1fddcd0]
Jul 18 12:03:34 c01 ceph-mon: 3: 
(Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d)
[0x7f4fa1fdb29d]
Jul 18 12:03:34 c01 ceph-mon: 4: (Monitor::_ms_dispatch(Message*)+0x7de)
[0x7f4fa1fdc06e]
Jul 18 12:03:34 c01 ceph-mon: 5: (Monitor::ms_dispatch(Message*)+0x23)
[0x7f4fa2004303]
Jul 18 12:03:34 c01 ceph-mon: 6: 

[ceph-users] Ceph-Kraken: Error installing calamari

2017-07-18 Thread Oscar Segarra
Hi,

I have created a VM called vdiccalamari where I'm trying to install the
calamari server in order to view ceph status from a gui:

[vdicceph@vdicnode01 ceph]$ sudo ceph status
cluster 656e84b2-9192-40fe-9b81-39bd0c7a3196
 health HEALTH_OK
 monmap e2: 1 mons at {vdicnode01=192.168.100.101:6789/0}
election epoch 11, quorum 0 vdicnode01
mgr active: vdicnode01
 osdmap e52: 2 osds: 2 up, 2 in
flags sortbitwise,require_jewel_osds,require_kraken_osds
  pgmap v2312: 192 pgs, 2 pools, 4171 MB data, 1052 objects
8438 MB used, 63204 MB / 71642 MB avail
 192 active+clean


But when I try to install calamari from admin-server (in my case vdicnode01
as well) I get the following error:

[vdicceph@vdicnode01 ceph]$ ceph-deploy --username vdicceph calamari
connect vdiccalamari.local
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/vdicceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.38): /bin/ceph-deploy --username
vdicceph calamari connect vdiccalamari.local
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : vdicceph
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: connect
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  master: None
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  hosts :
['vdiccalamari.local']
[ceph_deploy.cli][INFO  ]  default_release   : False
[vdiccalamari.local][DEBUG ] connection detected need for sudo
[vdiccalamari.local][DEBUG ] connected to host: vdicceph@vdiccalamari.local
[vdiccalamari.local][DEBUG ] detect platform information from remote host
[vdiccalamari.local][DEBUG ] detect machine type
[ceph_deploy.calamari][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph_deploy.calamari][INFO  ] assuming that a repository with Calamari
packages is already configured.
[ceph_deploy.calamari][INFO  ] Refer to the docs for examples (
http://ceph.com/ceph-deploy/docs/conf.html)
[vdiccalamari.local][DEBUG ] creating config dir: /etc/salt/minion.d
[vdiccalamari.local][DEBUG ] creating the calamari salt config:
/etc/salt/minion.d/calamari.conf
[vdiccalamari.local][INFO  ] Running command: sudo yum -y install
salt-minion
[vdiccalamari.local][DEBUG ] Loaded plugins: fastestmirror
[vdiccalamari.local][DEBUG ] Loading mirror speeds from cached hostfile
[vdiccalamari.local][DEBUG ]  * base: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * epel: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * extras: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * updates: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ] Package salt-minion-2015.5.10-2.el7.noarch
already installed and latest version
[vdiccalamari.local][DEBUG ] Nothing to do
[vdiccalamari.local][INFO  ] Running command: sudo yum -y install diamond
[vdiccalamari.local][DEBUG ] Loaded plugins: fastestmirror
[vdiccalamari.local][DEBUG ] Loading mirror speeds from cached hostfile
[vdiccalamari.local][DEBUG ]  * base: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * epel: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * extras: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ]  * updates: mirror.airenetworks.es
[vdiccalamari.local][DEBUG ] No package diamond available.
[vdiccalamari.local][WARNIN] Error: Nothing to do
*[vdiccalamari.local][ERROR ] RuntimeError: command returned non-zero exit
status: 1*
*[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
install diamond*

I have googled but have not been able to find the diamond package.
Has anybody experienced the same issue?

Thanks a lot.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices for expanding hammer cluster

2017-07-18 Thread David Turner
This was recently covered on the mailing list. I believe this will cover
all of your questions.

https://www.spinics.net/lists/ceph-users/msg37252.html

On Tue, Jul 18, 2017, 9:07 AM Laszlo Budai  wrote:

> Dear all,
>
> we are planning to add new hosts to our existing hammer clusters, and I'm
> looking for best practices recommendations.
>
> currently we have 2 clusters with 72 OSDs and 6 nodes each. We want to add
> 3 more nodes (36 OSDs) to each cluster, and we have some questions about
> what would be the best way to do it. Currently the two clusters have
> different CRUSH maps.
>
> Cluster 1
> The CRUSH map only has OSDs, hosts and the root bucket. Failure domain is
> host.
> Our final desired state would be:
> OSD - hosts - chassis - root where each chassis has 3 hosts, each host has
> 12 OSDs, and the failure domain would be chassis.
>
> What would be the recommended way to achieve this without downtime for
> client operations?
> I have read about the possibility to throttle down the recovery/backfill
> using
> osd max backfills = 1
> osd recovery max active = 1
> osd recovery max single start = 1
> osd recovery op priority = 1
> osd recovery threads = 1
> osd backfill scan max = 16
> osd backfill scan min = 4
>
> but we wonder about the situation when, in a worst case scenario, all the
> replicas belonging to one pg have to be migrated to new locations according
> to the new CRUSH map. How will ceph behave in such situation?
>
>
> Cluster 2
> the crush map already contains chassis. Currently we have 3 chassis (c1,
> c2, c3) and 6 hosts:
> - x1, x2 in chassis c1
> - y1, y2 in chassis c2
> - x3, y3 in chassis c3
>
> We are adding hosts z1, z2, z3 and our desired CRUSH map would look like
> this:
> - x1, x2, x3 in c1
> - y1, y2, y3 in c2
> - z1, z2, z3 in c3
>
> Again, what would be the recommended way to achieve this while the clients
> are still accessing the data?
>
> Is it safe to add more OSDs at a time, or should we add them one by one?
>
> Thank you in advance for any suggestions, recommendations.
>
> Kind regards,
> Laszlo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-18 Thread David Turner
It was recommended to set sortbitwise in the upgrade from Hammer to Jewel
when Jewel was first released. It is definitely safe to enable on 10.2.6.

On Tue, Jul 18, 2017, 8:05 AM Dan van der Ster  wrote:

> Hi Martin,
>
> We had sortbitwise set on other jewel clusters well before 10.2.9 was out.
> 10.2.8 added the warning if it is not set, but the flag should be safe
> in 10.2.6.
>
> -- Dan
>
>
>
> On Tue, Jul 18, 2017 at 11:43 AM, Martin Palma  wrote:
> > Can the "sortbitwise" also be set if we have a cluster running OSDs on
> > 10.2.6 and some OSDs on 10.2.9? Or should we wait that all OSDs are on
> > 10.2.9?
> >
> > Monitor nodes are already on 10.2.9.
> >
> > Best,
> > Martin
> >
> > On Fri, Jul 14, 2017 at 1:16 PM, Dan van der Ster 
> wrote:
> >> On Mon, Jul 10, 2017 at 5:06 PM, Sage Weil  wrote:
> >>> On Mon, 10 Jul 2017, Luis Periquito wrote:
>  Hi Dan,
> 
>  I've enabled it in a couple of big-ish clusters and had the same
>  experience - a few seconds disruption caused by a peering process
>  being triggered, like any other crushmap update does. Can't remember
>  if it triggered data movement, but I have a feeling it did...
> >>>
> >>> That's consistent with what one should expect.
> >>>
> >>> The flag triggers a new peering interval, which means the PGs will
> peer,
> >>> but there is no change in the mapping or data layout or anything else.
> >>> The only thing that is potentially scary here is that *every* PG will
> >>> repeer at the same time.
> >>
> >> Thanks Sage & Luis. I confirm that setting sortbitwise on a large
> >> cluster is basically a non-event... nothing to worry about.
> >>
> >> (Btw, we just upgraded our biggest prod clusters to jewel -- that also
> >> went totally smooth!)
> >>
> >> -- Dan
> >>
> >>> sage
> >>>
> >>>
> 
> 
> 
>  On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster 
> wrote:
>  > Hi all,
>  >
>  > With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
>  >
>  > I just updated a test cluster, saw that warning, then did the
> necessary
>  >   ceph osd set sortbitwise
>  >
>  > I noticed a short re-peering which took around 10s on this small
>  > cluster with very little data.
>  >
>  > Has anyone done this already on a large cluster with lots of
> objects?
>  > It would be nice to hear that it isn't disruptive before running it
> on
>  > our big production instances.
>  >
>  > Cheers, Dan
>  > ___
>  > ceph-users mailing list
>  > ceph-users@lists.ceph.com
>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] skewed osd utilization

2017-07-18 Thread Ashley Merrick
Hello,

On an updated Luminous cluster I am getting the following health warning (skewed osd 
utilization). The reason for this is that I have a set of SSDs in a cache which are 
much emptier than my standard SAS disks, putting the ratio off massively.

Is it possible to tell it to exclude certain disks from this calculation?

,Ashley
Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modify pool size not allowed with permission osd 'allow rwx pool=test'

2017-07-18 Thread Wido den Hollander

> Op 18 juli 2017 om 17:40 schreef Marc Roos :
> 
> 
>  
> 
> With ceph auth I have set permissions like below; I can add and delete 
> objects in the test pool, but cannot set the size of the test pool. What 
> permission do I need to add for this user to modify the size of this 
> test pool?
> 
>  mon 'allow r' mds 'allow r' osd 'allow rwx pool=test'
> 

You will need rw and maybe x on mon, but that gives you cluster wide 
capabilities.

As changing size (or min_size) isn't something you do that often I recommend 
that you don't grant these capabilities to every user.
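
For illustration, a hedged example of what such broader caps could look like,
mirroring the "rw and maybe x" above (the client name is an assumption):

  ceph auth caps client.pooladmin mon 'allow rwx' mds 'allow r' osd 'allow rwx pool=test'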

Wido

> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Modify pool size not allowed with permission osd 'allow rwx pool=test'

2017-07-18 Thread Marc Roos
 

With ceph auth I have set permissions like below; I can add and delete 
objects in the test pool, but cannot set the size of the test pool. What 
permission do I need to add for this user to modify the size of this 
test pool?

 mon 'allow r' mds 'allow r' osd 'allow rwx pool=test'





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems getting nfs-ganesha with cephfs backend to work.

2017-07-18 Thread David
You mentioned the Kernel client works but the Fuse mount would be a better
test in relation to the Ganesha FSAL.

The following config didn't give me the error you describe in 1) but I'm
mounting on the client with NFSv4, not sure about 2), is that dm-nfs?

EXPORT
{
Export_ID = 1;
Path = "/";
Pseudo = "/";
Access_Type = RW;
Squash = No_Root_Squash;
SecType = "none";
Protocols = "3", "4";
Transports = "TCP";

FSAL {
Name = CEPH;
}
}

Ganesha version 2.5.0.1 from the nfs-ganesha repo hosted on
download.ceph.com
CentOS 7.3 server and client
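
For reference, a hedged example of mounting such an export over NFSv4 from a
client (server name and mount point are assumptions):

  mount -t nfs -o vers=4 ganesha-server:/ /mnt/cephfs-nfs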


On Mon, Jul 17, 2017 at 2:26 PM, Micha Krause  wrote:

> Hi,
>
>
> > Change Pseudo to something like /mypseudofolder
>
> I tried this, without success, but I managed to get something working with
> version 2.5.
>
> I can mount the NFS export now, however 2 problems remain:
>
> 1. The root directory of the mount-point looks empty (ls shows no files),
> however directories
>and files can be accessed, and ls works in subdirectories.
>
> 2. I can't create devices in the nfs mount, not sure if ganesha supports
> this with other backends.
>
>
>
> Micha Krause
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Peter Maloney
On 07/18/17 14:10, Gencer W. Genç wrote:
>>> Are you sure? Your config didn't show this.
> Yes. I have a dedicated 10GbE network between ceph nodes. Each ceph node 
> has a separate network with a 10GbE network card. Do I have to set 
> anything in the config for 10GbE?
Not for 10GbE, but for public vs cluster network, for example:

> public network = 10.10.10.0/24
> cluster network = 10.10.11.0/24

Mainly this is for replication performance.

And using jumbo frames (high MTU, like 9000, on hosts and higher on
switches) also increases performance a bit (especially on slow CPUs in
theory). That's also not in the ceph.conf.
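
As a hedged sketch, jumbo frames can be tried on the cluster-network interface
like this (interface name and peer address are assumptions; the switch ports
must allow the larger MTU as well):

  ip link set dev eth1 mtu 9000
  # check that 9000-byte frames really pass (8972 = 9000 minus IP/ICMP headers)
  ping -M do -s 8972 10.10.11.2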

>>> What kind of devices are they? did you do the journal test?
> They are neither NVMe nor SSDs. Each node has 10x3TB SATA Hard 
> Disk Drives (HDD).
Then I'm not sure what to expect... probably poor performance with sync
writes on filestore, and not sure what would happen with bluestore...
probably much better than filestore though if you use a large block size.
>
>
> -Gencer.
>
>
> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de] 
> Sent: Tuesday, July 18, 2017 2:47 PM
> To: gen...@gencgiyen.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Yet another performance tuning for CephFS
>
> On 07/17/17 22:49, gen...@gencgiyen.com wrote:
>> I have a seperate 10GbE network for ceph and another for public.
>>
> Are you sure? Your config didn't show this.
>
>> No they are not NVMe, unfortunately.
>>
> What kind of devices are they? did you do the journal test?
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Unlike most tests, with ceph journals, you can't look at the load on the 
> device and decide it's not the bottleneck; you have to test it another way. I 
> had some micron SSDs I tested which performed poorly, and that test showed 
> them performing poorly too. But from other benchmarks, and disk load during 
> journal tests, they looked ok, which was misleading.
>> Do you know any test command that i can try to see if this is the max.
>> Read speed from rsync?
> I don't know how you can improve your rsync test.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] updating the documentation

2017-07-18 Thread John Spray
On Wed, Jul 12, 2017 at 8:28 PM, Sage Weil  wrote:
> On Wed, 12 Jul 2017, Patrick Donnelly wrote:
>> On Wed, Jul 12, 2017 at 11:29 AM, Sage Weil  wrote:
>> > In the meantime, we can also avoid making the problem worse by requiring
>> > that all pull requests include any relevant documentation updates.  This
>> > means (1) helping educate contributors that doc updates are needed, (2)
>> > helping maintainers and reviewers remember that doc updates are part of
>> > the merge criteria (it will likely take a bit of time before this is
>> > second nature), and (3) generally inducing developers to become aware of
>> > the documentation that exists so that they know what needs to be updated
>> > when they make a change.
>>
>> There was a joke to add a bot which automatically fails PRs for no
>> documentation but I think there is a way to make that work in a
>> reasonable way. Perhaps the bot could simply comment on all PRs
>> touching src/ that documentation is required and where to look, and
>> then fail a doc check. A developer must comment on the PR to say it
>> passes documentation requirements before the bot changes the check to
>> pass.
>>
>> This addresses all three points in an automatic way.
>
> This is a great idea.  Greg brought up the idea of a bot but we
> didn't think of a "docs ok"-type comment to make it happy.
>
> Anybody interested in coding it up?
>
> Piotr makes a good point about config_opts.h, although that problem is
> about to go away (or at least change) with John's config update:
>
> https://github.com/ceph/ceph/pull/16211
>
> (Config options will be documented in the code where the schema is
> defined, and docs.ceph.com .rst will eventually be auto-generated from
> that.)


Separate to the discussion of bots, here's a proposed change to the
SubmittingPatches.rst to formalize the expectation that submitters
make doc changes in their PRs.

The twist here is that in addition to requiring submitters to make
changes, there is a responsibility on the component tech leads to
ensure there is a proper place for doc changes to go.  That means that
if someone comes with a change to a completely undocumented area of
functionality, then it is not the submitter's responsibility to create
the whole page just to note their small change (although it would
obviously be awesome if they did).

Cheers,
John

> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] v12.1.1 Luminous RC released

2017-07-18 Thread Abhishek Lekshmanan

This is the second release candidate for Luminous, the next long term stable
release. Please note that this is still a *release candidate* and not
the final release, and hence not yet recommended on production clusters,
testing is welcome & we would love feedback and bug reports.

Ceph Luminous (v12.2.0) will be the foundation for the next long-term
stable release series.  There have been major changes since Kraken
(v11.2.z) and Jewel (v10.2.z), and the upgrade process is non-trivial.
Please read these release notes carefully. 

Major Changes from Kraken
-

- *General*:

  * Ceph now has a simple, built-in web-based dashboard for monitoring
cluster status. 

- *RADOS*:

  * *BlueStore*:

- The new *BlueStore* backend for *ceph-osd* is now stable and the new
  default for newly created OSDs.  BlueStore manages data stored by each OSD
  by directly managing the physical HDDs or SSDs without the use of an
  intervening file system like XFS.  This provides greater performance
  and features. 
- BlueStore supports *full data and metadata checksums* of all
  data stored by Ceph.
- BlueStore supports inline compression using zlib, snappy, or LZ4.  (Ceph
  also supports zstd for RGW compression but zstd is not recommended for
  BlueStore for performance reasons.) 

  * *Erasure coded* pools now have full support for *overwrites*,
allowing them to be used with RBD and CephFS. 

  * The configuration option "osd pool erasure code stripe width" has
been replaced by "osd pool erasure code stripe unit", and given the
ability to be overridden by the erasure code profile setting
"stripe_unit". For more details see "Erasure Code Profiles" in the
documentation.

  * rbd and cephfs can use erasure coding with bluestore. This may be
enabled by setting 'allow_ec_overwrites' to 'true' for a pool. Since
this relies on bluestore's checksumming to do deep scrubbing,
enabling this on a pool stored on filestore is not allowed.
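
    As a hedged example (the pool name is an assumption), this is enabled
    per pool with:

      ceph osd pool set my_ec_pool allow_ec_overwrites true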

  * The 'rados df' JSON output now prints numeric values as numbers instead of
strings.

  * The `mon_osd_max_op_age` option has been renamed to
`mon_osd_warn_op_age` (default: 32 seconds), to indicate we
generate a warning at this age.  There is also a new
    `mon_osd_err_op_age_ratio` that is expressed as a multiple of
`mon_osd_warn_op_age` (default: 128, for roughly 60 minutes) to
control when an error is generated.

  * The default maximum size for a single RADOS object has been reduced from
100GB to 128MB.  The 100GB limit was completely impractical in practice
while the 128MB limit is a bit high but not unreasonable.  If you have an
application written directly to librados that is using objects larger than
128MB you may need to adjust `osd_max_object_size`.
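
    As a hedged example, the limit could be raised again in ceph.conf (the
    1 GiB value below is purely illustrative):

      [osd]
      osd max object size = 1073741824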

  * The semantics of the 'rados ls' and librados object listing
operations have always been a bit confusing in that "whiteout"
objects (which logically don't exist and will return ENOENT if you
try to access them) are included in the results.  Previously
whiteouts only occurred in cache tier pools.  In luminous, logically
deleted but snapshotted objects now result in a whiteout object, and
as a result they will appear in 'rados ls' results, even though
trying to read such an object will result in ENOENT.  The 'rados
listsnaps' operation can be used in such a case to enumerate which
snapshots are present.

This may seem a bit strange, but is less strange than having a
deleted-but-snapshotted object not appear at all and be completely
hidden from librados's ability to enumerate objects.  Future
versions of Ceph will likely include an alternative object
enumeration interface that makes it more natural and efficient to
enumerate all objects along with their snapshot and clone metadata.


  * *ceph-mgr*:

- There is a new daemon, *ceph-mgr*, which is a required part of any
  Ceph deployment.  Although IO can continue when *ceph-mgr* is
  down, metrics will not refresh and some metrics-related calls
  (e.g., `ceph df`) may block.  We recommend deploying several instances of
  *ceph-mgr* for reliability.  See the notes on `Upgrading`_ below.
- The *ceph-mgr* daemon includes a REST-based management API.  The
  API is still experimental and somewhat limited but will form the basis
  for API-based management of Ceph going forward.  

- The `status` ceph-mgr module is enabled by default, and initially 
provides two
  commands: `ceph tell mgr osd status` and `ceph tell mgr fs status`.  These
  are high level colorized views to complement the existing CLI.


  * The overall *scalability* of the cluster has improved. We have
successfully tested clusters with up to 10,000 OSDs.
  * Each OSD can now have a *device class* associated with it (e.g., `hdd` or
`ssd`), allowing CRUSH rules to trivially map data to a 

Re: [ceph-users] Mon's crashing after updating

2017-07-18 Thread Ashley Merrick
Perfect, that seems to have worked, so it looks like it was the same bug.

Thanks,
Ashley

-Original Message-
From: John Spray [mailto:jsp...@redhat.com] 
Sent: Tuesday, 18 July 2017 8:27 PM
To: Ashley Merrick 
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Mon's crashing after updating

On Tue, Jul 18, 2017 at 1:20 PM, Ashley Merrick  wrote:
> Hello,
>
> Thanks for the quick response. As it seems to be related to when I tried 
> to enable the dashboard (if I’m correct), is there a way I can try to 
> disable the dashboard via the admin socket etc., or another workaround 
> until a version is released with the patch?

You could stop the mgrs, then start the mons, then do a "module disable" on 
dashboard, then start the mgrs again.

John

>
> Thanks,
> Ashley
>
> Sent from my iPhone
>
> On 18 Jul 2017, at 8:14 PM, John Spray  wrote:
>
> On Tue, Jul 18, 2017 at 12:43 PM, Ashley Merrick 
> 
> wrote:
>
> Hello,
>
>
>
>
> I just updated to latest CEPH Lum RC, all was working fine with my 3
>
> Mon’s/Mgr’s online, went to enable the Dashboard with the command : 
> ceph mgr
>
> module enable dashboard
>
>
>
>
> Now only one of the 3 MON’s will run, every time a try and start a 
> failed
>
> mon it will either fail or stay online and the running mon fail.
>
>
> This is presumably a crash in the "log last" command that the 
> dashboard sends to the monitor on startup to load its log history.
>
> It could be the same one that's fixed by 
> https://github.com/ceph/ceph/pull/16376/files
>
> John
>
>
>
>
> The following is showed in the error log:
>
>
>
>
> 2017-07-18 12:07:19.611915 has v0 lc 44232717
>
>
> 0> 2017-07-18 12:07:14.687001 7f7c9f2fc700 -1 *** Caught signal
>
> (Segmentation fault) **
>
>
> in thread 7f7c9f2fc700 thread_name:ms_dispatch
>
>
>
>
> ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) 
> luminous (rc)
>
>
> 1: (()+0x87494c) [0x55ae716da94c]
>
>
> 2: (()+0xf890) [0x7f7cab700890]
>
>
> 3:
>
> (LogMonitor::preprocess_command(boost::intrusive_ptr)+0x
> 9af)
>
> [0x55ae7124e3ef]
>
>
> 4: 
> (LogMonitor::preprocess_query(boost::intrusive_ptr)+0x2d
> e)
>
> [0x55ae7124ed9e]
>
>
> 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x7e8)
>
> [0x55ae71335f58]
>
>
> 6: 
> (Monitor::handle_command(boost::intrusive_ptr)+0x294f)
>
> [0x55ae7120bb7f]
>
>
> 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0x8b9)
>
> [0x55ae71212099]
>
>
> 8: (Monitor::_ms_dispatch(Message*)+0xd99) [0x55ae71213999]
>
>
> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55ae7123ad43]
>
>
> 10: (DispatchQueue::entry()+0x7ca) [0x55ae71683fda]
>
>
> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ae714e554d]
>
>
> 12: (()+0x8064) [0x7f7cab6f9064]
>
>
> 13: (clone()+0x6d) [0x7f7ca8c0162d]
>
>
> NOTE: a copy of the executable, or `objdump -rdS ` is 
> needed to
>
> interpret this.
>
>
>
>
> Thanks,
>
> Ashley Merrick
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] best practices for expanding hammer cluster

2017-07-18 Thread Laszlo Budai

Dear all,

we are planning to add new hosts to our existing hammer clusters, and I'm 
looking for best practices recommendations.

currently we have 2 clusters with 72 OSDs and 6 nodes each. We want to add 3 
more nodes (36 OSDs) to each cluster, and we have some questions about what 
would be the best way to do it. Currently the two clusters have different CRUSH 
maps.

Cluster 1
The CRUSH map only has OSDs, hosts and the root bucket. Failure domain is host.
Our final desired state would be:
OSD - hosts - chassis - root where each chassis has 3 hosts, each host has 12 
OSDs, and the failure domain would be chassis.

What would be the recommended way to achieve this without downtime for client 
operations?
I have read about the possibility to throttle down the recovery/backfill using
osd max backfills = 1
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 1
osd recovery threads = 1
osd backfill scan max = 16
osd backfill scan min = 4

but we wonder about the situation when, in a worst case scenario, all the 
replicas belonging to one pg have to be migrated to new locations according to 
the new CRUSH map. How will ceph behave in such situation?


Cluster 2
the crush map already contains chassis. Currently we have 3 chassis (c1, c2, 
c3) and 6 hosts:
- x1, x2 in chassis c1
- y1, y2 in chassis c2
- x3, y3 in chassis c3

We are adding hosts z1, z2, z3 and our desired CRUSH map would look like this:
- x1, x2, x3 in c1
- y1, y2, y3 in c2
- z1, z2, z3 in c3

Again, what would be the recommended way to achieve this while the clients are 
still accessing the data?

Is it safe to add more OSDs at a time, or should we add them one by one?
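
For illustration only, a hedged sketch of the CRUSH moves implied by the
Cluster 2 layout above (bucket and host names are taken from the example;
this is an untested illustration, not a recommendation, and recovery
throttling would need to be in place first):

  # move the existing hosts into their target chassis
  ceph osd crush move x3 chassis=c1
  ceph osd crush move y3 chassis=c2
  # once the new hosts exist in the CRUSH map, place them under c3
  ceph osd crush move z1 chassis=c3
  ceph osd crush move z2 chassis=c3
  ceph osd crush move z3 chassis=c3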

Thank you in advance for any suggestions, recommendations.

Kind regards,
Laszlo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Ansgar Jazdzewski
Hi,

I will try to join in and help. As far as I understand, you only have
HDDs in your cluster, you use the journal on the HDDs, and you have a
replication of 3 set on your pools?

With that in mind you can do some calculations. For each client write, Ceph needs to:

1. write the data and metadata into the journal
2. copy the data over the backend network twice, to the other OSDs
3. write into the journal on the secondary OSDs
4. wait for the ACK from all OSDs.

With your setup you can assume that you get roughly 1/4 of the write speed
of your HDDs, and with only one client you can't make use of the scale-out.

If you can, you should add SSDs for the journals and the CephFS metadata
pool; you could also consider building a cache tier for the
cephfs data pool.
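
As a hedged sketch, the cache-tier part usually looks something like this
(pool names are assumptions):

  ceph osd tier add cephfs_data cephfs_cache
  ceph osd tier cache-mode cephfs_cache writeback
  ceph osd tier set-overlay cephfs_data cephfs_cache
  ceph osd pool set cephfs_cache hit_set_type bloom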

I hope it helps a bit,
Ansgar

2017-07-18 14:10 GMT+02:00 Gencer W. Genç :
>>> Are you sure? Your config didn't show this.
>
> Yes. I have a dedicated 10GbE network between ceph nodes. Each ceph node 
> has a separate network with a 10GbE network card. Do I have to set 
> anything in the config for 10GbE?
>
>>> What kind of devices are they? did you do the journal test?
> They are neither NVMe nor SSDs. Each node has 10x3TB SATA Hard 
> Disk Drives (HDD).
>
>
> -Gencer.
>
>
> -Original Message-
> From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de]
> Sent: Tuesday, July 18, 2017 2:47 PM
> To: gen...@gencgiyen.com
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Yet another performance tuning for CephFS
>
> On 07/17/17 22:49, gen...@gencgiyen.com wrote:
>> I have a seperate 10GbE network for ceph and another for public.
>>
> Are you sure? Your config didn't show this.
>
>> No they are not NVMe, unfortunately.
>>
> What kind of devices are they? did you do the journal test?
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Unlike most tests, with ceph journals, you can't look at the load on the 
> device and decide it's not the bottleneck; you have to test it another way. I 
> had some micron SSDs I tested which performed poorly, and that test showed 
> them performing poorly too. But from other benchmarks, and disk load during 
> journal tests, they looked ok, which was misleading.
>> Do you know any test command that i can try to see if this is the max.
>> Read speed from rsync?
> I don't know how you can improve your rsync test.
>>
>> Because I tried one thing a few minutes ago. I opened 4 ssh channels
>> and ran rsync to copy a big file to different targets in cephfs
>> at the same time. Then I looked at the network graphs and I saw numbers
>> up to 1.09 gb/s. But why can a single copy/rsync not exceed 200mb/s? I
>> really wonder what prevents it.
>>
>> Gencer.
>>
>> On 2017-07-17 23:24, Peter Maloney wrote:
>>> You should have a separate public and cluster network. And journal or
>>> wal/db performance is important... are the devices fast NVMe?
>>>
>>> On 07/17/17 21:31, gen...@gencgiyen.com wrote:
>>>
 Hi,

 I located and applied almost every different tuning setting/config
 over the internet. I couldn’t manage to speed up my speed one byte
 further. It is always same speed whatever I do.

 I was on jewel, now I tried BlueStore on Luminous. Still exact same
 speed I gain from cephfs.

 It doesn’t matter if I disable debug log, or remove [osd] section as
 below and re-add as below (see .conf). Results are exactly the same.
 Not a single byte is gained from those tunings. I also did tuning
 for kernel (sysctl.conf).

 Basics:

 I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each
 node has 24 cores and 64GB of RAM. Ceph nodes are connected via
 10GbE NIC. No FUSE used. But tried that too. Same results.

 $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct

 10+0 records in

 10+0 records out

 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s

 182MB/s. This is the best speed i get so far. Usually 170~MB/s. Hm..
 I get much much much higher speeds on different filesystems. Even
 with glusterfs. Is there anything I can do or try?

 Read speed is also around 180-220MB/s but not higher.

 This is What I am using on ceph.conf:

 [global]

 fsid = d7163667-f8c5-466b-88df-8747b26c91df

 mon_initial_members = server1

 mon_host = 192.168.0.1

 auth_cluster_required = cephx

 auth_service_required = cephx

 auth_client_required = cephx

 osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier

 osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier


 osd_mkfs_type = xfs

 osd pool default size = 2

 enable experimental unrecoverable data corrupting features =
 bluestore rocksdb

 bluestore fsck on mount = true

 rbd readahead disable after 

Re: [ceph-users] Mon's crashing after updating

2017-07-18 Thread John Spray
On Tue, Jul 18, 2017 at 1:20 PM, Ashley Merrick  wrote:
> Hello,
>
> Thanks for the quick response. As it seems to be related to when I tried to
> enable the dashboard (if I’m correct), is there a way I can try to disable
> the dashboard via the admin socket etc., or another workaround until a
> version is released with the patch?

You could stop the mgrs, then start the mons, then do a "module
disable" on dashboard, then start the mgrs again.
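
A hedged sketch of that sequence with systemd (unit names assume a
package-based install):

  # on every mgr node
  systemctl stop ceph-mgr.target
  # on every mon node
  systemctl start ceph-mon.target
  # once the mons have quorum again
  ceph mgr module disable dashboard
  # then bring the mgrs back
  systemctl start ceph-mgr.target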

John

>
> Thanks,
> Ashley
>
> Sent from my iPhone
>
> On 18 Jul 2017, at 8:14 PM, John Spray  wrote:
>
> On Tue, Jul 18, 2017 at 12:43 PM, Ashley Merrick 
> wrote:
>
> Hello,
>
>
>
>
> I just updated to latest CEPH Lum RC, all was working fine with my 3
>
> Mon’s/Mgr’s online, went to enable the Dashboard with the command : ceph mgr
>
> module enable dashboard
>
>
>
>
> Now only one of the 3 MON’s will run, every time a try and start a failed
>
> mon it will either fail or stay online and the running mon fail.
>
>
> This is presumably a crash in the "log last" command that the
> dashboard sends to the monitor on startup to load its log history.
>
> It could be the same one that's fixed by
> https://github.com/ceph/ceph/pull/16376/files
>
> John
>
>
>
>
> The following is showed in the error log:
>
>
>
>
> 2017-07-18 12:07:19.611915 has v0 lc 44232717
>
>
> 0> 2017-07-18 12:07:14.687001 7f7c9f2fc700 -1 *** Caught signal
>
> (Segmentation fault) **
>
>
> in thread 7f7c9f2fc700 thread_name:ms_dispatch
>
>
>
>
> ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
>
>
> 1: (()+0x87494c) [0x55ae716da94c]
>
>
> 2: (()+0xf890) [0x7f7cab700890]
>
>
> 3:
>
> (LogMonitor::preprocess_command(boost::intrusive_ptr)+0x9af)
>
> [0x55ae7124e3ef]
>
>
> 4: (LogMonitor::preprocess_query(boost::intrusive_ptr)+0x2de)
>
> [0x55ae7124ed9e]
>
>
> 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x7e8)
>
> [0x55ae71335f58]
>
>
> 6: (Monitor::handle_command(boost::intrusive_ptr)+0x294f)
>
> [0x55ae7120bb7f]
>
>
> 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0x8b9)
>
> [0x55ae71212099]
>
>
> 8: (Monitor::_ms_dispatch(Message*)+0xd99) [0x55ae71213999]
>
>
> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55ae7123ad43]
>
>
> 10: (DispatchQueue::entry()+0x7ca) [0x55ae71683fda]
>
>
> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ae714e554d]
>
>
> 12: (()+0x8064) [0x7f7cab6f9064]
>
>
> 13: (clone()+0x6d) [0x7f7ca8c0162d]
>
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
>
> interpret this.
>
>
>
>
> Thanks,
>
> Ashley Merrick
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon's crashing after updating

2017-07-18 Thread Ashley Merrick
Hello,

Thanks for the quick response. As it seems to be related to when I tried to enable 
the dashboard (if I’m correct), is there a way I can try to disable the 
dashboard via the admin socket etc., or another workaround until a version is 
released with the patch?

Thanks,
Ashley

Sent from my iPhone

On 18 Jul 2017, at 8:14 PM, John Spray 
> wrote:

On Tue, Jul 18, 2017 at 12:43 PM, Ashley Merrick 
> wrote:
Hello,



I just updated to latest CEPH Lum RC, all was working fine with my 3
Mon’s/Mgr’s online, went to enable the Dashboard with the command : ceph mgr
module enable dashboard



Now only one of the 3 MON’s will run, every time a try and start a failed
mon it will either fail or stay online and the running mon fail.

This is presumably a crash in the "log last" command that the
dashboard sends to the monitor on startup to load its log history.

It could be the same one that's fixed by
https://github.com/ceph/ceph/pull/16376/files

John




The following is showed in the error log:



2017-07-18 12:07:19.611915 has v0 lc 44232717

0> 2017-07-18 12:07:14.687001 7f7c9f2fc700 -1 *** Caught signal
(Segmentation fault) **

in thread 7f7c9f2fc700 thread_name:ms_dispatch



ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)

1: (()+0x87494c) [0x55ae716da94c]

2: (()+0xf890) [0x7f7cab700890]

3:
(LogMonitor::preprocess_command(boost::intrusive_ptr)+0x9af)
[0x55ae7124e3ef]

4: (LogMonitor::preprocess_query(boost::intrusive_ptr)+0x2de)
[0x55ae7124ed9e]

5: (PaxosService::dispatch(boost::intrusive_ptr)+0x7e8)
[0x55ae71335f58]

6: (Monitor::handle_command(boost::intrusive_ptr)+0x294f)
[0x55ae7120bb7f]

7: (Monitor::dispatch_op(boost::intrusive_ptr)+0x8b9)
[0x55ae71212099]

8: (Monitor::_ms_dispatch(Message*)+0xd99) [0x55ae71213999]

9: (Monitor::ms_dispatch(Message*)+0x23) [0x55ae7123ad43]

10: (DispatchQueue::entry()+0x7ca) [0x55ae71683fda]

11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ae714e554d]

12: (()+0x8064) [0x7f7cab6f9064]

13: (clone()+0x6d) [0x7f7ca8c0162d]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.



Thanks,
Ashley Merrick


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mon's crashing after updating

2017-07-18 Thread John Spray
On Tue, Jul 18, 2017 at 12:43 PM, Ashley Merrick  wrote:
> Hello,
>
>
>
> I just updated to latest CEPH Lum RC, all was working fine with my 3
> Mon’s/Mgr’s online, went to enable the Dashboard with the command : ceph mgr
> module enable dashboard
>
>
>
> Now only one of the 3 MON’s will run, every time a try and start a failed
> mon it will either fail or stay online and the running mon fail.

This is presumably a crash in the "log last" command that the
dashboard sends to the monitor on startup to load its log history.

It could be the same one that's fixed by
https://github.com/ceph/ceph/pull/16376/files

John

>
>
>
> The following is showed in the error log:
>
>
>
> 2017-07-18 12:07:19.611915 has v0 lc 44232717
>
>  0> 2017-07-18 12:07:14.687001 7f7c9f2fc700 -1 *** Caught signal
> (Segmentation fault) **
>
> in thread 7f7c9f2fc700 thread_name:ms_dispatch
>
>
>
> ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
>
> 1: (()+0x87494c) [0x55ae716da94c]
>
> 2: (()+0xf890) [0x7f7cab700890]
>
> 3:
> (LogMonitor::preprocess_command(boost::intrusive_ptr)+0x9af)
> [0x55ae7124e3ef]
>
> 4: (LogMonitor::preprocess_query(boost::intrusive_ptr)+0x2de)
> [0x55ae7124ed9e]
>
> 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x7e8)
> [0x55ae71335f58]
>
> 6: (Monitor::handle_command(boost::intrusive_ptr)+0x294f)
> [0x55ae7120bb7f]
>
> 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0x8b9)
> [0x55ae71212099]
>
> 8: (Monitor::_ms_dispatch(Message*)+0xd99) [0x55ae71213999]
>
> 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55ae7123ad43]
>
> 10: (DispatchQueue::entry()+0x7ca) [0x55ae71683fda]
>
> 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ae714e554d]
>
> 12: (()+0x8064) [0x7f7cab6f9064]
>
> 13: (clone()+0x6d) [0x7f7ca8c0162d]
>
> NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
>
> Thanks,
> Ashley Merrick
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Gencer W . Genç
>> Are you sure? Your config didn't show this.

Yes. I have a dedicated 10GbE network between ceph nodes. Each ceph node has 
a separate network with a 10GbE network card. Do I have to set 
anything in the config for 10GbE?

>> What kind of devices are they? did you do the journal test?
They are neither NVMe nor SSDs. Each node has 10x3TB SATA Hard Disk 
Drives (HDD).


-Gencer.


-Original Message-
From: Peter Maloney [mailto:peter.malo...@brockmann-consult.de] 
Sent: Tuesday, July 18, 2017 2:47 PM
To: gen...@gencgiyen.com
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Yet another performance tuning for CephFS

On 07/17/17 22:49, gen...@gencgiyen.com wrote:
> I have a seperate 10GbE network for ceph and another for public.
>
Are you sure? Your config didn't show this.

> No they are not NVMe, unfortunately.
>
What kind of devices are they? did you do the journal test?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Unlike most tests, with ceph journals, you can't look at the load on the device 
and decide it's not the bottleneck; you have to test it another way. I had some 
micron SSDs I tested which performed poorly, and that test showed them 
performing poorly too. But from other benchmarks, and disk load during journal 
tests, they looked ok, which was misleading.
> Do you know any test command that i can try to see if this is the max.
> Read speed from rsync?
I don't know how you can improve your rsync test.
>
> Because I tried one thing a few minutes ago. I opened 4 ssh channels 
> and ran rsync to copy a big file to different targets in cephfs 
> at the same time. Then I looked at the network graphs and I saw numbers 
> up to 1.09 gb/s. But why can a single copy/rsync not exceed 200mb/s? I 
> really wonder what prevents it.
>
> Gencer.
>
> On 2017-07-17 23:24, Peter Maloney wrote:
>> You should have a separate public and cluster network. And journal or 
>> wal/db performance is important... are the devices fast NVMe?
>>
>> On 07/17/17 21:31, gen...@gencgiyen.com wrote:
>>
>>> Hi,
>>>
>>> I located and applied almost every different tuning setting/config 
>>> over the internet. I couldn’t manage to speed up my speed one byte 
>>> further. It is always same speed whatever I do.
>>>
>>> I was on jewel, now I tried BlueStore on Luminous. Still exact same 
>>> speed I gain from cephfs.
>>>
>>> It doesn’t matter if I disable debug log, or remove [osd] section as 
>>> below and re-add as below (see .conf). Results are exactly the same. 
>>> Not a single byte is gained from those tunings. I also did tuning 
>>> for kernel (sysctl.conf).
>>>
>>> Basics:
>>>
>>> I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each 
>>> node has 24 cores and 64GB of RAM. Ceph nodes are connected via 
>>> 10GbE NIC. No FUSE used. But tried that too. Same results.
>>>
>>> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>>>
>>> 10+0 records in
>>>
>>> 10+0 records out
>>>
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>>>
>>> 182MB/s. This is the best speed i get so far. Usually 170~MB/s. Hm..
>>> I get much much much higher speeds on different filesystems. Even 
>>> with glusterfs. Is there anything I can do or try?
>>>
>>> Read speed is also around 180-220MB/s but not higher.
>>>
>>> This is What I am using on ceph.conf:
>>>
>>> [global]
>>>
>>> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>>>
>>> mon_initial_members = server1
>>>
>>> mon_host = 192.168.0.1
>>>
>>> auth_cluster_required = cephx
>>>
>>> auth_service_required = cephx
>>>
>>> auth_client_required = cephx
>>>
>>> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>>
>>> osd_mkfs_type = xfs
>>>
>>> osd pool default size = 2
>>>
>>> enable experimental unrecoverable data corrupting features = 
>>> bluestore rocksdb
>>>
>>> bluestore fsck on mount = true
>>>
>>> rbd readahead disable after bytes = 0
>>>
>>> rbd readahead max bytes = 4194304
>>>
>>> log to syslog = false
>>>
>>> debug_lockdep = 0/0
>>>
>>> debug_context = 0/0
>>>
>>> debug_crush = 0/0
>>>
>>> debug_buffer = 0/0
>>>
>>> debug_timer = 0/0
>>>
>>> debug_filer = 0/0
>>>
>>> debug_objecter = 0/0
>>>
>>> debug_rados = 0/0
>>>
>>> debug_rbd = 0/0
>>>
>>> debug_journaler = 0/0
>>>
>>> debug_objectcatcher = 0/0
>>>
>>> debug_client = 0/0
>>>
>>> debug_osd = 0/0
>>>
>>> debug_optracker = 0/0
>>>
>>> debug_objclass = 0/0
>>>
>>> debug_filestore = 0/0
>>>
>>> debug_journal = 0/0
>>>
>>> debug_ms = 0/0
>>>
>>> debug_monc = 0/0
>>>
>>> debug_tp = 0/0
>>>
>>> debug_auth = 0/0
>>>
>>> debug_finisher = 0/0
>>>
>>> debug_heartbeatmap = 0/0
>>>
>>> debug_perfcounter = 0/0
>>>
>>> debug_asok = 0/0
>>>
>>> debug_throttle = 0/0
>>>
>>> debug_mon = 0/0
>>>
>>> debug_paxos = 0/0
>>>
>>> debug_rgw = 0/0
>>>
>>> [osd]
>>>
>>> osd max write size = 512
>>>
>>> osd client 

Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-18 Thread Dan van der Ster
Hi Martin,

We had sortbitwise set on other jewel clusters well before 10.2.9 was out.
10.2.8 added the warning if it is not set, but the flag should be safe
in 10.2.6.

-- Dan



On Tue, Jul 18, 2017 at 11:43 AM, Martin Palma  wrote:
> Can the "sortbitwise" also be set if we have a cluster running OSDs on
> 10.2.6 and some OSDs on 10.2.9? Or should we wait that all OSDs are on
> 10.2.9?
>
> Monitor nodes are already on 10.2.9.
>
> Best,
> Martin
>
> On Fri, Jul 14, 2017 at 1:16 PM, Dan van der Ster  wrote:
>> On Mon, Jul 10, 2017 at 5:06 PM, Sage Weil  wrote:
>>> On Mon, 10 Jul 2017, Luis Periquito wrote:
 Hi Dan,

 I've enabled it in a couple of big-ish clusters and had the same
 experience - a few seconds disruption caused by a peering process
 being triggered, like any other crushmap update does. Can't remember
 if it triggered data movement, but I have a feeling it did...
>>>
>>> That's consistent with what one should expect.
>>>
>>> The flag triggers a new peering interval, which means the PGs will peer,
>>> but there is no change in the mapping or data layout or anything else.
>>> The only thing that is potentially scary here is that *every* PG will
>>> repeer at the same time.
>>
>> Thanks Sage & Luis. I confirm that setting sortbitwise on a large
>> cluster is basically a non-event... nothing to worry about.
>>
>> (Btw, we just upgraded our biggest prod clusters to jewel -- that also
>> went totally smooth!)
>>
>> -- Dan
>>
>>> sage
>>>
>>>



 On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster  
 wrote:
 > Hi all,
 >
 > With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
 >
 > I just updated a test cluster, saw that warning, then did the necessary
 >   ceph osd set sortbitwise
 >
 > I noticed a short re-peering which took around 10s on this small
 > cluster with very little data.
 >
 > Has anyone done this already on a large cluster with lots of objects?
 > It would be nice to hear that it isn't disruptive before running it on
 > our big production instances.
 >
 > Cheers, Dan
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Peter Maloney
On 07/17/17 22:49, gen...@gencgiyen.com wrote:
> I have a seperate 10GbE network for ceph and another for public.
>
Are you sure? Your config didn't show this.

> No they are not NVMe, unfortunately.
>
What kind of devices are they? did you do the journal test?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Unlike most tests, with ceph journals, you can't look at the load on the
device and decide it's not the bottleneck; you have to test it another
way. I had some micron SSDs I tested which performed poorly, and that
test showed them performing poorly too. But from other benchmarks, and
disk load during journal tests, they looked ok, which was misleading.
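
The test in that post boils down to a single-threaded sync write with fio; a
hedged sketch (the device name is an assumption, and this writes to the raw
device, so only point it at a disk or partition you can wipe):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
  --numjobs=1 --iodepth=1 --runtime=60 --time_based \
  --group_reporting --name=journal-test
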
> Do you know any test command that i can try to see if this is the max.
> Read speed from rsync?
I don't know how you can improve your rsync test.
>
> Because I tried one thing a few minutes ago. I opened 4 ssh channels
> and ran rsync to copy a big file to different targets in cephfs
> at the same time. Then I looked at the network graphs and I saw numbers
> up to 1.09 gb/s. But why can a single copy/rsync not exceed 200mb/s? I
> really wonder what prevents it.
>
> Gencer.
>
> On 2017-07-17 23:24, Peter Maloney wrote:
>> You should have a separate public and cluster network. And journal or
>> wal/db performance is important... are the devices fast NVMe?
>>
>> On 07/17/17 21:31, gen...@gencgiyen.com wrote:
>>
>>> Hi,
>>>
>>> I located and applied almost every different tuning setting/config
>>> over the internet. I couldn’t manage to speed up my speed one byte
>>> further. It is always same speed whatever I do.
>>>
>>> I was on jewel; now I tried BlueStore on Luminous. Still the exact same
>>> speed from cephfs.
>>>
>>> It doesn’t matter if I disable debug logging, or remove the [osd] section
>>> and re-add it as below (see .conf). The results are exactly the
>>> same. Not a single byte is gained from those tunings. I also did
>>> kernel tuning (sysctl.conf).
>>>
>>> Basics:
>>>
>>> I have 2 nodes with 10 OSDs each, and each OSD is a 3TB SATA drive. Each
>>> node has 24 cores and 64GB of RAM. The ceph nodes are connected via
>>> 10GbE NICs. No FUSE used (but I tried that too; same results).
>>>
>>> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>>>
>>> 10+0 records in
>>>
>>> 10+0 records out
>>>
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>>>
>>> 182 MB/s. This is the best speed I get so far; usually ~170 MB/s. Hmm...
>>> I get much higher speeds on different filesystems, even
>>> with glusterfs. Is there anything I can do or try?
>>>
>>> Read speed is also around 180-220MB/s but not higher.
>>>
>>> This is what I am using in ceph.conf:
>>>
>>> [global]
>>>
>>> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>>>
>>> mon_initial_members = server1
>>>
>>> mon_host = 192.168.0.1
>>>
>>> auth_cluster_required = cephx
>>>
>>> auth_service_required = cephx
>>>
>>> auth_client_required = cephx
>>>
>>> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>>
>>> osd_mkfs_type = xfs
>>>
>>> osd pool default size = 2
>>>
>>> enable experimental unrecoverable data corrupting features =
>>> bluestore rocksdb
>>>
>>> bluestore fsck on mount = true
>>>
>>> rbd readahead disable after bytes = 0
>>>
>>> rbd readahead max bytes = 4194304
>>>
>>> log to syslog = false
>>>
>>> debug_lockdep = 0/0
>>>
>>> debug_context = 0/0
>>>
>>> debug_crush = 0/0
>>>
>>> debug_buffer = 0/0
>>>
>>> debug_timer = 0/0
>>>
>>> debug_filer = 0/0
>>>
>>> debug_objecter = 0/0
>>>
>>> debug_rados = 0/0
>>>
>>> debug_rbd = 0/0
>>>
>>> debug_journaler = 0/0
>>>
>>> debug_objectcatcher = 0/0
>>>
>>> debug_client = 0/0
>>>
>>> debug_osd = 0/0
>>>
>>> debug_optracker = 0/0
>>>
>>> debug_objclass = 0/0
>>>
>>> debug_filestore = 0/0
>>>
>>> debug_journal = 0/0
>>>
>>> debug_ms = 0/0
>>>
>>> debug_monc = 0/0
>>>
>>> debug_tp = 0/0
>>>
>>> debug_auth = 0/0
>>>
>>> debug_finisher = 0/0
>>>
>>> debug_heartbeatmap = 0/0
>>>
>>> debug_perfcounter = 0/0
>>>
>>> debug_asok = 0/0
>>>
>>> debug_throttle = 0/0
>>>
>>> debug_mon = 0/0
>>>
>>> debug_paxos = 0/0
>>>
>>> debug_rgw = 0/0
>>>
>>> [osd]
>>>
>>> osd max write size = 512
>>>
>>> osd client message size cap = 2147483648
>>>
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>
>>>
>>> filestore xattr use omap = true
>>>
>>> osd_op_threads = 8
>>>
>>> osd disk threads = 4
>>>
>>> osd map cache size = 1024
>>>
>>> filestore_queue_max_ops = 25000
>>>
>>> filestore_queue_max_bytes = 10485760
>>>
>>> filestore_queue_committing_max_ops = 5000
>>>
>>> filestore_queue_committing_max_bytes = 1048576
>>>
>>> journal_max_write_entries = 1000
>>>
>>> journal_queue_max_ops = 3000
>>>
>>> journal_max_write_bytes = 1048576000
>>>
>>> journal_queue_max_bytes = 1048576000
>>>
>>> filestore_max_sync_interval = 15
>>>
>>> filestore_merge_threshold = 20

[ceph-users] Mon's crashing after updating

2017-07-18 Thread Ashley Merrick
Hello,

I just updated to the latest Ceph Luminous RC. All was working fine with my 3 
Mon's/Mgr's online, then I went to enable the dashboard with the command: ceph mgr 
module enable dashboard

Now only one of the 3 MONs will run; every time I try to start a failed mon, 
it will either fail, or stay online while the running mon fails.

The following is showed in the error log:

2017-07-18 12:07:19.611915 has v0 lc 44232717
 0> 2017-07-18 12:07:14.687001 7f7c9f2fc700 -1 *** Caught signal 
(Segmentation fault) **
in thread 7f7c9f2fc700 thread_name:ms_dispatch

ceph version 12.1.1 (f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
1: (()+0x87494c) [0x55ae716da94c]
2: (()+0xf890) [0x7f7cab700890]
3: (LogMonitor::preprocess_command(boost::intrusive_ptr)+0x9af) 
[0x55ae7124e3ef]
4: (LogMonitor::preprocess_query(boost::intrusive_ptr)+0x2de) 
[0x55ae7124ed9e]
5: (PaxosService::dispatch(boost::intrusive_ptr)+0x7e8) 
[0x55ae71335f58]
6: (Monitor::handle_command(boost::intrusive_ptr)+0x294f) 
[0x55ae7120bb7f]
7: (Monitor::dispatch_op(boost::intrusive_ptr)+0x8b9) 
[0x55ae71212099]
8: (Monitor::_ms_dispatch(Message*)+0xd99) [0x55ae71213999]
9: (Monitor::ms_dispatch(Message*)+0x23) [0x55ae7123ad43]
10: (DispatchQueue::entry()+0x7ca) [0x55ae71683fda]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ae714e554d]
12: (()+0x8064) [0x7f7cab6f9064]
13: (clone()+0x6d) [0x7f7ca8c0162d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

Thanks,
Ashley Merrick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Gencer W . Genç
Patrick,

I did timing tests. Rsync is not a tool I should trust for speed tests. I 
simply did "cp" and extra write tests to the ceph cluster. It is very fast 
indeed. Rsync copies a 1GB file slowly and takes 5-7 seconds to 
complete; cp does it in 0.901s (not even 1 second).
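For a like-for-like comparison with the rsync and dd numbers it is worth
including the flush in the timing, since a plain cp into a kernel CephFS mount
returns as soon as the data sits in the page cache. Reusing the paths from the
earlier mails:

$ time sh -c 'cp ./bigfile /mnt/cephfs/targetfile && sync'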

So false alarm here. Ceph is fast enough. I also did stress tests (such as 
multiple background writes at the same time) and they are very stable too.

Thanks for the heads up to you and all others.

Gencer.

-Original Message-
From: Patrick Donnelly [mailto:pdonn...@redhat.com] 
Sent: Monday, July 17, 2017 11:21 PM
To: gen...@gencgiyen.com
Cc: Ceph Users 
Subject: Re: [ceph-users] Yet another performance tuning for CephFS

On Mon, Jul 17, 2017 at 1:08 PM,   wrote:
> But lets try another. Lets say i have a file in my server which is 
> 5GB. If i do this:
>
> $ rsync ./bigfile /mnt/cephfs/targetfile --progress
>
> Then i see max. 200 mb/s. I think it is still slow :/ Is this an expected?

Perhaps that is the bandwidth limit of your local device rsync is reading from?

--
Patrick Donnelly

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Updating 12.1.0 -> 12.1.1

2017-07-18 Thread Marc Roos
 
I just updated packages on one CentOS7 node and getting these errors:

Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537510 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.537725 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.567250 7f4fa1c14e40 -1 
WARNING: the following dangerous and experimental features are enabled: 
bluestore
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1 
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev 
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.589008 7f4fa1c14e40 -1 
mon.a@-1(probing).mgrstat failed to decode mgrstat state; luminous dev 
version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1 
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous 
dev version?
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.724836 7f4f977d9700 -1 
mon.a@0(synchronizing).mgrstat failed to decode mgrstat state; luminous 
dev version?
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function 
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED 
assert(msg)
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function 
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230
Jul 18 12:03:34 c01 ceph-mon: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: 100: FAILED 
assert(msg)
Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1 
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 18 12:03:34 c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x110) [0x7f4fa21f4310]
Jul 18 12:03:34 c01 ceph-mon: 2: 
(Monitor::handle_forward(boost::intrusive_ptr)+0xd70) 
[0x7f4fa1fddcd0]
Jul 18 12:03:34 c01 ceph-mon: 3: 
(Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d) 
[0x7f4fa1fdb29d]
Jul 18 12:03:34 c01 ceph-mon: 4: (Monitor::_ms_dispatch(Message*)+0x7de) 
[0x7f4fa1fdc06e]
Jul 18 12:03:34 c01 ceph-mon: 5: (Monitor::ms_dispatch(Message*)+0x23) 
[0x7f4fa2004303]
Jul 18 12:03:34 c01 ceph-mon: 6: (DispatchQueue::entry()+0x792) 
[0x7f4fa242c812]
Jul 18 12:03:34 c01 ceph-mon: 7: 
(DispatchQueue::DispatchThread::entry()+0xd) [0x7f4fa229a3cd]
Jul 18 12:03:34 c01 ceph-mon: 8: (()+0x7dc5) [0x7f4fa0fbedc5]
Jul 18 12:03:34 c01 ceph-mon: 9: (clone()+0x6d) [0x7f4f9e34a76d]
Jul 18 12:03:34 c01 ceph-mon: NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
Jul 18 12:03:34 c01 ceph-mon: 2017-07-18 12:03:34.872654 7f4f977d9700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARC
H/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/1
2.1.1/rpm/el7/BUILD/ceph-12.1.1/src/messages/MForward.h: In function 
'PaxosServiceMessage* MForward::claim_message()' thread 7f4f977d9700 
time 2017-07-18 12:03:34.870230
Jul 18 12:03:34 c01 ceph-mon: ceph version 12.1.1 
(f3e663a190bf2ed12c7e3cda288b9a159572c800) luminous (rc)
Jul 18 12:03:34 c01 ceph-mon: 1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x110) [0x7f4fa21f4310]
Jul 18 12:03:34 c01 ceph-mon: 2: 
(Monitor::handle_forward(boost::intrusive_ptr)+0xd70) 
[0x7f4fa1fddcd0]
Jul 18 12:03:34 c01 ceph-mon: 3: 
(Monitor::dispatch_op(boost::intrusive_ptr)+0xd8d) 
[0x7f4fa1fdb29d]
Jul 18 12:03:34 c01 ceph-mon: 4: (Monitor::_ms_dispatch(Message*)+0x7de) 
[0x7f4fa1fdc06e]
Jul 18 12:03:34 c01 ceph-mon: 5: (Monitor::ms_dispatch(Message*)+0x23) 
[0x7f4fa2004303]
Jul 18 12:03:34 c01 ceph-mon: 6: (DispatchQueue::entry()+0x792) 

Re: [ceph-users] How's cephfs going?

2017-07-18 Thread David McBride
On Mon, 2017-07-17 at 02:59 +, 许雪寒 wrote:
> Hi, everyone.
> 
> We intend to use cephfs of Jewel version, however, we don’t know its status.
> Is it production ready in Jewel? Does it still have lots of bugs? Is it a
> major effort of the current ceph development? And who is using cephfs now?

Hello,

I've been using CephFS in production in a small-scale, conservative deployment
for the past year — to provide a small backing store, spanning two datacenters,
for a highly-available web service that supplies host configuration data and
APT package repositories for a fleet of ~1000 Linux workstations.

The motivation for using CephFS was not performance, but availability and
correctness.  It replaced a fairly complicated stack involving corosync,
pacemaker, XFS, DRBD, and mdadm.

Previous production systems using DRBD had proven unreliable in practice;
network glitches between datacenters would cause DRBD to enter a split-brain
state — which, for this application, was tolerable.

Bad, however, was when DRBD failed to keep the two halves of the mirrored
filesystem in sync — at one point, they had diverged to include over a gigabyte
of differences, which caused XFS to snap to read-only mode when it detected
internal inconsistencies.

CephFS, by contrast, has been remarkably solid, even in the face of networking
interruptions.  As well as remaining available, data integrity has been
problem-free.

(In addition to Ceph's own scrubbing capabilities, I've had automated
monitoring checks verifying the file-sizes and checksums of files in each APT
repository against those recorded in their indexes, with zero discrepancies in
a year of operation.)

Write performance wasn't terrific, but my deployment has been on constrained
hardware: running all of the Ceph daemons, plus the local kernel mount, plus
the nginx web server, all on the same hosts — with quadruple replication and
only a single gigabit link per machine — is not recommended for write
throughput.

The only issue I came across was a bug in the ceph.ko driver, which would
occasionally trip a null pointer exception in the kernel.  It was possible to
avoid this bug by mounting with the noasyncreaddir flag; the bug has long since
been fixed.

Kind regards,
David
-- 
David McBride 
Unix Specialist, University Information Services

signature.asc
Description: This is a digitally signed message part
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hammer -> jewel 10.2.8 upgrade and setting sortbitwise

2017-07-18 Thread Martin Palma
Can the "sortbitwise" flag also be set if we have a cluster running some OSDs
on 10.2.6 and some OSDs on 10.2.9? Or should we wait until all OSDs are on
10.2.9?

Monitor nodes are already on 10.2.9.
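
(For what it's worth, a quick way to see which OSDs are still on the older
version is to ask them directly with something like:

  ceph tell osd.* version

which prints the version string per OSD.)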

Best,
Martin

On Fri, Jul 14, 2017 at 1:16 PM, Dan van der Ster  wrote:
> On Mon, Jul 10, 2017 at 5:06 PM, Sage Weil  wrote:
>> On Mon, 10 Jul 2017, Luis Periquito wrote:
>>> Hi Dan,
>>>
>>> I've enabled it in a couple of big-ish clusters and had the same
>>> experience - a few seconds disruption caused by a peering process
>>> being triggered, like any other crushmap update does. Can't remember
>>> if it triggered data movement, but I have a feeling it did...
>>
>> That's consistent with what one should expect.
>>
>> The flag triggers a new peering interval, which means the PGs will peer,
>> but there is no change in the mapping or data layout or anything else.
>> The only thing that is potentially scary here is that *every* PG will
>> repeer at the same time.
>
> Thanks Sage & Luis. I confirm that setting sortbitwise on a large
> cluster is basically a non-event... nothing to worry about.
>
> (Btw, we just upgraded our biggest prod clusters to jewel -- that also
> went totally smooth!)
>
> -- Dan
>
>> sage
>>
>>
>>>
>>>
>>>
>>> On Mon, Jul 10, 2017 at 3:17 PM, Dan van der Ster  
>>> wrote:
>>> > Hi all,
>>> >
>>> > With 10.2.8, ceph will now warn if you didn't yet set sortbitwise.
>>> >
>>> > I just updated a test cluster, saw that warning, then did the necessary
>>> >   ceph osd set sortbitwise
>>> >
>>> > I noticed a short re-peering which took around 10s on this small
>>> > cluster with very little data.
>>> >
>>> > Has anyone done this already on a large cluster with lots of objects?
>>> > It would be nice to hear that it isn't disruptive before running it on
>>> > our big production instances.
>>> >
>>> > Cheers, Dan
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Installing ceph on Centos 7.3

2017-07-18 Thread Marc Roos
 
We are running on 

Linux c01 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux
CentOS Linux release 7.3.1611 (Core)

And we didn’t have any issues installing/upgrading, but we are not using 
ceph-deploy. In fact, I am surprised at how easy it is to install.
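
For reference, a manual install is little more than a repo file plus yum. A
minimal sketch, assuming you want the Luminous el7 packages from
download.ceph.com (adjust rpm-luminous to rpm-jewel etc. as needed):

# /etc/yum.repos.d/ceph.repo
[ceph]
name=Ceph packages
baseurl=https://download.ceph.com/rpm-luminous/el7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

# then
yum install ceph ceph-radosgw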




-Original Message-
From: Götz Reinicke - IT Koordinator 
[mailto:goetz.reini...@filmakademie.de] 
Sent: dinsdag 18 juli 2017 11:25
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Installing ceph on Centos 7.3

Hi,


Am 18.07.17 um 10:51 schrieb Brian Wallis:


I’m failing to get an install of ceph to work on a new Centos 
7.3.1611 server. I’m following the instructions at 
http://docs.ceph.com/docs/master/start/quick-ceph-deploy/ to no avail.  

First question, is it possible to install ceph on Centos 7.3 or 
should I choose a different version or different linux distribution to 
use for now?

<...>

we run CEPH Jewel 10.2.7 on RHEL 7.3. It is working.

Maybe another guide might help you through the installation steps?

https://www.virtualtothecore.com/en/quickly-build-a-new-ceph-cluster-with-ceph-deploy-on-centos-7/


Regards . Götz


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Installing ceph on Centos 7.3

2017-07-18 Thread Götz Reinicke - IT Koordinator
Hi,

Am 18.07.17 um 10:51 schrieb Brian Wallis:
> I’m failing to get an install of ceph to work on a new Centos 7.3.1611
> server. I’m following the instructions
> at http://docs.ceph.com/docs/master/start/quick-ceph-deploy/ to no
> avail. 
>
> First question, is it possible to install ceph on Centos 7.3 or should
> I choose a different version or different linux distribution to use
> for now?
<...>

we run CEPH Jewel 10.2.7 on RHEL 7.3. It is working.

Maybe another guide might help you through the installation steps?

https://www.virtualtothecore.com/en/quickly-build-a-new-ceph-cluster-with-ceph-deploy-on-centos-7/


Regards . Götz


smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another performance tuning for CephFS

2017-07-18 Thread Gencer W . Genç
I have 3 pools.

 

0 rbd,1 cephfs_data,2 cephfs_metadata

 

cephfs_data has a pg_num of 1024; the total pg count across pools is 2113.
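
For the full per-pool settings that were asked about (size, min_size, pg_num
and so on), the following should list them all:

$ ceph osd pool ls detail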

 

POOL_NAME       USED   OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD    WR_OPS WR
cephfs_data     4000M  1000    0      2000   0                  0       0        2      0     27443  44472M
cephfs_metadata 11505k 24      0      48     0                  0       0        38     8456k 7384   14719k
rbd             0      0       0      0      0                  0       0        0      0     0      0

 

total_objects1024

total_used   30575M

total_avail  55857G

total_space  55887G

 

 

 

 

 

From: David Turner [mailto:drakonst...@gmail.com] 
Sent: Tuesday, July 18, 2017 2:31 AM
To: Gencer Genç ; Patrick Donnelly 
Cc: Ceph Users 
Subject: Re: [ceph-users] Yet another performance tuning for CephFS

 

What are your pool settings? That can affect your read/write speeds as much as 
anything in the ceph.conf file.

 

On Mon, Jul 17, 2017, 4:55 PM Gencer Genç wrote:

I don't think so.

Because I tried one thing a few minutes ago. I opened 4 ssh channels and
ran the rsync command, copying bigfile to different targets in cephfs at the
same time. Then I looked at the network graphs and saw numbers up to
1.09 GB/s. But why can't a single copy/rsync exceed 200 MB/s? I really
wonder what prevents that.

Gencer.


-Original Message-
From: Patrick Donnelly [mailto:pdonn...@redhat.com]
Sent: Monday, 17 July 2017 23:21
To: gen...@gencgiyen.com
Cc: Ceph Users
Subject: Re: [ceph-users] Yet another performance tuning for CephFS

On Mon, Jul 17, 2017 at 1:08 PM,   > wrote:
> But lets try another. Lets say i have a file in my server which is 5GB. If i
> do this:
>
> $ rsync ./bigfile /mnt/cephfs/targetfile --progress
>
> Then i see max. 200 mb/s. I think it is still slow :/ Is this an expected?

Perhaps that is the bandwidth limit of your local device rsync is reading from?

--
Patrick Donnelly

___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Installing ceph on Centos 7.3

2017-07-18 Thread Brian Wallis
I’m failing to get an install of ceph to work on a new Centos 7.3.1611 server. 
I’m following the instructions at 
http://docs.ceph.com/docs/master/start/quick-ceph-deploy/ to no avail.

First question, is it possible to install ceph on Centos 7.3 or should I choose 
a different version or different linux distribution to use for now?

When I run ceph-deploy on Centos 7.3 I get the following error.

[cephuser@ceph1 my-cluster]$ ceph-deploy install ceph2
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/cephuser/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.25): /bin/ceph-deploy install ceph2
[ceph_deploy.install][DEBUG ] Installing stable version hammer on cluster ceph 
hosts ceph2
[ceph_deploy.install][DEBUG ] Detecting platform for host ceph2 ...
[ceph2][DEBUG ] connection detected need for sudo
[ceph2][DEBUG ] connected to host: ceph2
[ceph2][DEBUG ] detect platform information from remote host
[ceph2][DEBUG ] detect machine type
[ceph_deploy.install][INFO  ] Distro info: CentOS Linux 7.3.1611 Core
[ceph2][INFO  ] installing ceph on ceph2
[ceph2][INFO  ] Running command: sudo yum clean all
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Cleaning repos: base epel extras grafana influxdb updates
[ceph2][DEBUG ] Cleaning up everything
[ceph2][INFO  ] adding EPEL repository
[ceph2][INFO  ] Running command: sudo yum -y install epel-release
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Determining fastest mirrors
[ceph2][DEBUG ]  * base: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * extras: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * updates: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ] Package matching epel-release-7-9.noarch already installed. 
Checking for update.
[ceph2][DEBUG ] Nothing to do
[ceph2][INFO  ] Running command: sudo yum -y install yum-priorities
[ceph2][DEBUG ] Loaded plugins: fastestmirror, priorities
[ceph2][DEBUG ] Loading mirror speeds from cached hostfile
[ceph2][DEBUG ]  * base: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * extras: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ]  * updates: 
centos.mirror.digitalpacific.com.au
[ceph2][DEBUG ] Package yum-plugin-priorities-1.1.31-40.el7.noarch already 
installed and latest version
[ceph2][DEBUG ] Nothing to do
[ceph2][DEBUG ] Configure Yum priorities to include obsoletes
[ceph2][WARNIN] check_obsoletes has been enabled for Yum priorities plugin
[ceph2][INFO  ] Running command: sudo rpm --import 
https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
[ceph2][INFO  ] Running command: sudo rpm -Uvh --replacepkgs 
http://ceph.com/rpm-hammer/el7/noarch/ceph-release-1-0.el7.noarch.rpm
[ceph2][DEBUG ] Retrieving 
http://ceph.com/rpm-hammer/el7/noarch/ceph-release-1-0.el7.noarch.rpm
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][WARNIN] error: open of Index failed: No such file or 
directory
[ceph2][WARNIN] error: open of of failed: No such file or directory
[ceph2][WARNIN] error: open of /rpm-hammer/ failed: No such file 
or directory
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][WARNIN] error: open of Index failed: No such file or directory
[ceph2][WARNIN] error: open of of failed: No such file or directory
[ceph2][WARNIN] error: open of /rpm-hammer/../ failed: No such file or 
directory
[ceph2][WARNIN] error: open of el6/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 24-Apr-2016 failed: No such file or directory
[ceph2][WARNIN] error: open of 00:05 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of el7/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 29-Aug-2016 failed: No such file or directory
[ceph2][WARNIN] error: open of 11:53 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of fc20/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 07-Apr-2015 failed: No such file or directory
[ceph2][WARNIN] error: open of 19:21 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of rhel6/ failed: No such file or 
directory
[ceph2][WARNIN] error: open of 07-Apr-2015 failed: No such file or directory
[ceph2][WARNIN] error: open of 19:22 failed: No such file or directory
[ceph2][WARNIN] error: -: not an rpm package (or package manifest):
[ceph2][WARNIN] error: open of  failed: No such file or 
directory
[ceph2][WARNIN] error: open of  failed: No such file or directory
[ceph2][ERROR ] RuntimeError: command returned 

Re: [ceph-users] installing specific version of ceph-common

2017-07-18 Thread Buyens Niels
I've been looking into this again and have been able to install it now (10.2.9 
is newest now instead of 10.2.8 when I first asked the question):

Looking at the dependency resolution, we can see it's going to install 
libradosstriper1 version 10.2.9 and, because of that, also librados2 10.2.9:
...
---> Package libradosstriper1.x86_64 1:10.2.9-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.9-0.el7 for package: 
1:libradosstriper1-10.2.9-0.el7.x86_64
...

Same for librgw2:
...
---> Package librgw2.x86_64 1:10.2.9-0.el7 will be installed
--> Processing Dependency: libfcgi.so.0()(64bit) for package: 
1:librgw2-10.2.9-0.el7.x86_64
...

So to install ceph-common with a specific version, you need to do:
yum install ceph-common-10.2.7 libradosstriper1-10.2.7 librgw2-10.2.7
This way it won't try to install v10.2.9 of librados2.

I still feel it's weird that it's trying to install newer versions as 
dependencies for a 10.2.7 package (looking at the dependencies being processed, 
there's no version constraint on librbd, libbabeltrace, libbabeltrace-ctf, 
libradosstriper and librgw, so it will install the newest versions it can find 
and, because of that, also upgrade librados2 to the newest version to satisfy 
those dependencies).
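
Another option, rather than spelling out every dragged-in dependency by hand,
might be to pin the versions after installation with the versionlock plugin (a
sketch, assuming yum-plugin-versionlock is available in your repos):

yum install yum-plugin-versionlock
yum versionlock add ceph-common librados2 librbd1 libradosstriper1 librgw2
yum versionlock list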

Complete resolve:
Resolving Dependencies
--> Running transaction check
---> Package ceph-common.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: python-rados = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: librbd1 = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: python-rbd = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: python-cephfs = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: libcephfs1 = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: librbd.so.1()(64bit) for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: libbabeltrace.so.1()(64bit) for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: libradosstriper.so.1()(64bit) for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: librgw.so.2()(64bit) for package: 
1:ceph-common-10.2.7-0.el7.x86_64
---> Package librados2.x86_64 1:10.2.7-0.el7 will be installed
--> Running transaction check
---> Package libbabeltrace.x86_64 0:1.2.4-3.el7 will be installed
---> Package libcephfs1.x86_64 1:10.2.7-0.el7 will be installed
---> Package libradosstriper1.x86_64 1:10.2.9-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.9-0.el7 for package: 
1:libradosstriper1-10.2.9-0.el7.x86_64
---> Package librbd1.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:librbd1-10.2.7-0.el7.x86_64
---> Package librgw2.x86_64 1:10.2.9-0.el7 will be installed
--> Processing Dependency: libfcgi.so.0()(64bit) for package: 
1:librgw2-10.2.9-0.el7.x86_64
---> Package python-cephfs.x86_64 1:10.2.7-0.el7 will be installed
---> Package python-rados.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:python-rados-10.2.7-0.el7.x86_64
---> Package python-rbd.x86_64 1:10.2.7-0.el7 will be installed
--> Running transaction check
---> Package fcgi.x86_64 0:2.4.0-25.el7 will be installed
---> Package librados2.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:python-rados-10.2.7-0.el7.x86_64
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:librbd1-10.2.7-0.el7.x86_64
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
---> Package librados2.x86_64 1:10.2.9-0.el7 will be installed
---> Package librbd1.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:librbd1-10.2.7-0.el7.x86_64
---> Package python-rados.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: librados2 = 1:10.2.7-0.el7 for package: 
1:python-rados-10.2.7-0.el7.x86_64
--> Finished Dependency Resolution

Fixed install:
yum install ceph-common-10.2.7 libradosstriper1-10.2.7 librgw2-10.2.7
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: ftp.belnet.be
 * epel: ftp.nluug.nl
 * extras: ftp.belnet.be
 * updates: ftp.belnet.be
Resolving Dependencies
--> Running transaction check
---> Package ceph-common.x86_64 1:10.2.7-0.el7 will be installed
--> Processing Dependency: python-rados = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: librbd1 = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64
--> Processing Dependency: python-rbd = 1:10.2.7-0.el7 for package: 
1:ceph-common-10.2.7-0.el7.x86_64

Re: [ceph-users] how to list and reset the scrub schedules

2017-07-18 Thread Dan van der Ster
On Fri, Jul 14, 2017 at 10:40 PM, Gregory Farnum  wrote:
> On Fri, Jul 14, 2017 at 5:41 AM Dan van der Ster  wrote:
>>
>> Hi,
>>
>> Occasionally we want to change the scrub schedule for a pool or whole
>> cluster, but we want to do this by injecting new settings without
>> restarting every daemon.
>>
>> I've noticed that in jewel, changes to scrub_min/max_interval and
>> deep_scrub_interval do not take immediate effect, presumably because
>> the scrub schedules are calculated in advance for all the PGs on an
>> OSD.
>>
>> Does anyone know how to list that scrub schedule for a given OSD?
>
>
> I'm not aware of any "scrub schedule" as such, just the constraints around
> when new scrubbing happens. What exactly were you doing previously that
> isn't working now?

Take this for example:

2017-07-18 10:03:30.600486 7f02f7a54700 20 osd.1 123582
scrub_random_backoff lost coin flip, randomly backing off
2017-07-18 10:03:31.600558 7f02f7a54700 20 osd.1 123582
can_inc_scrubs_pending0 -> 1 (max 1, active 0)
2017-07-18 10:03:31.600565 7f02f7a54700 20 osd.1 123582
scrub_time_permit should run between 0 - 24 now 10 = yes
2017-07-18 10:03:31.600592 7f02f7a54700 20 osd.1 123582
scrub_load_below_threshold loadavg 0.85 < max 5 = yes
2017-07-18 10:03:31.600603 7f02f7a54700 20 osd.1 123582 sched_scrub
load_is_low=1
2017-07-18 10:03:31.600605 7f02f7a54700 30 osd.1 123582 sched_scrub
examine 38.127 at 2017-07-18 10:08:01.148612
2017-07-18 10:03:31.600608 7f02f7a54700 10 osd.1 123582 sched_scrub
38.127 scheduled at 2017-07-18 10:08:01.148612 > 2017-07-18
10:03:31.600562
2017-07-18 10:03:31.600611 7f02f7a54700 20 osd.1 123582 sched_scrub done

PG 38.127 is the next registered scrub on osd.1. AFAICT, "registered"
means that there exists a ScrubJob for this PG, with a sched_time
(time of the last scrub + a random interval) and a deadline (time of
the last scrub + the scrub max interval).

(Question: how many scrubs are registered at a given time on an OSD?
Just this one that is printed in the tick loop, or several?)

Anyway, I decrease the min and max scrub intervals for that pool,
hoping to make it scrub right away:

# ceph osd pool set testing-images scrub_min_interval 60
set pool 38 scrub_min_interval to 60
# ceph osd pool set testing-images scrub_max_interval 86400
set pool 38 scrub_max_interval to 86400


But the registered ScrubJob doesn't change -- what I called the "scrub
schedule" doesn't change:

2017-07-18 10:06:53.622286 7f02f7a54700 20 osd.1 123584
scrub_random_backoff lost coin flip, randomly backing off
2017-07-18 10:06:54.622403 7f02f7a54700 20 osd.1 123584
can_inc_scrubs_pending0 -> 1 (max 1, active 0)
2017-07-18 10:06:54.622409 7f02f7a54700 20 osd.1 123584
scrub_time_permit should run between 0 - 24 now 10 = yes
2017-07-18 10:06:54.622436 7f02f7a54700 20 osd.1 123584
scrub_load_below_threshold loadavg 1.16 < max 5 = yes
2017-07-18 10:06:54.622446 7f02f7a54700 20 osd.1 123584 sched_scrub
load_is_low=1
2017-07-18 10:06:54.622449 7f02f7a54700 30 osd.1 123584 sched_scrub
examine 38.127 at 2017-07-18 10:08:01.148612
2017-07-18 10:06:54.622452 7f02f7a54700 10 osd.1 123584 sched_scrub
38.127 scheduled at 2017-07-18 10:08:01.148612 > 2017-07-18
10:06:54.622408
2017-07-18 10:06:54.622455 7f02f7a54700 20 osd.1 123584 sched_scrub done


I'm looking for a way to reset those registered scrubs, so that the
new intervals can take effect (without restarting OSDs).
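
(A brute-force workaround, presumably, is to just kick the scrubs off by hand
instead of waiting for the registered jobs, e.g.:

  ceph pg deep-scrub 38.127   # one PG
  ceph osd deep-scrub 1       # every PG for which osd.1 is primary

but it would be nicer if the registered schedule simply picked up the new
intervals.)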


Cheers, Dan

>
>>
>>
>> And better yet, does anyone know a way to reset that schedule, so that
>> the OSD generates a new one with the new configuration?
>>
>> (I've noticed that by chance setting sortbitwise triggers many scrubs
>> -- maybe a new peering interval resets the scrub schedules?) Any
>> non-destructive way to trigger a new peering interval on demand?
>>
>> Cheers,
>>
>> Dan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] XFS attempt to access beyond end of device

2017-07-18 Thread Dan van der Ster
On Tue, Jul 18, 2017 at 6:08 AM, Marcus Furlong  wrote:
> On 22 March 2017 at 05:51, Dan van der Ster  wrote:
>> On Wed, Mar 22, 2017 at 8:24 AM, Marcus Furlong 
>> wrote:
>>> Hi,
>>>
>>> I'm experiencing the same issue as outlined in this post:
>>>
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013330.html
>>>
>>> I have also deployed this jewel cluster using ceph-deploy.
>>>
>>> This is the message I see at boot (happens for all drives, on all OSD
>>> nodes):
>>>
>>> [ 92.938882] XFS (sdi1): Mounting V5 Filesystem
>>> [ 93.065393] XFS (sdi1): Ending clean mount
>>> [ 93.175299] attempt to access beyond end of device
>>> [ 93.175304] sdi1: rw=0, want=19134412768, limit=19134412767
>>>
>>> and again while the cluster is in operation:
>>>
>>> [429280.254400] attempt to access beyond end of device
>>> [429280.254412] sdi1: rw=0, want=19134412768, limit=19134412767
>>>
>>
>> We see these as well, and I'm also curious what's causing it. Perhaps
>> sgdisk is doing something wrong when creating the ceph-data partition?
>
> Apologies for reviving an old thread, but I figured out what happened and
> never documented it, so I thought an update might be useful.
>
> The disk layout I've ascertained is as follows:
>
> sector 0 = protective MBR (or empty)
> sectors 1 to 33 = GPT (33 sectors)
> sectors 34 to 2047 = free (as confirmed by sgdisk -f -E)
> sectors 2048 to 19134414814 (19134412767 sectors: Data Partition 1)
> sectors 19134414815 to 19134414847 (33 sectors: GPT backup data)
>
> And the error:
>
> [ 92.938882] XFS (sdi1): Mounting V5 Filesystem
> [ 93.065393] XFS (sdi1): Ending clean mount
> [ 93.175299] attempt to access beyond end of device
> [ 93.175304] sdi1: rw=0, want=19134412768, limit=19134412767
>
> This shows that the error occurs when trying to access sector 19134412768 of
> Partition 1 which, as we can see from the above, doesn't exist.
>
> I noticed that the file system size is 3.5KiB less than the size of the
> partition, and the XFS block size is 4KiB.
>
> EMDS = 19134412767 * 512 = 9796819336704 <- actual partition size
> CDS = 9567206383 * 1024 = 9796819336192 (512 bytes less than EMDS) <- oddly
> /proc/partitions reports 512 bytes less, because it's using 1024 bytes as
> the unit
> FSS = 2391801595 * 4096 = 9796819333120 (3072 bytes less than CDS) <-
> filesystem
>
> It turns out, if I create a partition that matches the block size of the XFS
> filesystem, then the error does not occur. i.e. no error when the filesystem
> starts _and_ ends on a partition boundary.
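>
> A quick sanity check on an existing partition, reusing the numbers above and
> assuming 512-byte logical sectors, is whether its sector count is a multiple
> of 8 (8 x 512 B = 4 KiB):
>
> # blockdev --getsz /dev/sdi1  <- prints the size in 512-byte sectors (19134412767 here)
> # echo $((19134412767 % 8))   <- prints 7, so a 4KiB-aligned start leaves a misaligned end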
>
> When this happens, e.g. as follows, then there is no issue. This partition
> is 7 sectors smaller than the one referenced above.
>
> # sgdisk --new=0:2048:19134414807 -- /dev/sdi
> Creating new GPT entries.
> The operation has completed successfully.
>
> # sgdisk -p /dev/sdi
> Disk /dev/sdf: 19134414848 sectors, 8.9 TiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): 3E61A8BA-838A-4D7E-BB8E-293972EB45AE
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 19134414814
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2021 sectors (1010.5 KiB)
>
> When the end of the partition is not aligned to the 4KiB blocks used by XFS,
> the error occurs. This explains why the defaults from parted work correctly,
> as the 1MiB "padding" is 4K-aligned.
>
> This non-alignment happens because ceph-deploy uses sgdisk, and sgdisk seems
> to align the start of the partition with 2048-sector boundaries, but _not_
> the end of the partition, when used with the -L parameter.
>
> The fix was to recreate the partition table, and reduce the unused sectors
> down to the max filesystem size:
>
> https://gist.github.com/furlongm/292aefa930f40dc03f21693d1fc19f35
>
> In my testing, I could only reproduce this with XFS, not with other
> filesystems. It can be reproduced on smaller XFS filesystems but seems to
> take more time.

Great work. I've tested it (in print mode) and it seems to detect things
correctly here:

/dev/sdz1
OSD ID : 88
Partition size in sectors : 11721043087
Sector size   : 512
Partition size in bytes   : 6001174060544
XFS block size: 4096
# of XFS blocks   : 1465130385
XFS filsystem size: 6001174056960
Unused sectors: 7
Unused bytes (unused sector count * sector size) : 3584
Unused bytes (partition size - filesystem size)  : 3584
Filesystem is not correctly aligned to partition boundary :-(
systemctl stop ceph-osd@88
umount /dev/sdz1
sgdisk --delete=1 -- /dev/sdz
sgdisk --new=1:2048:11721045127 --change-name=1:"ceph data"
--partition-guid=1:c0832f78-5d7c-49f7-a133-786424b8b491
--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdz
partprobe /dev/sdz
xfs_repair /dev/sdz1
sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sdz


But one thing is still unclear to me. sgdisk is not aligning the end
of the partition -- 

[ceph-users] Ceph MDS Q Size troubleshooting

2017-07-18 Thread James Wilkins
Hello list,

I'm looking for some more information relating to CephFS and the 'Q'
size, specifically how to diagnose what contributes towards it rising.

Ceph Version: 11.2.0.0
OS: CentOS 7
Kernel (Ceph Servers): 3.10.0-514.10.2.el7.x86_64
Kernel (CephFS Clients): 4.4.76-1.el7.elrepo.x86_64 - using kernel
mount
Storage: 8 OSD Servers, 2TB NVME (P3700) in front of 6 x 6TB Disks
(bcache)
2 pools for CephFS

pool 1 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 1984 flags
hashpspool crash_replay_interval 45 stripe_width 0
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 256 pgp_num 256 last_change 40695 flags
hashpspool stripe_width 0


Average client IO is between 1000-2000 op/s and 150-200MB/s

We track the q size attribute coming out of ceph daemon
/var/run/ceph/ perf dump mds q into prometheus on a regular
basis, and this figure is always north of 5K.
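
(For reference, that counter can be read straight from the admin socket,
assuming the socket is named after the MDS id, with something like:

$ ceph daemon mds.<id> perf dump mds q

which returns a small JSON blob of the form {"mds": {"q": ...}}.)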

When we run into performance issues/sporadic failovers of the MDS
servers, this figure is the warning sign, and it normally peaks at >50K
prior to an issue occurring.

I've attached a sample graph showing the last 12 hours of the q figure
as an example




Does anyone have any suggestions as to where to look for what is causing
this Q size to climb?






signature.asc
Description: This is a digitally signed message part
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 答复: How's cephfs going?

2017-07-18 Thread Brady Deetz
We have a cephfs data pool with 52.8M files stored in 140.7M objects. That
translates to a metadata pool size of 34.6MB across 1.5M objects.
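
(Figures like these are easy to pull from the pool statistics, e.g.:

$ ceph df detail
$ rados df

though the exact columns shown vary a little between releases.)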

On Jul 18, 2017 12:54 AM, "Blair Bethwaite" 
wrote:

> We are a data-intensive university, with an increasingly large fleet
> of scientific instruments capturing various types of data (mostly
> imaging of one kind or another). That data typically needs to be
> stored, protected, managed, shared, connected/moved to specialised
> compute for analysis. Given the large variety of use-cases we are
> being somewhat more circumspect it our CephFS adoption and really only
> dipping toes in the water, ultimately hoping it will become a
> long-term default NAS choice from Luminous onwards.
>
> On 18 July 2017 at 15:21, Brady Deetz  wrote:
> > All of that said, you could also consider using rbd and zfs or whatever
> filesystem you like. That would allow you to gain the benefits of scaleout
> while still getting a feature rich fs. But, there are some down sides to
> that architecture too.
>
> We do this today (KVMs with a couple of large RBDs attached via
> librbd+QEMU/KVM), but the throughput able to be achieved this way is
> nothing like native CephFS - adding more RBDs doesn't seem to help
> increase overall throughput. Also, if you have NFS clients you will
> absolutely need SSD ZIL. And of course you then have a single point of
> failure and downtime for regular updates etc.
>
> In terms of small file performance I'm interested to hear about
> experiences with in-line file storage on the MDS.
>
> Also, while we're talking about CephFS - what size metadata pools are
> people seeing on their production systems with 10s-100s millions of
> files?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com