Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread xie.xingguo
> Xie, does that sound right?


yeah, looks right to me.

Original Message

From: Sage Weil
To: Sergey Dolgov;
Cc: Gregory Farnum; ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org; 谢型果 (Xie Xingguo) 10072465;
Date: 2019-01-03 11:05
Subject: Re: [ceph-users] size of inc_osdmap vs osdmap


I think that code was broken by 
ea723fbb88c69bd00fefd32a3ee94bf5ce53569c and should be fixed like so:

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 8376a40668..12f468636f 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -1006,7 +1006,8 @@ void OSDMonitor::prime_pg_temp(
   int next_up_primary, next_acting_primary;
   next.pg_to_up_acting_osds(pgid, _up, _up_primary,
_acting, _acting_primary);
-  if (acting == next_acting && next_up != next_acting)
+  if (acting == next_acting &&
+  !(up != acting && next_up == next_acting))
 return;  // no change since last epoch
 
   if (acting.empty())


The original intent was to clear out pg_temps during priming, but as 
written it would set a new_pg_temp item clearing the pg_temp even if one 
didn't already exist.  Adding the up != acting condition in there makes us 
only take that path if there is an existing pg_temp entry to remove.

Xie, does that sound right?

sage


On Thu, 3 Jan 2019, Sergey Dolgov wrote:

> >
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
> >
> I mean,  that all pgs have "up == acting && next_up == next_acting" but at
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> condition
> "next_up != next_acting" false and we clear acting for all pgs at
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1018 after
> that all pg fall into inc_osdmap
> I think https://github.com/ceph/ceph/pull/25724 change behavior to
> correct(as was before commit
> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c)
> for pg with up == acting && next_up == next_acting
> 
> On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum  wrote:
> 
> >
> >
> > On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:
> >
> >> We investigated the issue and set debug_mon up to 20 during little change
> >> of osdmap get many messages for all pgs of each pool (for all cluster):
> >>
> >>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
> >>> []
> >>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
> >>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
> >>
> >> though no pg_temps are created as result(no single backfill)
> >>
> >> We suppose this behavior changed in commit
> >> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
> >> because earlier function *OSDMonitor::prime_pg_temp* should return in
> >> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> >> like in jewel
> >> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
> >>
> >> i accept that we may be mistaken
> >>
> >
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
> >
> > What would probably be most helpful is if you can dump out one of those
> > over-large incremental osdmaps and see what's using up all the space. (You
> > may be able to do it through the normal Ceph CLI by querying the monitor?
> > Otherwise if it's something very weird you may need to get the
> > ceph-dencoder tool and look at it with that.)
> > -Greg
> >
> >
> >>
> >>
> >> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
> >> wrote:
> >>
> >>> Hmm that does seem odd. How are you looking at those sizes?
> >>>
> >>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
> >>>
>  Greq, for example for our cluster ~1000 osd:
> 
>  size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
>  modified 2018-12-12 04:00:17.661731)
>  size osdmap.1357882__0_F7FE772D__none = 363KB
>  size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
>  modified 2018-12-12 04:00:27.385702)
>  size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
> 
>  difference between epoch 1357881 and 1357883: crush weight one osd was
>  increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
>  

[ceph-users] Mimic 13.2.3?

2019-01-02 Thread Ashley Merrick
Have just run an apt update and noticed there are some Ceph packages now
available for update on my Mimic cluster / Ubuntu.

Have yet to install them, but it looks like we have the next point release of
Ceph Mimic, though I'm not able to see any release notes or official comms
yet?..
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Ceph-users] Multisite-Master zone still in recover mode

2019-01-02 Thread Amit Ghadge
Hi,

We followed the steps at http://docs.ceph.com/docs/master/radosgw/multisite/ to
migrate our single-site setup to a master zone and then set up a secondary zone.
We did not delete the existing data, and all objects synced to the secondary
zone, but the master zone still shows shards in recovery mode; dynamic
resharding is disabled.

Master zone
# radosgw-admin sync status
  realm 2c642eee-46e0-488e-8566-6a58878c1a95 (movie)
  zonegroup b569583b-ae34-4798-bb7c-a79de191b7dd (us)
   zone 2929a077-6d81-48ee-bf64-3503dcdf2d46 (us-west)
  metadata sync no sync (zone is master)
  data sync source: 5bcbf11e-5626-4773-967d-6d22decb44c0 (us-east)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
128 shards are recovering
recovering shards:
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]


Secondary zone
#  radosgw-admin sync status
  realm 2c642eee-46e0-488e-8566-6a58878c1a95 (movie)
  zonegroup b569583b-ae34-4798-bb7c-a79de191b7dd (us)
   zone 5bcbf11e-5626-4773-967d-6d22decb44c0 (us-east)
  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: 2929a077-6d81-48ee-bf64-3503dcdf2d46 (us-west)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with
source


After we pushed objects to the master zone, the objects synced to the secondary
zone and the master zone again started showing recovery mode.

So my question is: is this normal behavior?
We are running ceph version 12.2.9.
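
In case it helps to narrow this down, it might be worth checking whether the
recovering shards are stuck on actual sync errors; on the master zone (us-west
here) something like:

# radosgw-admin sync error list

should list any recent data-sync errors.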
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any way to see enabled/disabled status of bucket sync?

2019-01-02 Thread Konstantin Shalygin

I had no clue there was a bucket sync status command.
This command is available via the radosgw CLI as `radosgw-admin bucket sync 
status` (though it is hidden from `--help`, as is the `radosgw-admin reshard 
status` command).
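
For example (the bucket name is just a placeholder):

$ radosgw-admin bucket sync status --bucket=mybucket
$ radosgw-admin reshard status --bucket=mybucket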



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Sage Weil
I think that code was broken by 
ea723fbb88c69bd00fefd32a3ee94bf5ce53569c and should be fixed like so:

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 8376a40668..12f468636f 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -1006,7 +1006,8 @@ void OSDMonitor::prime_pg_temp(
   int next_up_primary, next_acting_primary;
   next.pg_to_up_acting_osds(pgid, _up, _up_primary,
_acting, _acting_primary);
-  if (acting == next_acting && next_up != next_acting)
+  if (acting == next_acting &&
+  !(up != acting && next_up == next_acting))
 return;  // no change since last epoch
 
   if (acting.empty())


The original intent was to clear out pg_temps during priming, but as 
written it would set a new_pg_temp item clearing the pg_temp even if one 
didn't already exist.  Adding the up != acting condition in there makes us 
only take that path if there is an existing pg_temp entry to remove.

Xie, does that sound right?

sage


On Thu, 3 Jan 2019, Sergey Dolgov wrote:

> >
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
> >
> I mean,  that all pgs have "up == acting && next_up == next_acting" but at
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> condition
> "next_up != next_acting" false and we clear acting for all pgs at
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1018 after
> that all pg fall into inc_osdmap
> I think https://github.com/ceph/ceph/pull/25724 change behavior to
> correct(as was before commit
> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c)
> for pg with up == acting && next_up == next_acting
> 
> On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum  wrote:
> 
> >
> >
> > On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:
> >
> >> We investigated the issue and set debug_mon up to 20 during little change
> >> of osdmap get many messages for all pgs of each pool (for all cluster):
> >>
> >>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_tempnext_up === next_acting now, clear pg_temp
> >>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
> >>> []
> >>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
> >>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
> >>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
> >>
> >> though no pg_temps are created as result(no single backfill)
> >>
> >> We suppose this behavior changed in commit
> >> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
> >> because earlier function *OSDMonitor::prime_pg_temp* should return in
> >> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> >> like in jewel
> >> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
> >>
> >> i accept that we may be mistaken
> >>
> >
> > Well those commits made some changes, but I'm not sure what about them
> > you're saying is wrong?
> >
> > What would probably be most helpful is if you can dump out one of those
> > over-large incremental osdmaps and see what's using up all the space. (You
> > may be able to do it through the normal Ceph CLI by querying the monitor?
> > Otherwise if it's something very weird you may need to get the
> > ceph-dencoder tool and look at it with that.)
> > -Greg
> >
> >
> >>
> >>
> >> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
> >> wrote:
> >>
> >>> Hmm that does seem odd. How are you looking at those sizes?
> >>>
> >>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
> >>>
>  Greq, for example for our cluster ~1000 osd:
> 
>  size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
>  modified 2018-12-12 04:00:17.661731)
>  size osdmap.1357882__0_F7FE772D__none = 363KB
>  size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
>  modified 2018-12-12 04:00:27.385702)
>  size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
> 
>  difference between epoch 1357881 and 1357883: crush weight one osd was
>  increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
>  inc_osdmap so huge
> 
>  чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
>  >
>  > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov 
>  wrote:
>  >>
>  >> Hi guys
>  >>
>  >> I faced strange behavior of crushmap change. When I change crush
>  >> 

Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Sergey Dolgov
>
> Well those commits made some changes, but I'm not sure what about them
> you're saying is wrong?
>
I mean that all PGs have "up == acting && next_up == next_acting", but at
https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
the condition "next_up != next_acting" is false, so we clear acting for all PGs at
https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1018, and after
that all PGs fall into the inc_osdmap.
I think https://github.com/ceph/ceph/pull/25724 changes the behavior back to the
correct one (as it was before commit
https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c)
for PGs with up == acting && next_up == next_acting.

On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum  wrote:

>
>
> On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:
>
>> We investigated the issue and set debug_mon up to 20 during little change
>> of osdmap get many messages for all pgs of each pool (for all cluster):
>>
>>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
>>> []
>>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
>>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
>>
>> though no pg_temps are created as result(no single backfill)
>>
>> We suppose this behavior changed in commit
>> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
>> because earlier function *OSDMonitor::prime_pg_temp* should return in
>> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
>> like in jewel
>> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
>>
>> i accept that we may be mistaken
>>
>
> Well those commits made some changes, but I'm not sure what about them
> you're saying is wrong?
>
> What would probably be most helpful is if you can dump out one of those
> over-large incremental osdmaps and see what's using up all the space. (You
> may be able to do it through the normal Ceph CLI by querying the monitor?
> Otherwise if it's something very weird you may need to get the
> ceph-dencoder tool and look at it with that.)
> -Greg
>
>
>>
>>
>> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
>> wrote:
>>
>>> Hmm that does seem odd. How are you looking at those sizes?
>>>
>>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
>>>
 Greq, for example for our cluster ~1000 osd:

 size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
 modified 2018-12-12 04:00:17.661731)
 size osdmap.1357882__0_F7FE772D__none = 363KB
 size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
 modified 2018-12-12 04:00:27.385702)
 size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB

 difference between epoch 1357881 and 1357883: crush weight one osd was
 increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
 inc_osdmap so huge

 чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
 >
 > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov 
 wrote:
 >>
 >> Hi guys
 >>
 >> I faced strange behavior of crushmap change. When I change crush
 >> weight osd I sometimes get  increment osdmap(1.2MB) which size is
 >> significantly bigger than size of osdmap(0.4MB)
 >
 >
 > This is probably because when CRUSH changes, the new primary OSDs for
 a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
 reassigns it to the old acting set, so the data can be accessed while the
 new OSDs get backfilled. Depending on the size of your cluster, the number
 of PGs on it, and the size of the CRUSH change, this can easily be larger
 than the rest of the map because it is data with size linear in the number
 of PGs affected, instead of being more normally proportional to the number
 of OSDs.
 > -Greg
 >
 >>
 >> I use luminois 12.2.8. Cluster was installed a long ago, I suppose
 >> that initially it was firefly
 >> How can I view content of increment osdmap or can you give me opinion
 >> on this problem. I think that spikes of traffic tight after change of
 >> crushmap relates to this crushmap behavior
 >> ___
 >> ceph-users mailing list
 >> ceph-users@lists.ceph.com
 >> 

Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Sergey Dolgov
Thanks Greg

I dumped the inc_osdmap to a file:

ceph-dencoder type OSDMap::Incremental import \
    ./inc\\uosdmap.1378266__0_B7F36FFA__none decode dump_json > inc_osdmap.txt

There are 52330 pgs (the cluster has 52332 pgs) in the 'new_pg_temp' structure,
and for all of them the osd list is empty. For example, a short excerpt:

  {
    "osds": [],
    "pgid": "3.0"
  },
  {
    "osds": [],
    "pgid": "3.1"
  },
  {
    "osds": [],
    "pgid": "3.2"
  },
  {
    "osds": [],
    "pgid": "3.3"
  },

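If it is useful, counting how many of those entries are empty clears is a
one-liner against that dump (assuming jq is available):

$ jq '[.new_pg_temp[] | select(.osds == [])] | length' inc_osdmap.txt
52330
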

On Thu, Jan 3, 2019 at 2:13 AM Gregory Farnum  wrote:

>
>
> On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:
>
>> We investigated the issue and set debug_mon up to 20 during little change
>> of osdmap get many messages for all pgs of each pool (for all cluster):
>>
>>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
>>> []
>>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
>>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
>>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
>>
>> though no pg_temps are created as result(no single backfill)
>>
>> We suppose this behavior changed in commit
>> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
>> because earlier function *OSDMonitor::prime_pg_temp* should return in
>> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
>> like in jewel
>> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
>>
>> i accept that we may be mistaken
>>
>
> Well those commits made some changes, but I'm not sure what about them
> you're saying is wrong?
>
> What would probably be most helpful is if you can dump out one of those
> over-large incremental osdmaps and see what's using up all the space. (You
> may be able to do it through the normal Ceph CLI by querying the monitor?
> Otherwise if it's something very weird you may need to get the
> ceph-dencoder tool and look at it with that.)
> -Greg
>
>
>>
>>
>> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
>> wrote:
>>
>>> Hmm that does seem odd. How are you looking at those sizes?
>>>
>>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
>>>
 Greq, for example for our cluster ~1000 osd:

 size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
 modified 2018-12-12 04:00:17.661731)
 size osdmap.1357882__0_F7FE772D__none = 363KB
 size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
 modified 2018-12-12 04:00:27.385702)
 size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB

 difference between epoch 1357881 and 1357883: crush weight one osd was
 increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
 inc_osdmap so huge

 чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
 >
 > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov 
 wrote:
 >>
 >> Hi guys
 >>
 >> I faced strange behavior of crushmap change. When I change crush
 >> weight osd I sometimes get  increment osdmap(1.2MB) which size is
 >> significantly bigger than size of osdmap(0.4MB)
 >
 >
 > This is probably because when CRUSH changes, the new primary OSDs for
 a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
 reassigns it to the old acting set, so the data can be accessed while the
 new OSDs get backfilled. Depending on the size of your cluster, the number
 of PGs on it, and the size of the CRUSH change, this can easily be larger
 than the rest of the map because it is data with size linear in the number
 of PGs affected, instead of being more normally proportional to the number
 of OSDs.
 > -Greg
 >
 >>
 >> I use luminois 12.2.8. Cluster was installed a long ago, I suppose
 >> that initially it was firefly
 >> How can I view content of increment osdmap or can you give me opinion
 >> on this problem. I think that spikes of traffic tight after change of
 >> crushmap relates to this crushmap behavior
 >> ___
 >> ceph-users mailing list
 >> ceph-users@lists.ceph.com
 >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 --
 Best regards, Sergey Dolgov

>>>
>>
>> --
>> Best regards, Sergey Dolgov
>>
>

-- 
Best regards, Sergey Dolgov

Re: [ceph-users] size of inc_osdmap vs osdmap

2019-01-02 Thread Gregory Farnum
On Thu, Dec 27, 2018 at 1:20 PM Sergey Dolgov  wrote:

> We investigated the issue and set debug_mon up to 20 during little change
> of osdmap get many messages for all pgs of each pool (for all cluster):
>
>> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789
>> prime_pg_tempnext_up === next_acting now, clear pg_temp
>> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming
>> []
>> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
>> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789
>> prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []
>
> though no pg_temps are created as result(no single backfill)
>
> We suppose this behavior changed in commit
> https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
> because earlier function *OSDMonitor::prime_pg_temp* should return in
> https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009
> like in jewel
> https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214
>
> i accept that we may be mistaken
>

Well those commits made some changes, but I'm not sure what about them
you're saying is wrong?

What would probably be most helpful is if you can dump out one of those
over-large incremental osdmaps and see what's using up all the space. (You
may be able to do it through the normal Ceph CLI by querying the monitor?
Otherwise if it's something very weird you may need to get the
ceph-dencoder tool and look at it with that.)
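
For the full maps, something along these lines should work from the CLI (epochs
taken from your earlier mail); the incremental itself probably still has to be
pulled out of the mon store and decoded with ceph-dencoder, since I'm not aware
of a getmap equivalent for incrementals:

$ ceph osd getmap 1357882 -o osdmap.1357882
$ osdmaptool osdmap.1357882 --print
$ ceph-dencoder type OSDMap import osdmap.1357882 decode dump_json > osdmap.json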
-Greg


>
>
> On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum 
> wrote:
>
>> Hmm that does seem odd. How are you looking at those sizes?
>>
>> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
>>
>>> Greq, for example for our cluster ~1000 osd:
>>>
>>> size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
>>> modified 2018-12-12 04:00:17.661731)
>>> size osdmap.1357882__0_F7FE772D__none = 363KB
>>> size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
>>> modified 2018-12-12 04:00:27.385702)
>>> size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
>>>
>>> difference between epoch 1357881 and 1357883: crush weight one osd was
>>> increased by 0.01 so we get 5 new pg_temp in osdmap.1357883 but size
>>> inc_osdmap so huge
>>>
>>> чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
>>> >
>>> > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov 
>>> wrote:
>>> >>
>>> >> Hi guys
>>> >>
>>> >> I faced strange behavior of crushmap change. When I change crush
>>> >> weight osd I sometimes get  increment osdmap(1.2MB) which size is
>>> >> significantly bigger than size of osdmap(0.4MB)
>>> >
>>> >
>>> > This is probably because when CRUSH changes, the new primary OSDs for
>>> a PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
>>> reassigns it to the old acting set, so the data can be accessed while the
>>> new OSDs get backfilled. Depending on the size of your cluster, the number
>>> of PGs on it, and the size of the CRUSH change, this can easily be larger
>>> than the rest of the map because it is data with size linear in the number
>>> of PGs affected, instead of being more normally proportional to the number
>>> of OSDs.
>>> > -Greg
>>> >
>>> >>
>>> >> I use luminois 12.2.8. Cluster was installed a long ago, I suppose
>>> >> that initially it was firefly
>>> >> How can I view content of increment osdmap or can you give me opinion
>>> >> on this problem. I think that spikes of traffic tight after change of
>>> >> crushmap relates to this crushmap behavior
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>> --
>>> Best regards, Sergey Dolgov
>>>
>>
>
> --
> Best regards, Sergey Dolgov
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any way to see enabled/disabled status of bucket sync?

2019-01-02 Thread Christian Rice
I had no clue there was a bucket sync status command.  And I expect that 
metadata get command will be useful going forward, as well.

Thanks for those!

From: ceph-users  on behalf of Casey Bodley 

Date: Wednesday, January 2, 2019 at 1:04 PM
To: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] any way to see enabled/disabled status of bucket sync?

Hi Christian,

The easiest way to do that is probably the 'radosgw-admin bucket sync
status' command, which will print "Sync is disabled for bucket ..." if
disabled. Otherwise, you could use 'radosgw-admin metadata get' to
inspect that flag in the bucket instance metadata.


On 12/31/18 2:20 PM, Christian Rice wrote:

Is there a command that will show me the current status of bucket sync
(enabled vs disabled)?

Referring to
https://github.com/ceph/ceph/blob/b5f33ae3722118ec07112a4fe1bb0bdedb803a60/src/rgw/rgw_admin.cc#L1626


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] TCP qdisc + congestion control / BBR

2019-01-02 Thread Kevin Olbrich
Hi!

I wonder if changing qdisc and congestion_control (for example fq with
Google BBR) on Ceph servers / clients has positive effects during high
load.
Google BBR: 
https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster

I am running a lot of VMs with BBR but the hypervisors run fq_codel +
cubic (OSDs run Ubuntu defaults).
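
For reference, switching a host over is just two sysctls, assuming a 4.9+
kernel with the tcp_bbr module available:

$ sudo sysctl -w net.core.default_qdisc=fq
$ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
$ sysctl net.core.default_qdisc net.ipv4.tcp_congestion_control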

Did someone test qdisc and congestion control settings?

Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any way to see enabled/disabled status of bucket sync?

2019-01-02 Thread Casey Bodley

Hi Christian,

The easiest way to do that is probably the 'radosgw-admin bucket sync 
status' command, which will print "Sync is disabled for bucket ..." if 
disabled. Otherwise, you could use 'radosgw-admin metadata get' to 
inspect that flag in the bucket instance metadata.
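
For example (bucket name is a placeholder; the instance id comes from the
metadata list output):

$ radosgw-admin bucket sync status --bucket=mybucket
$ radosgw-admin metadata list bucket.instance | grep mybucket
$ radosgw-admin metadata get bucket.instance:mybucket:<instance-id>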



On 12/31/18 2:20 PM, Christian Rice wrote:


Is there a command that will show me the current status of bucket sync 
(enabled vs disabled)?


Referring to 
https://github.com/ceph/ceph/blob/b5f33ae3722118ec07112a4fe1bb0bdedb803a60/src/rgw/rgw_admin.cc#L1626



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin unable to store user information

2019-01-02 Thread Casey Bodley


On 12/26/18 4:58 PM, Dilip Renkila wrote:

Hi all,

Some useful information

>> What do the following return?
>> $ radosgw-admin zone get

root@ctrl1:~# radosgw-admin zone get
{
    "id": "8bfdf8a3-c165-44e9-9ed6-deff8a5d852f",
    "name": "default",
    "domain_root": "default.rgw.meta:root",
    "control_pool": "default.rgw.control",
    "gc_pool": "default.rgw.log:gc",
    "lc_pool": "default.rgw.log:lc",
    "log_pool": "default.rgw.log",
    "intent_log_pool": "default.rgw.log:intent",
    "usage_log_pool": "default.rgw.log:usage",
    "reshard_pool": "default.rgw.log:reshard",
    "user_keys_pool": "default.rgw.meta:users.keys",
    "user_email_pool": "default.rgw.meta:users.email",
    "user_swift_pool": "default.rgw.meta:users.swift",
    "user_uid_pool": "default.rgw.meta:users.uid",
    "otp_pool": "default.rgw.otp",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
                "data_pool": "default.rgw.buckets.data",
                "data_extra_pool": "default.rgw.buckets.non-ec",
                "index_type": 0,
                "compression": ""
            }
        }
    ],
    "metadata_heap": "",
    "realm_id": ""
}

>> radosgw-admin user info --uid="0611e8fdb62b4b2892b62c7e7bf3767f$0611e8fdb62b4b2892b62c7e7bf3767f" --debug-ms=1 --debug-rgw=20 --debug-objecter=20 --log-to-stderr

https://etherpad.openstack.org/p/loPctEQWFU

>> $ rados lspools

root@ctrl1:~# rados lspools
cinder-volumes-sas
ephemeral-volumes
.rgw.root
rgw1
defaults.rgw.buckets.data
default.rgw.control
default.rgw.meta
defaults.rgw.buckets.index
default.rgw.log
cinder-volumes-nvme
default.rgw.buckets.index
images
default.rgw.buckets.data
Best Regards / Kind Regards

Dilip Renkila


Den ons 26 dec. 2018 kl 22:29 skrev Dilip Renkila 
mailto:dilip.renk...@linserv.se>>:


Hi all,

I have a ceph radosgw deployment as openstack swift backend with
multitenancy enabled in rgw.

I can create containers and store data through swift api.

I am trying to retrieve user data from radosgw-admin cli tool for
an user. I am able to get only admin user info but no one else.
$  radosgw-admin user info
--uid="0611e8fdb62b4b2892b62c7e7bf3767f$0611e8fdb62b4b2892b62c7e7bf3767f"
could not fetch user info: no user info saved

$  radosgw-admin user list
[
"0611e8fdb62b4b2892b62c7e7bf3767f$0611e8fdb62b4b2892b62c7e7bf3767f",
"32a7cd9b37bb40168200bae69015311a$32a7cd9b37bb40168200bae69015311a",
"2eea218eea984dd68f1378ea21c64b83$2eea218eea984dd68f1378ea21c64b83",
    "admin",
"032f07e376404586b53bb8c3bfd6d1d7$032f07e376404586b53bb8c3bfd6d1d7",
"afcf7fc3fd5844ea920c2028ebfa5832$afcf7fc3fd5844ea920c2028ebfa5832",
"5793054cd0fe4a018e959eb9081442a8$5793054cd0fe4a018e959eb9081442a8",
"d4f6c1bd190d40feb8379625bcf2bc39$d4f6c1bd190d40feb8379625bcf2bc39",
"8f411343b44143d2b116563c177ed93d$8f411343b44143d2b116563c177ed93d",
"0a49f61d66644fb2a10d664d5b79b1af$0a49f61d66644fb2a10d664d5b79b1af",
"a1dd449c9ce64345af2a7fb05c4aa21f$a1dd449c9ce64345af2a7fb05c4aa21f",
"a5442064c50a4b9bbf854d15748f99d4$a5442064c50a4b9bbf854d15748f99d4"
]



The general format of these object names is 'tenant$uid', so you may 
need to specify them separately, i.e. radosgw-admin user info --tenant= --uid=
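
If I'm reading your user list right, for the first entry that would be
something like:

$ radosgw-admin user info \
      --tenant=0611e8fdb62b4b2892b62c7e7bf3767f \
      --uid=0611e8fdb62b4b2892b62c7e7bf3767f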





Debug output
$ radosgw-admin user info
--uid="0611e8fdb62b4b2892b62c7e7bf3767f$0611e8fdb62b4b2892b62c7e7bf3767f"
--debug_rgw=20 --log-to-stderr
2018-12-26 22:25:10.722 7fbc4999e740 20 get_system_obj_state:
rctx=0x7ffcd45bfe20 obj=.rgw.root:default.realm
state=0x5571718d9000 s->prefetch_data=0
2018-12-26 22:25:10.722 7fbc24ff9700  2
RGWDataChangesLog::ChangesRenewThread: start
2018-12-26 22:25:10.726 7fbc4999e740 20 get_system_obj_state:
rctx=0x7ffcd45bf3d0 obj=.rgw.root:converted state=0x5571718d9000
s->prefetch_data=0
2018-12-26 22:25:10.730 7fbc4999e740 20 get_system_obj_state:
rctx=0x7ffcd45bee50 obj=.rgw.root:default.realm
state=0x5571718e35a0 s->prefetch_data=0
2018-12-26 22:25:10.730 7fbc4999e740 20 get_system_obj_state:
rctx=0x7ffcd45bef40 obj=.rgw.root:zonegroups_names.default
state=0x5571718e35a0 s->prefetch_data=0
2018-12-26 22:25:10.730 7fbc4999e740 20 get_system_obj_state:
s->obj_tag was set empty
2018-12-26 22:25:10.730 7fbc4999e740 20 rados->read ofs=0 len=524288
2018-12-26 22:25:10.730 7fbc4999e740 20 rados->read r=0 bl.length=46
2018-12-26 22:25:10.742 7fbc4999e740 20 RGWRados::pool_iterate:
got zonegroup_info.b7493bbe-a638-4950-a4d5-716919e5d150
2018-12-26 22:25:10.742 7fbc4999e740 20 RGWRados::pool_iterate:
got zonegroup_info.23e74943-f594-44cb-a3bb-3a2150804dd3
2018-12-26 22:25:10.742 7fbc4999e740 20 RGWRados::pool_iterate:
got zone_info.9be46480-91cb-437b-87e1-eb6eff862767
2018-12-26 22:25:10.742 7fbc4999e740 20 RGWRados::pool_iterate:
got zone_info.8bfdf8a3-c165-44e9-9ed6-deff8a5d852f
2018-12-26 22:25:10.742 

[ceph-users] Compacting omap data

2019-01-02 Thread Bryan Stillwell
Recently on one of our bigger clusters (~1,900 OSDs) running Luminous (12.2.8), 
we had a problem where OSDs would frequently get restarted while deep-scrubbing.

After digging into it I found that a number of the OSDs had very large omap 
directories (50GiB+).  I believe these were OSDs that had previously held PGs 
that were part of the .rgw.buckets.index pool, which I have recently moved to 
all SSDs; however, it seems like the data remained on the HDDs.

I was able to reduce the data usage on most of the OSDs (from ~50 GiB to < 200 
MiB!) by compacting the omap dbs offline by setting 'leveldb_compact_on_mount = 
true' in the [osd] section of ceph.conf, but that didn't work on the newer OSDs 
which use rocksdb.  On those I had to do an online compaction using a command 
like:

$ ceph tell osd.510 compact

That worked, but today when I tried doing that on some of the SSD-based OSDs 
which are backing .rgw.buckets.index I started getting slow requests and the 
compaction ultimately failed with this error:

$ ceph tell osd.1720 compact
osd.1720: Error ENXIO: osd down

When I tried it again it succeeded:

$ ceph tell osd.1720 compact
osd.1720: compacted omap in 420.999 seconds

The data usage on that OSD dropped from 57.8 GiB to 43.4 GiB which was nice, 
but I don't believe that'll get any smaller until I start splitting the PGs in 
the .rgw.buckets.index pool to better distribute that pool across the SSD-based 
OSDs.

The first question I have is what is the option to do an offline compaction of 
rocksdb so I don't impact our customers while compacting the rest of the 
SSD-based OSDs?

The next question is if there's a way to configure Ceph to automatically 
compact the omap dbs in the background in a way that doesn't affect user 
experience?

Finally, I was able to figure out that the omap directories were getting large 
because we're using filestore on this cluster, but how could someone determine 
this when using BlueStore?
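
For the last question, one thing that might help (assuming the counter names 
haven't changed): on BlueStore the omap data lives in RocksDB on the DB device, 
and the bluefs section of the OSD perf dump reports how much of that is in use, 
e.g.:

$ ceph daemon osd.510 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes}'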

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-02 Thread Konstantin Shalygin

On a medium sized cluster with device-classes, I am experiencing a
problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
  2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
  3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
  4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
  6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
  7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
  8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
43   ssd 0.87329  1.0  894GiB  438GiB  456GiB 49.00 1.10  92
13   ssd 0.43700  1.0  447GiB  249GiB  198GiB 55.78 1.25  72
14   ssd 0.43700  1.0  447GiB  290GiB  158GiB 64.76 1.46  71
15   ssd 0.43700  1.0  447GiB  368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700  1.0  447GiB  253GiB  194GiB 56.66 1.27  70
19   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.21 1.35  70
20   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.81 1.57  77
21   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.77 1.57  77
22   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.10 1.35  67
38   ssd 0.43660  1.0  447GiB  153GiB  295GiB 34.11 0.77  46
39   ssd 0.43660  1.0  447GiB  127GiB  320GiB 28.37 0.64  38
40   ssd 0.87329  1.0  894GiB  386GiB  508GiB 43.17 0.97  97
41   ssd 0.87329  1.0  894GiB  375GiB  520GiB 41.88 0.94 113

This leads to just 1.2TB free space (some GBs away from NEAR_FULL pool).
Currently, the balancer plugin is off because it immediately crashed
the MGR in the past (on 12.2.5).
Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I
am unable to find the bugtracker ID]

Would the balancer plugin correct this situation?
What happens if all MGRs die like they did on 12.2.5 because of the plugin?
Will the balancer take data from the most-unbalanced OSDs first?
Otherwise the OSD may fill up more then FULL which would cause the
whole pool to freeze (because the smallest OSD is taken into account
for free space calculation).
This would be the worst case as over 100 VMs would freeze, causing lot
of trouble. This is also the reason I did not try to enable the
balancer again.

Please read this [1], all about Balancer with upmap mode.

It's stable from 12.2.8 with upmap mode.
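
A minimal sketch of enabling it in upmap mode (note that upmap requires all 
clients to be luminous or newer):

$ ceph osd set-require-min-compat-client luminous
$ ceph balancer mode upmap
$ ceph balancer on
$ ceph balancer status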



k

[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/032002.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] list admin issues

2019-01-02 Thread David Galloway


On 12/28/18 4:13 AM, Ilya Dryomov wrote:
> On Sat, Dec 22, 2018 at 7:18 PM Brian :  wrote:
>>
>> Sorry to drag this one up again.
>>
>> Just got the unsubscribed due to excessive bounces thing.
>>
>> 'Your membership in the mailing list ceph-users has been disabled due
>> to excessive bounces The last bounce received from you was dated
>> 21-Dec-2018.  You will not get any more messages from this list until
>> you re-enable your membership.  You will receive 3 more reminders like
>> this before your membership in the list is deleted.'
>>
>> can anyone check MTA logs to see what the bounce is?
> 
> Me too.  Happens regularly and only on ceph-users, not on sepia or
> ceph-maintainers, etc.  David, Dan, could you or someone you know look
> into this?
> 

As far as I know, we don't have shell access to the mail servers for
those lists so I can't see what's going on behind the scenes.  I will
increase the bounce_score_threshold for now and change the list owner to
an active e-mail address (oops) who will get the bounce notifications.

The Bounce Processing settings are the same for ceph-users and
ceph-maintainers so I'm guessing the high volume of ceph-users@ is why
it's only happening on that list.

I think the plan is to move to a self-hosted mailman instance soon so
this shouldn't be an issue for much longer.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best way to update object ACL for many files?

2019-01-02 Thread Jin Mao
Ceph Users,

Updating ACLs and applying an S3 policy for millions of objects in a bucket
using s3cmd seems to be very slow. I experienced about 3-4 objects/second when
doing so.

Does anyone know a faster way to accomplish this task, either as a ceph user or
as a ceph admin?
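
In case it is useful, the per-object latency can at least be hidden by running
several s3cmd processes in parallel from the client side; a rough sketch
(bucket name and ACL are placeholders, and this assumes object keys without
spaces and that the URI is the 4th column of 's3cmd ls'):

$ s3cmd ls --recursive s3://mybucket | awk '{print $4}' | \
      xargs -P 16 -n 1 s3cmd setacl --acl-public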

Thank you.

Jin.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Thomas Byrne - UKRI STFC
>   In previous versions of Ceph, I was able to determine which PGs had
> scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
> provided that they were not already being scrubbed. In Luminous, the bad
> PG is not visible in "ceph --status" anywhere. Should I use something like
> "ceph health detail -f json-pretty" instead?

'ceph pg ls inconsistent' lists all inconsistent PGs.
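
A cron-style repair pass on top of that could look roughly like this 
(double-check the column layout of 'ceph pg ls' on your version first, and skip 
PGs that are currently scrubbing):

$ ceph pg ls inconsistent | awk '/^[0-9]+\./ {print $1}' | xargs -r -n 1 ceph pg repair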

>   Also, is it possible to configure Ceph to attempt repairing the bad PGs
> itself, as soon as the scrub fails? I run most of my OSDs on top of a bunch of
> old spinning disks, and a scrub error almost always means that there is a bad
> sector somewhere, which can easily be fixed by rewriting the lost data using
> "ceph pg repair".

I don't know of a good way to repair inconsistencies automatically from within 
Ceph. However, I seem to remember someone saying that with BlueStore OSDs, read 
errors are attempted to be fixed (by rewriting the unreadable replica/shard) 
when they are discovered during client reads. And there was a potential plan to 
do the same if they are discovered during scrubbing. I can't remember the 
details (this was a while ago, at Cephalocon APAC), so I may be completely off 
the mark here. 

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Konstantin Shalygin

Hello, Ceph users,

I am afraid the following question is a FAQ, but I still was not able
to find the answer:

I use ceph --status --format=json-pretty as a source of CEPH status
for my Nagios monitoring. After upgrading to Luminous, I see the following
in the JSON output when the cluster is not healthy:

 "summary": [
 {
 "severity": "HEALTH_WARN",
 "summary": "'ceph health' JSON format has changed in luminous. If 
you see this your monitoring system is scraping the wrong fields. Disable this with 'mon health 
preluminous compat warning = false'"
 }
 ],

Apart from that, the JSON data seems reasonable. My question is which part
of JSON structure are the "wrong fields" I have to avoid. Is it just the
"summary" section, or some other parts as well? Or should I avoid
the whole ceph --status and use something different instead?

What I want is a single machine-readable value with OK/WARNING/ERROR meaning,
and a single human-readable text line, describing the most severe
error condition which is currently present. What is the preferred way to
get this data in Luminous?

Thanks,

-Yenya


Check this [1] changeset for ceph_dash and this [2] for check_ceph_dash. 
This should answer your question.




k

[1] https://github.com/Crapworks/ceph-dash/pull/57/files

[2] https://github.com/Crapworks/check_ceph_dash/pull/5/files

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Jan Kasprzak
Thomas Byrne - UKRI STFC wrote:
: I recently spent some time looking at this, I believe the 'summary' and
: 'overall_status' sections are now deprecated. The 'status' and 'checks'
: fields are the ones to use now.

OK, thanks.

: The 'status' field gives you the OK/WARN/ERR, but returning the most
: severe error condition from the 'checks' section is less trivial. AFAIK
: all health_warn states are treated as equally severe, and same for
: health_err. We ended up formatting our single line human readable output
: as something like:
: 
: "HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 
large omap objects"

Speaking of scrub errors:

In previous versions of Ceph, I was able to determine which PGs had
scrub errors, and then a cron.hourly script ran "ceph pg repair" for them,
provided that they were not already being scrubbed. In Luminous, the bad PG
is not visible in "ceph --status" anywhere. Should I use something like
"ceph health detail -f json-pretty" instead?

Also, is it possible to configure Ceph to attempt repairing
the bad PGs itself, as soon as the scrub fails? I run most of my OSDs on top
of a bunch of old spinning disks, and a scrub error almost always means
that there is a bad sector somewhere, which can easily be fixed by
rewriting the lost data using "ceph pg repair".

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Thomas Byrne - UKRI STFC
I recently spent some time looking at this, I believe the 'summary' and 
'overall_status' sections are now deprecated. The 'status' and 'checks' fields 
are the ones to use now.

The 'status' field gives you the OK/WARN/ERR, but returning the most severe 
error condition from the 'checks' section is less trivial. AFAIK all 
health_warn states are treated as equally severe, and same for health_err. We 
ended up formatting our single line human readable output as something like:

"HEALTH_ERR: 1 inconsistent pg, HEALTH_ERR: 1 scrub error, HEALTH_WARN: 20 
large omap objects"

To make it obvious which check is causing which state. We needed to suppress 
specific checks for callouts, so had to look at each check and the resulting 
state. If you're not trying to do something similar there may be a more 
lightweight way to go about it.
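
For what it's worth, a minimal jq version of that extraction looks something 
like this (field names as of luminous, so worth checking against your own 
output):

$ ceph status -f json | jq -r '.health.status'
$ ceph health -f json | jq -r '.checks | to_entries[] | "\(.value.severity): \(.value.summary.message)"'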

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of Jan
> Kasprzak
> Sent: 02 January 2019 09:29
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] ceph health JSON format has changed sync?
> 
>   Hello, Ceph users,
> 
> I am afraid the following question is a FAQ, but I still was not able to find 
> the
> answer:
> 
> I use ceph --status --format=json-pretty as a source of CEPH status for my
> Nagios monitoring. After upgrading to Luminous, I see the following in the
> JSON output when the cluster is not healthy:
> 
> "summary": [
> {
> "severity": "HEALTH_WARN",
> "summary": "'ceph health' JSON format has changed in 
> luminous. If
> you see this your monitoring system is scraping the wrong fields. Disable this
> with 'mon health preluminous compat warning = false'"
> }
> ],
> 
> Apart from that, the JSON data seems reasonable. My question is which part
> of JSON structure are the "wrong fields" I have to avoid. Is it just the
> "summary" section, or some other parts as well? Or should I avoid the whole
> ceph --status and use something different instead?
> 
> What I want is a single machine-readable value with OK/WARNING/ERROR
> meaning, and a single human-readable text line, describing the most severe
> error condition which is currently present. What is the preferred way to get
> this data in Luminous?
> 
>   Thanks,
> 
> -Yenya
> 
> --
> | Jan "Yenya" Kasprzak  |
> | http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
>  This is the world we live in: the way to deal with computers is to google  
> the
> symptoms, and hope that you don't have to watch a video. --P. Zaitcev
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2019-01-02 Thread Thomas Byrne - UKRI STFC
Assuming I understand it correctly:

"pg_upmap_items 6.0 [40,20]" refers to replacing (upmapping?) osd.40 with 
osd.20 in the acting set of the placement group '6.0'. Assuming it's a 3 
replica PG, the other two OSDs in the set remain unchanged from the CRUSH 
calculation.

"pg_upmap_items 6.6 [45,46,59,56]" describes two upmap replacements for the PG 
6.6, replacing 45 with 46, and 59 with 56.

Hope that helps.

Cheers,
Tom

> -Original Message-
> From: ceph-users  On Behalf Of
> jes...@krogh.cc
> Sent: 30 December 2018 22:04
> To: Konstantin Shalygin 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Balancing cluster with large disks - 10TB HHD
> 
> >> I would still like to have a log somewhere to grep and inspect what
> >> balancer/upmap actually does - when in my cluster. Or some ceph
> >> commands that deliveres some monitoring capabilityes .. any
> >> suggestions?
> > Yes, on ceph-mgr log, when log level is DEBUG.
> 
> Tried the docs .. something like:
> 
> ceph tell mds ... does not seem to work.
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/
> 
> > You can get your cluster upmap's in via `ceph osd dump | grep upmap`.
> 
> Got it -- but I really need the README .. it shows the map ..
> ...
> pg_upmap_items 6.0 [40,20]
> pg_upmap_items 6.1 [59,57,47,48]
> pg_upmap_items 6.2 [59,55,75,9]
> pg_upmap_items 6.3 [22,13,40,39]
> pg_upmap_items 6.4 [23,9]
> pg_upmap_items 6.5 [25,17]
> pg_upmap_items 6.6 [45,46,59,56]
> pg_upmap_items 6.8 [60,54,16,68]
> pg_upmap_items 6.9 [61,69]
> pg_upmap_items 6.a [51,48]
> pg_upmap_items 6.b [43,71,41,29]
> pg_upmap_items 6.c [22,13]
> 
> ..
> 
> But .. I dont have any pg's that should only have 2 replicas.. neither any 
> with 4
> .. how should this be interpreted?
> 
> Thanks.
> 
> --
> Jesper
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage of devices in SSD pool vary very much

2019-01-02 Thread Kevin Olbrich
Hi!

On a medium sized cluster with device-classes, I am experiencing a
problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS
 2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
 3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
 4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
 6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
 7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
 8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
43   ssd 0.87329  1.0  894GiB  438GiB  456GiB 49.00 1.10  92
13   ssd 0.43700  1.0  447GiB  249GiB  198GiB 55.78 1.25  72
14   ssd 0.43700  1.0  447GiB  290GiB  158GiB 64.76 1.46  71
15   ssd 0.43700  1.0  447GiB  368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700  1.0  447GiB  253GiB  194GiB 56.66 1.27  70
19   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.21 1.35  70
20   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.81 1.57  77
21   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.77 1.57  77
22   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.10 1.35  67
38   ssd 0.43660  1.0  447GiB  153GiB  295GiB 34.11 0.77  46
39   ssd 0.43660  1.0  447GiB  127GiB  320GiB 28.37 0.64  38
40   ssd 0.87329  1.0  894GiB  386GiB  508GiB 43.17 0.97  97
41   ssd 0.87329  1.0  894GiB  375GiB  520GiB 41.88 0.94 113

This leads to just 1.2TB free space (some GBs away from NEAR_FULL pool).
Currently, the balancer plugin is off because it immediately crashed
the MGR in the past (on 12.2.5).
Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I
am unable to find the bugtracker ID]

Would the balancer plugin correct this situation?
What happens if all MGRs die like they did on 12.2.5 because of the plugin?
Will the balancer take data from the most-unbalanced OSDs first?
Otherwise the OSD may fill up to more than FULL, which would cause the
whole pool to freeze (because the smallest OSD is taken into account
for free space calculation).
This would be the worst case as over 100 VMs would freeze, causing lot
of trouble. This is also the reason I did not try to enable the
balancer again.

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Jan Kasprzak
Hello, Ceph users,

I am afraid the following question is a FAQ, but I still was not able
to find the answer:

I use ceph --status --format=json-pretty as a source of CEPH status
for my Nagios monitoring. After upgrading to Luminous, I see the following
in the JSON output when the cluster is not healthy:

"summary": [
{
"severity": "HEALTH_WARN",
"summary": "'ceph health' JSON format has changed in luminous. 
If you see this your monitoring system is scraping the wrong fields. Disable 
this with 'mon health preluminous compat warning = false'"
}
],

Apart from that, the JSON data seems reasonable. My question is which part
of JSON structure are the "wrong fields" I have to avoid. Is it just the
"summary" section, or some other parts as well? Or should I avoid
the whole ceph --status and use something different instead?

What I want is a single machine-readable value with OK/WARNING/ERROR meaning,
and a single human-readable text line, describing the most severe
error condition which is currently present. What is the preferred way to
get this data in Luminous?

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com