Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-29 Thread Marco Gaiarin


I reply to myself.

> I've added the new node and slowly added 4 new OSDs, but in the meantime an
> OSD (not one of the new ones, not on the node being removed) died. My situation now is:
>  root@blackpanther:~# ceph osd df tree
>  ID WEIGHT   REWEIGHT SIZE   USE   AVAIL  %USE  VAR  TYPE NAME   
>  -1 21.41985-  5586G 2511G  3074G 00 root default
>  -2  5.45996-  5586G 2371G  3214G 42.45 0.93 host capitanamerica 
>   0  1.81999  1.0  1862G  739G  1122G 39.70 0.87 osd.0   
>   1  1.81999  1.0  1862G  856G  1005G 46.00 1.00 osd.1   
>  10  0.90999  1.0   931G  381G   549G 40.95 0.89 osd.10  
>  11  0.90999  1.0   931G  394G   536G 42.35 0.92 osd.11  
>  -3  5.03996-  5586G 2615G  2970G 46.82 1.02 host vedovanera 
>   2  1.3  1.0  1862G  684G  1177G 36.78 0.80 osd.2   
>   3  1.81999  1.0  1862G 1081G   780G 58.08 1.27 osd.3   
>   4  0.90999  1.0   931G  412G   518G 44.34 0.97 osd.4   
>   5  0.90999  1.0   931G  436G   494G 46.86 1.02 osd.5   
>  -4  5.45996-   931G  583G   347G 00 host deadpool   
>   6  1.81999  1.0  1862G  898G   963G 48.26 1.05 osd.6   
>   7  1.81999  1.0  1862G  839G  1022G 45.07 0.98 osd.7   
>   8  0.909990  0 0  0 00 osd.8   
>   9  0.90999  1.0   931G  583G   347G 62.64 1.37 osd.9   
>  -5  5.45996-  5586G 2511G  3074G 44.96 0.98 host blackpanther   
>  12  1.81999  1.0  1862G  828G  1033G 44.51 0.97 osd.12  
>  13  1.81999  1.0  1862G  753G  1108G 40.47 0.88 osd.13  
>  14  0.90999  1.0   931G  382G   548G 41.11 0.90 osd.14  
>  15  0.90999  1.0   931G  546G   384G 58.66 1.28 osd.15  
> TOTAL 21413G 9819G 11594G 45.85  
>  MIN/MAX VAR: 0/1.37  STDDEV: 7.37
> 
> Perfectly healthy. But I've tried to slowly drain an OSD from
> 'vedovanera', using:
>   ceph osd crush reweight osd.2 <weight>
> As you can see, I've got down to weight 1.4 (from 1.81999), but if I go
> lower than that I get:
[...]
> recovery 2/2556513 objects degraded (0.000%)

It seems the trouble came from osd.8, which was down and out but still
present in the crush map (it still had weight 0.90999).

After removing osd.8 from the crush map, a massive rebalance started. Now I
can lower the weight of the OSDs on node vedovanera without getting any more
degraded objects.
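
For the record, the cleanup I did was roughly the following (only a sketch,
with my osd.8; adjust the id as needed):

  # osd.8 was down+out but still weighted in the crush map
  ceph osd tree | grep osd.8
  # drop it from the crush map, then delete its auth key and the osd entry
  ceph osd crush remove osd.8
  ceph auth del osd.8
  ceph osd rm 8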

I think I'm starting to understand how the CRUSH algorithm actually
works. ;-)

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Gregory Farnum
On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson  wrote:

On Sun, Nov 25, 2018 at 22:10, Stefan Kooman wrote:
> >
> > Hi List,
> >
> > Another interesting and unexpected thing we observed during cluster
> > expansion is the following. After we added extra disks to the cluster,
> > while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> > we did that, a couple of hundred objects would become degraded. During
> > that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> > weight host=$storage-node" would cause extra degraded objects.
> >
> > I don't expect objects to become degraded when extra OSDs are added.
> > Misplaced, yes. Degraded, no.
> >
> > Does someone have an explanation for this?
> >
>
> Yes, when you add a drive (or 10), some PGs decide they should have one or
> more replicas on the new drives. A new, empty copy of the PG is created
> there, and _then_ that copy puts the PG into the "degraded" state: if it had
> 3 fine active+clean replicas before, it now has 2 active+clean and one that
> needs backfill to get into shape.
>
> It is a slight mistake to report this in the same way as an error, even if
> it looks to the cluster just as if it were in error and needed fixing. This
> gives new ceph admins a sense of urgency or danger, whereas it should be
> perfectly normal to add space to a cluster. Also, ceph could have chosen to
> add a fourth copy of a repl=3 PG, fill it from the one going out into the
> new empty copy, and somehow keep itself with 3 working replicas, but it
> chooses to first discard one replica and then backfill into the empty one,
> leading to this kind of "error" report.
>

See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that, and as far as I've been able to establish,
it doesn't actually do that. Which makes these degraded object reports a
bit perplexing.

What we have worked out is that sometimes objects can be degraded because
the log-based recovery takes a while after the primary juggles around PG
set membership, and I suspect that's what is turning up here. The exact
cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years.
If a whole PG were missing you'd expect to see very large degraded
object counts (as opposed to the 2 that Marco reported).
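
(To check which throttles are in effect on a given OSD, something like the
following works; just a sketch, run on the node hosting the daemon, with
osd.0 as an example id:)

  # backfill/recovery throttle settings of a running OSD, via its admin socket
  ceph daemon osd.0 config show | egrep 'osd_max_backfills|osd_recovery_max_active'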

-Greg


>
> --
> May the most significant bit of your life be positive.


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Marco Gaiarin
Hello Janne Johansson!
  In that message it was said...

> It is a slight mistake to report this in the same way as an error, even if
> it looks to the cluster just as if it were in error and needed fixing.

I think I've hit a similar situation, and I also feel that something
has to be 'fixed'. I'm looking for an explanation...

I'm adding a node (blackpanther, 4 OSDs, done) and removing a
node (vedovanera[1], 4 OSDs, to be done).

I've added the new node and slowly added 4 new OSDs, but in the meantime an
OSD (not one of the new ones, not on the node being removed) died. My situation now is:

 root@blackpanther:~# ceph osd df tree
 ID WEIGHT   REWEIGHT SIZE   USE   AVAIL  %USE  VAR  TYPE NAME   
 -1 21.41985-  5586G 2511G  3074G 00 root default
 -2  5.45996-  5586G 2371G  3214G 42.45 0.93 host capitanamerica 
  0  1.81999  1.0  1862G  739G  1122G 39.70 0.87 osd.0   
  1  1.81999  1.0  1862G  856G  1005G 46.00 1.00 osd.1   
 10  0.90999  1.0   931G  381G   549G 40.95 0.89 osd.10  
 11  0.90999  1.0   931G  394G   536G 42.35 0.92 osd.11  
 -3  5.03996-  5586G 2615G  2970G 46.82 1.02 host vedovanera 
  2  1.3  1.0  1862G  684G  1177G 36.78 0.80 osd.2   
  3  1.81999  1.0  1862G 1081G   780G 58.08 1.27 osd.3   
  4  0.90999  1.0   931G  412G   518G 44.34 0.97 osd.4   
  5  0.90999  1.0   931G  436G   494G 46.86 1.02 osd.5   
 -4  5.45996-   931G  583G   347G 00 host deadpool   
  6  1.81999  1.0  1862G  898G   963G 48.26 1.05 osd.6   
  7  1.81999  1.0  1862G  839G  1022G 45.07 0.98 osd.7   
  8  0.909990  0 0  0 00 osd.8   
  9  0.90999  1.0   931G  583G   347G 62.64 1.37 osd.9   
 -5  5.45996-  5586G 2511G  3074G 44.96 0.98 host blackpanther   
 12  1.81999  1.0  1862G  828G  1033G 44.51 0.97 osd.12  
 13  1.81999  1.0  1862G  753G  1108G 40.47 0.88 osd.13  
 14  0.90999  1.0   931G  382G   548G 41.11 0.90 osd.14  
 15  0.90999  1.0   931G  546G   384G 58.66 1.28 osd.15  
TOTAL 21413G 9819G 11594G 45.85  
 MIN/MAX VAR: 0/1.37  STDDEV: 7.37

Perfectly healthy. But I've tried to slowly drain an OSD from
'vedovanera', using:

ceph osd crush reweight osd.2 <weight>

As you can see, I've got down to weight 1.4 (from 1.81999), but if I go
lower than that I get:

   cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
6 pgs backfill
1 pgs backfilling
7 pgs stuck unclean
recovery 2/2556513 objects degraded (0.000%)
recovery 7721/2556513 objects misplaced (0.302%)
 monmap e6: 6 mons at 
{0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0}
election epoch 2780, quorum 0,1,2,3,4,5 blackpanther,0,1,4,2,3
 osdmap e9302: 16 osds: 15 up, 15 in; 7 remapped pgs
  pgmap v54971897: 768 pgs, 3 pools, 3300 GB data, 830 kobjects
9911 GB used, 11502 GB / 21413 GB avail
2/2556513 objects degraded (0.000%)
7721/2556513 objects misplaced (0.302%)
 761 active+clean
   6 active+remapped+wait_backfill
   1 active+remapped+backfilling
  client io 9725 kB/s rd, 772 kB/s wr, 153 op/s

E.g., 2 objects 'degraded'. This really puzzles me.

Why?! Thanks.
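
(For reference, the stepwise drain mentioned above is essentially a loop like
this; only a sketch, with my osd.2 and arbitrary weight steps:)

  for w in 1.6 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0; do
      ceph osd crush reweight osd.2 $w
      # wait for the cluster to settle before taking the next step
      until ceph health | grep -q HEALTH_OK; do sleep 60; done
  done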


[1] some Marvel Comics heros got translated in Italian, so 'vedovanera'
  is 'black widow' and 'capitanamerica' clearly 'Captain America'.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or RICERCA SANITARIA)


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Janne Johansson
On Mon, Nov 26, 2018 at 09:39, Stefan Kooman wrote:

> > It is a slight mistake to report this in the same way as an error, even
> > if it looks to the cluster just as if it were in error and needed fixing.
> > This gives new ceph admins a sense of urgency or danger, whereas it should
> > be perfectly normal to add space to a cluster. Also, ceph could have
> > chosen to add a fourth copy of a repl=3 PG, fill it from the one going out
> > into the new empty copy, and somehow keep itself with 3 working replicas,
> > but it chooses to first discard one replica and then backfill into the
> > empty one, leading to this kind of "error" report.
>
> Thanks for the explanation. I agree with you that it would be safer to
> first backfill to the new PG instead of just assuming the new OSD will
> be fine and discarding a perfectly healthy PG. We do have max_size 3 in
> the CRUSH ruleset ... I wonder if Ceph would behave differently if we
> had max_size 4 ... to actually allow a fourth copy in the first
> place ...

I don't think the replication number is important. It's more of a design
choice, which PERHAPS is meant to let you move PGs to a new drive when the
cluster is near full: it clears out space a lot faster if you just kill off
one unneeded replica and start writing to the new drive, whereas keeping all
the old replicas until the data is 100% ok on the new replica would make new
space appear only after a large amount of data has moved, which for large
drives and large PGs might take a very long time.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Stefan Kooman
Quoting Janne Johansson (icepic...@gmail.com):
> Yes, when you add a drive (or 10), some PGs decide they should have one or
> more replicas on the new drives. A new, empty copy of the PG is created
> there, and _then_ that copy puts the PG into the "degraded" state: if it had
> 3 fine active+clean replicas before, it now has 2 active+clean and one that
> needs backfill to get into shape.
>
> It is a slight mistake to report this in the same way as an error, even if
> it looks to the cluster just as if it were in error and needed fixing. This
> gives new ceph admins a sense of urgency or danger, whereas it should be
> perfectly normal to add space to a cluster. Also, ceph could have chosen to
> add a fourth copy of a repl=3 PG, fill it from the one going out into the
> new empty copy, and somehow keep itself with 3 working replicas, but it
> chooses to first discard one replica and then backfill into the empty one,
> leading to this kind of "error" report.

Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new PG instead of just assuming the new OSD will
be fine and discarding a perfectly healthy PG. We do have max_size 3 in
the CRUSH ruleset ... I wonder if Ceph would behave differently if we
had max_size 4 ... to actually allow a fourth copy in the first
place ...
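
(For reference, the rule limits I'm referring to can be inspected by
decompiling the crush map; just a sketch, the file names are arbitrary:)

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  grep -E 'min_size|max_size' crushmap.txt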

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Janne Johansson
On Sun, Nov 25, 2018 at 22:10, Stefan Kooman wrote:
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added extra disks to the cluster,
> while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
> we did that, a couple of hundred objects would become degraded. During
> that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no.
>
> Does someone have an explanation for this?
>

Yes, when you add a drive (or 10), some PGs decide they should have one or
more replicas on the new drives. A new, empty copy of the PG is created
there, and _then_ that copy puts the PG into the "degraded" state: if it had
3 fine active+clean replicas before, it now has 2 active+clean and one that
needs backfill to get into shape.

It is a slight mistake to report this in the same way as an error, even if
it looks to the cluster just as if it were in error and needed fixing. This
gives new ceph admins a sense of urgency or danger, whereas it should be
perfectly normal to add space to a cluster. Also, ceph could have chosen to
add a fourth copy of a repl=3 PG, fill it from the one going out into the
new empty copy, and somehow keep itself with 3 working replicas, but it
chooses to first discard one replica and then backfill into the empty one,
leading to this kind of "error" report.
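
(If you want to see which PGs are involved while this is going on, something
like the following works; just a sketch:)

  # summary of degraded/misplaced objects and the PG states behind them
  ceph health detail
  # list PGs that are not active+clean
  ceph pg dump_stuck unclean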

-- 
May the most significant bit of your life be positive.


[ceph-users] Degraded objects after: ceph osd in $osd

2018-11-25 Thread Stefan Kooman
Hi List,

Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while the "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that, a couple of hundred objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.

I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.

Does someone have an explanation for this?
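
(For context, the expansion procedure we use is roughly the following; a
sketch, where the OSD id, weight and host bucket are just placeholders:)

  ceph osd set norebalance
  # add the new OSD to the crush map under its host bucket
  ceph osd crush add osd.20 1.81999 host=storage-node-5
  # mark it "in"; this is the point where the degraded counts show up
  ceph osd in osd.20
  ceph osd unset norebalance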

Gr. Stefan



-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl