I'm by no means an expert, but from what I understand you do need to stick to 
numbering from zero if you want things to work out in the long term.  Is there 
a chance that the cluster didn't finish bringing things back up to full 
replication before the OSDs were removed?  
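Before removing anything I'd confirm recovery has actually finished.  A quick 
check with the standard ceph CLI (run against your cluster) would be something 
like:

```shell
# Overall cluster state; wait until all PGs report active+clean
ceph -s

# List any PGs that are not yet clean (empty output means recovery is done)
ceph pg dump_stuck unclean
```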

If I were moving from 0,1 to 2,3 I'd bring both 2 and 3 up, set the weight of 
0 and 1 to zero, let all of the PGs get active+clean again, and only then 
remove 0 and 1.  For the swap you did, I might bring up 2 under rack az2, set 
1 to weight 0, stop 1 after everything is active+clean, remake what is now 3 
as 1, and bring it back in as 1 with full weight; then finally drop 2 to 
weight zero and remove it after active+clean.  I'd follow on with a similar 
shuffle for the now-inactive former osd 1, the current osd 0, and the future 
osd 0 (which was osd 2).  Clear as mud?
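Roughly, the first (simple) sequence above would look like this with the 
stock ceph tools.  This is just a sketch: the osd IDs match your cluster, but 
the CRUSH weight, rack, and host names here are made-up placeholders you'd 
swap for your own:

```shell
# Bring the new OSDs in at full weight (assumes they are already
# prepared and their daemons are running; locations are examples)
ceph osd crush set osd.2 1.0 root=default rack=az1 host=hostA
ceph osd crush set osd.3 1.0 root=default rack=az2 host=hostB

# Drain the old OSDs by dropping their CRUSH weight to zero;
# data migrates off while they are still up and serving
ceph osd crush reweight osd.0 0
ceph osd crush reweight osd.1 0

# Wait here until this prints nothing, i.e. everything is active+clean
ceph pg dump_stuck unclean

# Only then take the old OSDs out and remove them completely
ceph osd out osd.0
ceph osd out osd.1
# stop the osd daemons on their hosts, then:
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm osd.0
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm osd.1
```

The point of reweighting to zero first (rather than just removing) is that the 
cluster can copy data off the old OSDs while they still hold a valid replica, 
so you never drop below full replication during the move.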


On Jul 19, 2013, at 7:03 PM, Pawel Veselov <pawel.vese...@gmail.com> wrote:

> On Fri, Jul 19, 2013 at 3:54 PM, Mike Lowe <j.michael.l...@gmail.com> wrote:
> I'm not sure how to get you out of the situation you are in but what you have 
> in your crush map is osd 2 and osd 3 but ceph starts counting from 0 so I'm 
> guessing it's probably gotten confused.  Some history on your cluster might 
> give somebody an idea for a fix. 
> 
> We had osd.0 and osd.1 first, then we added osd.2. We then removed osd.1, 
> added osd.3 and removed osd.0.
> Do you think that adding back a new osd.0 and osd.1, and then removing osd.2 
> and osd.3 will solve that confusion? I'm a bit concerned that proper osd 
> numbering is required to maintain a healthy cluster...
>  
>  
> On Jul 19, 2013, at 6:44 PM, Pawel Veselov <pawel.vese...@gmail.com> wrote:
> 
>> Hi.
>> 
>> I'm trying to understand the reason behind some of my unclean PGs, after 
>> moving some OSDs around. Any help would be greatly appreciated. I'm sure we 
>> are missing something, but can't quite figure out what.
>> 
>> [root@ip-10-16-43-12 ec2-user]# ceph health detail
>> HEALTH_WARN 29 pgs degraded; 68 pgs stuck unclean; recovery 4071/217370 
>> degraded (1.873%)
>> pg 0.50 is stuck unclean since forever, current state active+degraded, last 
>> acting [2]
>> ...
>> pg 2.4b is stuck unclean for 836.989336, current state active+remapped, last 
>> acting [3,2]
>> ...
>> pg 0.6 is active+degraded, acting [3]
>> 
>> These are distinct examples of problems. There are 676 placement groups in 
>> total. Query shows pretty much the same on all of them.
>> 
>> crush map: http://pastebin.com/4Hkkgau6
>> There are some pg_temps (I don't quite understand what those are) that are 
>> mapped to non-existing OSDs. osdmap: http://pastebin.com/irbRNYJz
>> queries for all stuck placement groups: http://pastebin.com/kzYa6s2G
>> 
>> 
>> 
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> -- 
> With best of best regards
> Pawel S. Veselov

