I would suggest either adding 1 new disk on each of the 2 machines, or 
increasing osd_backfill_full_ratio to something like 0.90 or 0.92 from the 
default 0.85. 
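
If you go the ratio route, a rough sketch of how to bump it at runtime with 
injectargs (Jewel-era syntax, please double check on your version first): 

# ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.90' 

or persist it in ceph.conf under [osd] and restart the OSDs: 

[osd] 
    osd backfill full ratio = 0.90 

Either way this only buys headroom for backfill to finish; the near-full OSD 
will still need real free space eventually. 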

/Maged  

On 2017-08-28 08:01, hjcho616 wrote:

> Hello! 
> 
> I've been using ceph for a long time, mostly as network CephFS storage, even 
> before the Argonaut release!  It's been working very well for me.  Yes, I had 
> some power outages before and asked a few questions on this list, and they 
> were resolved happily!  Thank you all! 
> 
> Not sure why, but we've been having quite a few power outages lately.  Ceph 
> appeared to keep running OK through those... so I was pretty happy and 
> didn't think much of it... until yesterday, when I started to move some 
> videos to CephFS and ceph decided it was full, although df showed only 54% 
> utilization!  Then I looked and some of the OSDs were down (only 3 at that 
> point)! 
> 
> I am running a pretty simple ceph configuration... one machine named MDS1 
> running the MDS and mon, and two OSD machines named OSD1 and OSD2, each with 
> five 2TB HDDs and one SSD for journals. 
> 
> At the time I was running jewel 10.2.2.  I looked at some of the downed OSDs' 
> log files and googled the errors... they appeared to be tied to version 10.2.2, 
> so I upgraded everything to 10.2.9.  Well, that didn't solve my problems.. =P  
> While I was looking into this, there was another power outage!  D'oh!  I may 
> need to invest in a UPS or something... Until this happened, all of the down 
> OSDs were on OSD2.  But this time OSD1 took a hit!  It couldn't boot because 
> osd.0's disk was damaged... I tried xfs_repair -L /dev/sdb1 as the command 
> line suggested.. I was able to mount it again, phew, rebooted... and then 
> /dev/sdb1 was no longer accessible!  Noooo!!! 
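> 
> In case it helps, here are the non-destructive checks I can still run against 
> that disk before touching it further (device names are just the ones from 
> above): 
> 
> # dmesg | tail              (any recent I/O errors from the kernel?) 
> # smartctl -a /dev/sdb      (drive health; needs smartmontools) 
> # xfs_repair -n /dev/sdb1   (dry run, reports problems without fixing anything) 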
> 
> So this is what I have today!  I am a bit concerned, as half of the OSDs are 
> down, and osd.0 doesn't look good at all... 
> # ceph osd tree 
> ID WEIGHT   TYPE NAME     UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 16.24478 root default 
> -2  8.12239     host OSD1 
> 1  1.95250         osd.1      up  1.00000          1.00000 
> 0  1.95250         osd.0    down        0          1.00000 
> 7  0.31239         osd.7      up  1.00000          1.00000 
> 6  1.95250         osd.6      up  1.00000          1.00000 
> 2  1.95250         osd.2      up  1.00000          1.00000 
> -3  8.12239     host OSD2 
> 3  1.95250         osd.3    down        0          1.00000 
> 4  1.95250         osd.4    down        0          1.00000 
> 5  1.95250         osd.5    down        0          1.00000 
> 8  1.95250         osd.8    down        0          1.00000 
> 9  0.31239         osd.9      up  1.00000          1.00000 
> 
> This looked a lot better before that last extra power outage... =(  I can't 
> mount it anymore! 
> # ceph health 
> HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 44 pgs 
> backfill_toofull; 80 pgs backfill_wait; 122 pgs degraded; 6 pgs down; 8 pgs 
> inconsistent; 6 pgs peering; 2 pgs recovering; 18 pgs recovery_wait; 16 pgs 
> stale; 122 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 159 
> pgs stuck unclean; 102 pgs stuck undersized; 102 pgs undersized; 1 requests 
> are blocked > 32 sec; recovery 1803466/4503980 objects degraded (40.042%); 
> recovery 692976/4503980 objects misplaced (15.386%); recovery 147/2251990 
> unfound (0.007%); 1 near full osd(s); 54 scrub errors; mds cluster is 
> degraded; no legacy OSD present but 'sortbitwise' flag is not set 
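> 
> I can also post the output of ceph health detail or a stuck-PG dump if a 
> per-PG breakdown would help, e.g.: 
> 
> # ceph health detail 
> # ceph pg dump_stuck unclean 
> # ceph pg dump_stuck stale 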
> 
> Each of the OSDs is showing a different failure signature.  
> 
> I've uploaded OSD logs with debug osd = 20, debug filestore = 20, and debug 
> ms = 20 (the ceph.conf snippet is shown after the links).  You can find them 
> at the links below.  Let me know if there is a preferred way to share these! 
> https://drive.google.com/open?id=0By7YztAJNGUWQXItNzVMR281Snc 
> (ceph-osd.3.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWYmJBb3RvLVdSQWc 
> (ceph-osd.4.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWaXhRMlFOajN6M1k 
> (ceph-osd.5.log) 
> https://drive.google.com/open?id=0By7YztAJNGUWdm9BWFM5a3ExOFE 
> (ceph-osd.8.log) 
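> 
> For reference, the debug levels were bumped in ceph.conf roughly like this 
> (under the [osd] section) before restarting the OSDs: 
> 
> [osd] 
>     debug osd = 20 
>     debug filestore = 20 
>     debug ms = 20 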
> 
> So how does this look?  Can this be fixed? =)  If so, please let me know.  I 
> used to take backups, but since the data grew so big I wasn't able to anymore... 
> and I would like to get most of it back if I can.  Please let me know if you 
> need more info! 
> 
> Thank you! 
> 
> Regards, 
> Hong 
> 
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
