>> After looking at the OSDs, we saw a very high load on them (a load of
>> 450), and some were down.
>> "ceph -s" showed that we had down PGs, peering+down PGs, remapped PGs,
>> etc.
>
> Could you tell us a bit more? When the load was 450, was this mainly due
> to disk I/O wait? Did the machines start to swap?

All disks were 100% busy, and the servers were swapping.

> Could it be that the swapping was actually causing the machines to die
> even more? Although an OSD can run with 100 MB of memory, during
> recovery it can grow quite fast.

Is there a way to estimate the needed memory?
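
A rough rule of thumb (an assumption, not a figure from this exchange) is
on the order of 1 GB of RAM per OSD daemon, with noticeably more needed
during recovery; usage also grows with the number of PGs each OSD carries.
The most reliable estimate is simply to watch the daemons' resident size
while a recovery is running, e.g.:

  # resident set size (KiB) of every running ceph-osd daemon
  ps -C ceph-osd -o pid,rss,args

Sizing RAM (and keeping the hosts out of swap) for the peak seen there,
plus some headroom, is safer than any static formula.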

> So basically the cluster was under load because we were recovering...
> but because it was under load, recovery could not complete.
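
The usual way out of that loop is to throttle recovery so that client I/O
(and memory) can keep up. Exact option names and sensible values depend on
the Ceph version, so treat the following only as a sketch:

  # limit concurrent backfill/recovery work per OSD and lower its priority
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

The same options can be made persistent in the [osd] section of ceph.conf
(e.g. "osd max backfills = 1").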

> FileStore aborts indicate that it couldn't get the work done quickly
> enough. I've seen this with btrfs, but you say you are using XFS.
>
> You say you are storing small files. What exactly is "small"?

On average, 120 KB.

--
Yann ROBIN
www.YouScribe.com