Thanks Andreas. We hit this problem at 79% full on each OST. After deleting files we got the OSTs down to 77% full and the problem subsided. I haven't found any documentation or even rumors about Lustre performance on a nearly full filesystem, but for our workload we're setting a 75% hard limit on used space to avoid these issues. The biggest surprise for me was not that it slowed down (all filesystems get slower as they approach 100% full) but how suddenly the wall was hit.
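For what it's worth, enforcing that kind of limit is easy to script against `lfs df` output. Below is a minimal sketch; the 75% threshold is our site policy (not a Lustre default), and a canned sample of `lfs df` output is embedded so the snippet is self-contained — on a real system you would pipe live `lfs df` output in instead:

```shell
#!/bin/sh
# Warn about any OST above a usage threshold (75% is our site policy).
# A sample of `lfs df` output is embedded below so this runs anywhere;
# in production, replace lfs_df_sample with the real `lfs df` command.
THRESHOLD=75

lfs_df_sample() {
cat <<'EOF'
UUID                 1K-blocks      Used Available Use% Mounted on
lustre-MDT0000_UUID    1000000    100000    900000  10% /lustre[MDT:0]
lustre-OST0000_UUID   10000000   7900000   2100000  79% /lustre[OST:0]
lustre-OST0001_UUID   10000000   7400000   2600000  74% /lustre[OST:1]
EOF
}

lfs_df_sample | awk -v t="$THRESHOLD" '
    /OST/ {
        pct = $5; sub(/%/, "", pct)   # 5th column is "Use%", e.g. "79%"
        if (pct + 0 > t)
            printf "WARNING: %s is %s%% full (limit %s%%)\n", $1, pct, t
    }'
```

Run from cron, a non-empty output from the awk stage is the signal to start cleaning up before the wall is hit.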

Joe, I can't definitively answer your question, but I can tell you that on the LUNs one I/O thread would dominate the LUN for 100s; no other read/write requests would get through. This was with the deadline scheduler. We tried cfq as well and saw the same behavior. That indicates to me that the thread was *active* for 100 seconds.
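For anyone wanting to verify which scheduler a given LUN is using: the kernel lists the available schedulers in the block device's sysfs `queue/scheduler` file, with the active one in brackets. A small sketch of parsing that format (the device name `sdb` and the sample string here are illustrative assumptions, not from our setup):

```shell
#!/bin/sh
# On a real system the active scheduler is the bracketed entry in, e.g.:
#   cat /sys/block/sdb/queue/scheduler   ->   noop [deadline] cfq
# A sample string stands in below so this snippet is self-contained.
sample="noop [deadline] cfq"
active=$(printf '%s\n' "$sample" | sed -n 's/.*\[\(.*\)\].*/\1/p')
echo "active scheduler: $active"
# Switching is a root-only write, e.g.:
#   echo cfq > /sys/block/sdb/queue/scheduler
```

That last write is how we flipped between deadline and cfq while reproducing the stall.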

Daniel

Andreas Dilger wrote:
> On Aug 17, 2007  07:28 -0600, Daniel Leaberry wrote:
>> I have an interesting problem. I've made no changes to the IB DDN
>> storage, yet I'm finding OSTs crashing left and right. The thread
>> watchdog gets triggered.
>
> Note that a watchdog thread stack dump is NOT a crash, but rather a
> debugging mechanism so we can see where the thread is stuck for such
> a long time.  It should be able to continue working even after this
> happens.
>
>> Is there any way to tune the extent searching code? Does my analysis
>> seem likely? Is this fixed in 1.6.1 such that I should upgrade
>> immediately?
>
> You could increase the watchdog thread timeout (this is currently a
> compile-time constant), but that won't remove the fact that it is
> taking 100s to find free space.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.


_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
