Thanks Andreas. We hit this problem at 79% full on each OST. After
deleting, we got the OSTs down to 77% full and the problem subsided. I
haven't found any information or rumors regarding full-filesystem Lustre
performance, but I know for our workload we're setting a 75% hard limit
on used space to avoid these issues. The biggest surprise for me was not
that it slowed down (all filesystems get slower as they approach 100%
full) but how suddenly the wall was hit.
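A limit like that is easy to watch for by parsing `lfs df` output. This is a minimal sketch, assuming the classic column layout (UUID, 1K-blocks, Used, Available, Use%, Mounted on); the `osts_over_limit` helper name and the sample device names are mine, not from this thread:

```python
# Hypothetical helper: scan `lfs df` output and flag OSTs whose Use%
# exceeds a threshold (e.g. the 75% hard limit mentioned above).
def osts_over_limit(lfs_df_output, limit_pct=75):
    over = []
    for line in lfs_df_output.splitlines():
        fields = line.split()
        # OST rows have a UUID like "fsname-OST0000_UUID" in column 1
        # and a percentage like "79%" in column 5.
        if len(fields) >= 5 and "OST" in fields[0]:
            pct = int(fields[4].rstrip("%"))
            if pct > limit_pct:
                over.append((fields[0], pct))
    return over
```

Feeding this the output of `lfs df` from cron and alerting on a non-empty result is one way to enforce the limit before the cliff is reached.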
Joe, I can't definitively answer your question, but I can tell you that
what I saw on the LUNs was one I/O thread dominating the LUN for 100s;
no other read/write requests would get through. This was with the
deadline scheduler. We tried cfq as well, and the same behavior was
exhibited. That indicates to me that the thread was *active* for 100
seconds.
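For anyone wanting to reproduce this comparison, the active elevator for each block device can be read from sysfs, where the kernel marks the current choice in brackets (e.g. "noop [deadline] cfq"). A small sketch, assuming a Linux host with the usual `/sys/block/*/queue/scheduler` layout; the function names are mine:

```python
import glob
import re


def parse_active(scheduler_line):
    # The kernel brackets the active elevator, e.g. "noop [deadline] cfq".
    m = re.search(r"\[([\w-]+)\]", scheduler_line)
    return m.group(1) if m else scheduler_line.strip()


def active_schedulers():
    # One queue/scheduler file per block device under sysfs.
    return {
        path.split("/")[3]: parse_active(open(path).read())
        for path in glob.glob("/sys/block/*/queue/scheduler")
    }
```

Switching schedulers for a test is then just writing the name back, e.g. `echo cfq > /sys/block/sdb/queue/scheduler` as root.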
Daniel
Andreas Dilger wrote:
> On Aug 17, 2007 07:28 -0600, Daniel Leaberry wrote:
>> I have an interesting problem. I've made no changes to the IB DDN
>> storage, yet I'm finding OSTs crashing left and right. The thread
>> watchdog gets triggered,
>
> Note that a watchdog thread stack dump is NOT a crash, but rather a
> debugging mechanism so we can see where the thread is stuck for such
> a long time. It should be able to continue working even after this
> happens.
>
>> Is there any way to tune the extent searching code? Does my analysis
>> seem likely? Is this fixed in 1.6.1 such that I should upgrade
>> immediately?
>
> You could increase the watchdog thread timeout (this is currently a
> compile-time constant), but that won't remove the fact that it is
> taking 100s to find free space.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss