And another quick question - would this be more likely to be the journal on the MDS, or the OSS servers?
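
For anyone wanting to reproduce the measurement described in the quoted text below without strace, a minimal sketch along these lines times each open, write and close call separately and reports the outliers. The function names, the probe file name and the 0.5 s threshold are illustrative assumptions, not anything taken from our setup. It only measures wall-clock time on the client, so it can show when a stall happens but not whether the journal is the cause.

```python
import os
import time


def time_write_cycle(path, payload=b"x" * 4096):
    """Return wall-clock seconds spent in open, write and close for one file."""
    timings = {}
    t0 = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    timings["open"] = time.perf_counter() - t0
    try:
        t0 = time.perf_counter()
        os.write(fd, payload)
        timings["write"] = time.perf_counter() - t0
    finally:
        t0 = time.perf_counter()
        os.close(fd)
        timings["close"] = time.perf_counter() - t0
    return timings


def find_slow_calls(directory, iterations=1000, threshold=0.5):
    """Repeat the cycle and collect (iteration, call, seconds) for slow calls.

    `threshold` is in seconds; 0.5 is an arbitrary illustrative value.
    """
    path = os.path.join(directory, "latency_probe.tmp")  # hypothetical name
    slow = []
    try:
        for i in range(iterations):
            for call, seconds in time_write_cycle(path).items():
                if seconds >= threshold:
                    slow.append((i, call, seconds))
    finally:
        # Remove the probe file so repeated runs start clean.
        if os.path.exists(path):
            os.remove(path)
    return slow
```

Running find_slow_calls() against a directory on the Lustre mount from two or three clients at once, and comparing the timestamps of the slow calls, might show whether the stalls line up across clients.
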
On 02/09/10 17:38, Tina Friedrich wrote:
> Hello,
>
> On 02/09/10 17:28, Tina Friedrich wrote:
>> Hi Andreas,
>>
>> thanks for your answer.
>>
>>>> Causing most grief at the moment is that we sometimes see delays
>>>> writing files. From the writing client's end, it simply looks as if I/O
>>>> stops for a while (we've seen 'pauses' of anything up to 10 seconds).
>>>> This appears to be independent of what client does the writing, and
>>>> software doing the writing. We investigated this a bit using strace and
>>>> dd; the 'slow' calls appear to always be either open, write, or close
>>>> calls. Usually, these take well below 0.001s; in around 0.5% or 1% of
>>>> cases, they take up to multiple seconds. It does not seem to be
>>>> associated with any specific OST, OSS, client or anything; there is
>>>> nothing in any log files or any exceptional load on MDS or OSS or
>>>> any of the clients.
>>>
>>> This is most likely associated with delays in committing the journal
>>> on the MDT or OST, which can happen if the journal fills completely.
>>> Having larger journals can help, if you have enough RAM to keep them
>>> all in memory and not overflow. Alternately, if you make the journals
>>> small it will limit the latency, at the cost of reducing overall
>>> performance. A third alternative might be to use SSDs for the journal
>>> devices.
>>
>> Just to double check - that would be the file system journal, I assume?
>>
>> That makes a lot of sense; is there a way to verify that this is the
>> issue we're having?
>>
>> Journal size appears to be 400M - if we were to try increasing it, how
>> would we determine what to best set it to?
>
> That was meant to be 'if we were to try increasing or decreasing it' -
> sounds to us as if decreasing might be the better option (as in, if this
> is the journal flushing, having less journal to flush would probably be
> better - or is that the wrong idea?)
>
>
>>>> The other issue is that we frequently see delays when trying to read a
>>>> file. It sometimes takes more than 60s for a file to be visible on a
>>>> machine after the initial write on a different machine has completed
>>>> (both machines being Lustre clients). Again, there is nothing in the
>>>> logs, nor exceptional load on any of the machines.
>>>
>>> This is probably just a manifestation of the first problem. The issue
>>> likely isn't in the read, but a delay in flushing the data from the
>>> cache of the writing client. There were fixes made in 1.8 to increase
>>> the IO priority for clients writing data under a lock that other
>>> clients are waiting on.
>>
>> We kind of suspected them to be related, yes.
>>
>> Tina
>>
>
>

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
