Hi Andreas, thanks for your answer.

>> Causing most grief at the moment is that we sometimes see delays
>> writing files. From the writing client's end, it simply looks as if I/O
>> stops for a while (we've seen 'pauses' of anything up to 10 seconds).
>> This appears to be independent of what client does the writing, and
>> software doing the writing. We investigated this a bit using strace and
>> dd; the 'slow' calls appear to always be either open, write, or close
>> calls. Usually, these take well below 0.001s; in around 0.5% or 1% of
>> cases, they take up to multiple seconds. It does not seem to be
>> associated with any specific OST, OSS, client or anything; there is
>> nothing in any log files or any exceptional load on MDS or OSS or any of
>> the clients.
>
> This is most likely associated with delays in committing the journal on the
> MDT or OST, which can happen if the journal fills completely. Having larger
> journals can help, if you have enough RAM to keep them all in memory and not
> overflow. Alternately, if you make the journals small it will limit the
> latency, at the cost of reducing overall performance. A third alternative
> might be to use SSDs for the journal devices.

Just to double check - that would be the file system journal, I assume?
That makes a lot of sense; is there a way to verify that this is the issue
we're having? Journal size appears to be 400M - if we were to try
increasing it, how would we determine what to best set it to?

>> The other issue is that we frequently see delays when trying to read a
>> file. It sometimes takes more than 60s for a file to be visible on a
>> machine after the initial write on a different machine has completed
>> (both machines being Lustre clients). Again, there is nothing in the
>> logs, nor exceptional load on any of the machines.
>
> This is probably just a manifestation of the first problem. The issue likely
> isn't in the read, but a delay in flushing the data from the cache of the
> writing client.
> There were fixes made in 1.8 to increase the IO priority for
> clients writing data under a lock that other clients are waiting on.

We kind of suspected them to be related, yes.

Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
