Hi Andreas,

thanks for your answer.

>> Causing most grief at the moment is that we sometimes see delays
>> writing files. From the writing clients end, it simply looks as if I/O
>> stops for a while (we've seen 'pauses' of anything up to 10 seconds).
>> This appears to be independent of what client does the writing, and
>> software doing the writing. We investigated this a bit using strace and
>> dd; the 'slow' calls appear to always be either open, write, or close
>> calls. Usually, these take well below 0.001s; in around 0.5% or 1% of
>> cases, they take up to multiple seconds. It does not seem to be
>> associated with any specific OST, OSS, client or anything; there is
>> nothing in any log files or any exceptional load on MDS or OSS or any of
>> the clients.
>
> This is most likely associated with delays in committing the journal on the 
> MDT or OST, which can happen if the journal fills completely.  Having larger 
> journals can help, if you have enough RAM to keep them all in memory and not 
> overflow.  Alternately, if you make the journals small it will limit the 
> latency, at the cost of reducing overall performance.  A third alternative 
> might be to use SSDs for the journal devices.

Just to double check - that would be the file system journal, I assume?

That makes a lot of sense; is there a way to verify that this is the 
issue we're having?
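
In case it helps to correlate the stalls with journal activity on the servers, here is a rough sketch of the kind of probe that could log timestamps for slow open/write/close calls on a client. It is only an illustration, not what we actually ran (that was strace and dd); the function names, the probe file name and the 1 second threshold are all made up for this example.

```python
import os
import sys
import tempfile
import time


def time_io_calls(directory, iterations=100, block_size=1 << 20):
    """Time open, write and close separately.

    Returns a list of (wall_clock_timestamp, operation, seconds) tuples,
    three per iteration. The probe file name is a placeholder.
    """
    payload = b"\0" * block_size
    path = os.path.join(directory, "latency_probe.tmp")
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        t1 = time.perf_counter()
        os.write(fd, payload)  # short writes are not handled in this sketch
        t2 = time.perf_counter()
        os.close(fd)
        t3 = time.perf_counter()
        now = time.time()  # wall-clock time, for matching against server logs
        samples.append((now, "open", t1 - t0))
        samples.append((now, "write", t2 - t1))
        samples.append((now, "close", t3 - t2))
    os.unlink(path)
    return samples


def slow_calls(samples, threshold=1.0):
    """Return only the samples that took at least `threshold` seconds."""
    return [s for s in samples if s[2] >= threshold]


if __name__ == "__main__":
    # Pass a directory on the Lustre mount; defaults to the system temp dir.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    results = time_io_calls(target)
    slow = slow_calls(results)
    print("%d of %d calls were slow" % (len(slow), len(results)))
    for stamp, op, seconds in slow:
        print("%s %-5s %.3fs" % (time.ctime(stamp), op, seconds))
```

The wall-clock timestamps are there so that slow calls can be lined up against whatever the MDS and OSS were doing at that moment.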

Journal size appears to be 400M - if we were to try increasing it, how 
would we determine what best to set it to?

>> The other issue is that we frequently see delays when trying to read a
>> file. It sometimes takes more than 60s for a file to be visible on a
>> machine after the initial write on a different machine has completed
>> (both machines being Lustre clients). Again, there is nothing in the
>> logs, nor exceptional load on any of the machines.
>
> This is probably just a manifestation of the first problem.  The issue likely 
> isn't in the read, but a delay in flushing the data from the cache of the 
> writing client.  There were fixes made in 1.8 to increase the IO priority for 
> clients writing data under a lock that other clients are waiting on.

We kind of suspected them to be related, yes.
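
To put numbers on the visibility delay, something along these lines could be run on the reading client while the other client writes. Again this is just a sketch under my own assumptions: the function name, timeout and polling interval are invented for illustration, and polling itself may influence caching behaviour on the client.

```python
import os
import time


def wait_until_visible(path, timeout=120.0, interval=0.5):
    """Poll until `path` exists on this client.

    Returns the number of seconds waited, or None if `timeout`
    seconds pass without the file showing up.
    """
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Starting this just as the writer finishes gives a rough figure for how long the file takes to appear, to within one polling interval.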

Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442