On 2010-08-12, at 14:52, burlen wrote:
> Andreas Dilger wrote:
>> On 2010-08-11, at 23:36, burlen wrote:
>>> I am interested in how write()s are buffered in Lustre on the client,
>>> server, and network in between. Specifically I'd like to understand what
>>> happens during writes when a large number of clients are making large writes
>>> to all of the OSTs on an OSS, and the buffers are inadequate to handle the
>>> outgoing/incoming data.
>>
>> Lustre doesn't buffer dirty pages on the OSS, only on the client. The
>> clients are granted a "reserve" of space in each OST filesystem to ensure
>> there is enough free space for any cached writes that they do.
>
> If I understand the way write() typically works on Linux, during a large
> write(), too large to be buffered in the page cache, once the page cache is
> full dirty pages would be flushed to disk. The data transfer would block at
> that point until the dirty pages are written to disk, whence the data
> transfer would resume into the resulting free pages. But in Lustre I assume
> that once the client's page cache is full, the dirty pages are sent over the
> network to the OSS where they are written to disk.

In fact, Lustre aggressively flushes dirty data from the client as soon as it
can create a 1MB RPC. Otherwise, the VM will cache dirty data for up to 30s,
and if you work out that cache for all clients and the aggregate network
bandwidth, it would be a huge waste of bandwidth to leave it sitting idle.

> In that case, does the network layer effectively act like a buffer? So that
> the client may resume the data transfer into the page cache before the former
> set of dirty pages actually hits the disk? Or does the data transfer block
> until dirty pages actually reach the disk?

Lustre also limits the dirty page cache per OST far below the VM limits, for
similar reasons as above. Clients can have 32MB (default) dirty data per OST,
and up to 8 RPCs (default) in flight per OST at one time.
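
To make those limits concrete, here is a rough Python sketch of the per-client
upper bounds they imply. It only does the arithmetic on the defaults quoted
above (32MB dirty per OST, 8 RPCs in flight per OST, 1MB per RPC); the function
name, the parameter names, and the example of 4 OSTs are assumptions made for
illustration, not anything read from a real system.

```python
MIB = 1024 * 1024

def client_write_limits(num_osts, max_dirty_mb=32, max_rpcs_in_flight=8, rpc_size_mb=1):
    """Return (max dirty bytes, max in-flight bytes) for one client.

    Defaults are the figures quoted in this thread; num_osts is the number
    of OSTs the client is writing to (a hypothetical input).
    """
    # Dirty data the client may cache, summed over all OSTs it writes to.
    dirty_bytes = num_osts * max_dirty_mb * MIB
    # Data that can be on the wire at once, summed over all OSTs.
    in_flight_bytes = num_osts * max_rpcs_in_flight * rpc_size_mb * MIB
    return dirty_bytes, in_flight_bytes

dirty, in_flight = client_write_limits(num_osts=4)
print(dirty // MIB, in_flight // MIB)  # -> 128 32
```

So a client striping over 4 OSTs would hold at most 128MB of dirty data, of
which at most 32MB is in flight at any moment, under these assumed defaults.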

The network does NOT act as a buffer, since the client must keep a copy of all
{meta}data in memory until it is ACK'd by the server (it is not fire & forget)
so that the client can replay this RPC in case of a server crash. The server
will send an ACK (RPC reply) when it has processed the RPC along with a
transaction number for that RPC, and asynchronously notifies the client that
RPCs <= "last_committed_transno" have been committed to disk and they can
discard their copy of the RPC.
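
The bookkeeping described above can be sketched roughly as follows. This is a
toy model in Python, not Lustre code: the class name, the method names, and the
use of an "xid" to match a reply to its request are all assumptions made for
illustration. It only shows the rule that a copy is held until the server
reports a last_committed_transno at or beyond that RPC's transaction number.

```python
class ReplayQueue:
    """Toy model of a client holding RPC copies until they are committed."""

    def __init__(self):
        self.unacked = {}      # xid -> payload: sent, no reply yet
        self.uncommitted = {}  # transno -> payload: replied, not yet on disk

    def send(self, xid, payload):
        # Keep a copy of everything sent; the network is not a buffer.
        self.unacked[xid] = payload

    def on_reply(self, xid, transno):
        # The reply carries the transaction number the server assigned.
        self.uncommitted[transno] = self.unacked.pop(xid)

    def on_last_committed(self, last_committed_transno):
        # Everything <= last_committed_transno is on disk; drop those copies.
        for transno in [t for t in self.uncommitted if t <= last_committed_transno]:
            del self.uncommitted[transno]

    def to_replay(self):
        # After a server crash, replay the survivors in transaction order.
        return [self.uncommitted[t] for t in sorted(self.uncommitted)]


q = ReplayQueue()
q.send(1, b"write A")
q.send(2, b"write B")
q.on_reply(1, transno=10)
q.on_reply(2, transno=11)
q.on_last_committed(10)
print(q.to_replay())  # -> [b'write B']
```

In this model the memory for "write A" is only released once the commit
notification arrives, which is why the dirty-data limits above bound how far a
client can run ahead of the disk.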

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.