Re: [ceph-users] CephFS overwrite/truncate performance hit

David Turner Tue, 19 Feb 2019 10:37:35 -0800

If your client needs to handle writes like that on its own, RBD might be
the better fit.  You lose the ability to share the data between multiple
clients as easily as with CephFS, but you would gain the features you're
looking for.


On Tue, Feb 12, 2019 at 1:43 PM Gregory Farnum <[email protected]> wrote:

>
>
> On Tue, Feb 12, 2019 at 5:10 AM Hector Martin <[email protected]>
> wrote:
>
>> On 12/02/2019 06:01, Gregory Farnum wrote:
>> > Right. Truncates and renames require sending messages to the MDS, and
>> > the MDS committing to RADOS (aka its disk) the change in status, before
>> > they can be completed. Creating new files will generally use a
>> > preallocated inode so it's just a network round-trip to the MDS.
>>
>> I see. Is there a fundamental reason why these kinds of metadata
>> operations cannot be buffered in the client, or is this just the current
>> way they're implemented?
>>
>
> It's pretty fundamental, at least to the consistency guarantees we hold
> ourselves to. What happens if the client has buffered an update like that,
> performs writes to the data with those updates in mind, and then fails
> before they're flushed to the MDS? A local FS doesn't need to worry about a
> different node having a different lifetime, and can control the write order
> of its metadata and data updates on belated flush a lot more precisely than
> we can. :(
> -Greg
>
>
>>
>> e.g. on a local FS these kinds of writes can just stick around in the
>> block cache unflushed. And of course for CephFS I assume file extension
>> also requires updating the file size in the MDS, yet that doesn't block
>> while truncation does.
>>
>> > Going back to your first email, if you do an overwrite that is confined
>> > to a single stripe unit in RADOS (by default, the stripe unit equals the
>> > object size, which is 4 MB and aligned from offset 0), it is guaranteed
>> > to be atomic. CephFS can only tear writes across objects, and only if
>> > your client fails before the data has been flushed.
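
To make the single-object condition quoted above concrete, here is a minimal
Python sketch (not from the thread). It assumes the default layout Greg
describes, where the stripe unit equals the 4 MB object size and objects are
aligned from offset 0; a file with a custom layout would need different
arithmetic. The names spans_single_object and OBJECT_SIZE are hypothetical.

```python
# Assumption: default CephFS layout, stripe unit == object size == 4 MiB,
# objects aligned from file offset 0. Custom layouts are not handled here.
OBJECT_SIZE = 4 * 1024 * 1024


def spans_single_object(offset: int, length: int,
                        object_size: int = OBJECT_SIZE) -> bool:
    """Return True if the byte range [offset, offset + length) lies
    entirely inside one object-sized, zero-aligned chunk of the file."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first_object = offset // object_size
    last_object = (offset + length - 1) // object_size
    return first_object == last_object
```

A 128-byte write at offset 0 passes this check, while the same write starting
64 bytes before the 4 MB boundary does not, since it would touch two objects.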
>>
>> Great! I've implemented this in a backwards-compatible way, so that gets
>> rid of this bottleneck. It's just a 128-byte flag file (formerly
>> variable length, now I just pad it to the full 128 bytes and rewrite it
>> in-place). This is good information to know for optimizing things :-)
>>
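
For anyone wanting to try the same trick, a padded in-place rewrite like the
one Hector describes above might look roughly like the sketch below. This is
not his code: the function names, the NUL pad byte, and the fsync call are all
assumptions made for illustration. The point is only that the file is opened
without O_TRUNC and always rewritten at its full 128-byte size, so no truncate
has to go through the MDS.

```python
import os

FLAG_SIZE = 128  # fixed size of the flag file, as mentioned in the thread


def write_flag(path: str, payload: bytes, pad: bytes = b"\0") -> None:
    """Overwrite the flag file in place, padded to FLAG_SIZE bytes.

    Assumption: NUL is an acceptable pad byte for the reader.
    """
    if len(payload) > FLAG_SIZE:
        raise ValueError("payload does not fit in the flag file")
    record = payload.ljust(FLAG_SIZE, pad)
    # Deliberately no O_TRUNC: the file keeps its size across rewrites.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        written = os.pwrite(fd, record, 0)
        if written != FLAG_SIZE:
            raise OSError("short write to flag file")
        os.fsync(fd)  # assumption: the caller wants the data flushed now
    finally:
        os.close(fd)


def read_flag(path: str, pad: bytes = b"\0") -> bytes:
    """Read the flag file and strip the trailing padding."""
    with open(path, "rb") as f:
        return f.read(FLAG_SIZE).rstrip(pad)
```

Note that a payload which itself ends in the pad byte would lose those
trailing bytes on read, so the pad byte should be one the payload never uses.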
>> --
>> Hector Martin ([email protected])
>> Public Key: https://mrcn.st/pub
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
