On Thu, Feb 7, 2019 at 3:31 AM Hector Martin <hec...@marcansoft.com> wrote:

> On 07/02/2019 19:47, Marc Roos wrote:
> >
> > Is this difference not related to caching? And you filling up some
> > cache/queue at some point? If you do a sync after each write, do you
> > still have the same results?
>
> No, the slow operations are slow from the very beginning. It's not about
> filling a buffer/cache somewhere. I'm guessing the slow operations
> trigger several synchronous writes to the underlying OSDs, while the
> fast ones don't. But I'd like to know more about why exactly there is
> this significant performance hit to truncation operations vs. normal
> writes.
>
> To give some more numbers:
>
>   echo test | dd of=b conv=notrunc
>
> This completes extremely quickly (microseconds). The data obviously
> remains in the client cache at this point. This is what I want.
>
>   echo test | dd of=b conv=notrunc,fdatasync
>
> This runs quickly until the fdatasync(), then that takes ~12ms, which is
> about what I'd expect for a synchronous write to the underlying HDDs. Or
> maybe that's two writes?

It's at least one write, and may be two overlapping ones if you've
extended the file and its new size needs to be persisted (via the MDS
journal).


>   echo test | dd of=b
>
> This takes ~10ms in the best case for the open() call (sometimes 30-40ms
> or even more), and 6-8ms for the write() call.
>
>   echo test | dd of=b conv=fdatasync
>
> This takes ~10ms for the open() call, ~8ms for the write() call, and
> ~18ms for the fdatasync() call.
>
> So it seems like truncating/recreating an existing file introduces
> several disk I/Os' worth of latency and forces synchronous behavior
> somewhere down the stack, while merely creating a new file or writing to
> an existing one without truncation does not.

Right. Truncates and renames require sending a message to the MDS, and the
MDS has to commit the change in status to RADOS (aka its disk) before the
operation can complete. Creating a new file will generally use a
preallocated inode, so it costs just a network round-trip to the MDS.
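
If you want to reproduce the per-call timings from a script, something
like the following might do. It is only a sketch: time_write() and its
parameters are names made up for this example, it assumes Python 3 on
Linux (os.fdatasync is not available everywhere), and the path has to be
on the CephFS mount for the numbers to mean anything.

```python
import os
import time


def time_write(path, data=b"test\n", truncate=False, sync=False):
    """Time open(), write() and optionally fdatasync(), in milliseconds.

    Hypothetical helper for illustration; not part of Ceph or dd.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        # dd truncates by default; conv=notrunc leaves this flag out.
        flags |= os.O_TRUNC

    timings = {}

    start = time.perf_counter()
    fd = os.open(path, flags, 0o644)
    timings["open"] = (time.perf_counter() - start) * 1000.0

    try:
        start = time.perf_counter()
        os.write(fd, data)
        timings["write"] = (time.perf_counter() - start) * 1000.0

        if sync:
            # Corresponds to dd's conv=fdatasync.
            start = time.perf_counter()
            os.fdatasync(fd)
            timings["fdatasync"] = (time.perf_counter() - start) * 1000.0
    finally:
        os.close(fd)

    return timings
```

Calling time_write("b", truncate=True, sync=True) roughly corresponds to
"dd of=b conv=fdatasync", and time_write("b") to "dd of=b conv=notrunc".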

Going back to your first email: an overwrite that is confined to a single
stripe unit in RADOS is guaranteed to be atomic. By default, a stripe unit
is the size of your objects, which is 4MB, and it's aligned from offset 0.
CephFS can only tear writes across objects, and only if your client fails
before the data has been flushed.
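
To check whether a given overwrite stays inside one object, the
arithmetic is roughly the following. This is a sketch that assumes the
default layout described above (stripe unit equal to the 4MB object size,
aligned from offset 0); the function names are invented for the example,
and a file with a custom layout would need different numbers.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed default: 4MB objects, aligned from 0


def objects_touched(offset, length, object_size=OBJECT_SIZE):
    """Return the range of object indices covered by a write."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return range(first, last + 1)


def is_single_object_write(offset, length, object_size=OBJECT_SIZE):
    """True if the write [offset, offset + length) stays within one object."""
    return len(objects_touched(offset, length, object_size)) == 1
```

For example, a 5-byte overwrite at offset 0 stays within object 0, while
the same 5 bytes starting 2 bytes before the 4MB boundary span objects 0
and 1.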

ceph-users mailing list
