On Fri, Dec 14, 2018 at 6:44 PM Bryan Henderson <bry...@giraffe-data.com> wrote:

> > Going back through the logs though it looks like the main reason we do a
> > 4MiB block size is so that we have a chance of reporting actual cluster
> > sizes to 32-bit systems,
>
> I believe you're talking about a different block size (there are so many of
> them).
>
> The 'statvfs' system call (the essence of a 'df' command) can return its
> space sizes in any units it wants, and tells you that unit.  The unit has
> variously been called block size and fragment size.  In Cephfs, it is
> hardcoded as 4 MiB so that 32 bit fields can represent large storage
> sizes.  I'm not aware that anyone attempts to use that value for anything
> but interpreting statvfs results.  Not saying they don't, though.
>
> What I'm looking at, in contrast, is the block size returned by a 'stat'
> system call on a particular file.  In Cephfs, it's the stripe unit size for
> the file, which is an aspect of the file's layout.  In the default layout,
> stripe unit size is 4 MiB.
>
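
To make the two "block sizes" in the quoted text concrete, here is a minimal
sketch that reads both for a given path. The function name and dictionary keys
are made up for illustration; os.statvfs and os.stat are the standard Python
wrappers around the two system calls, and the values printed depend entirely
on the filesystem the path lives on.

```python
import os


def describe_block_sizes(path):
    """Return the filesystem-wide and per-file 'block sizes' for path."""
    vfs = os.statvfs(path)  # filesystem-wide numbers, as used by 'df'
    st = os.stat(path)      # per-file numbers
    return {
        # Unit in which statvfs reports its block counts.
        "statvfs_f_frsize": vfs.f_frsize,
        # The other statvfs "block size" field.
        "statvfs_f_bsize": vfs.f_bsize,
        # f_blocks is counted in units of f_frsize.
        "total_bytes": vfs.f_blocks * vfs.f_frsize,
        # Per-file preferred I/O size reported by stat.
        "stat_st_blksize": st.st_blksize,
    }


if __name__ == "__main__":
    for key, value in describe_block_sizes(".").items():
        print(f"{key}: {value}")
```

Running this against a path on a CephFS mount should show the hardcoded
statvfs unit next to the per-file value derived from the layout, per the
description above.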

You are of course correct; sorry for the confusion.
It looks like this was introduced in (user space) commit
0457783f6eb0c41951b6d56a568eccaeccec8e6d, which switched it away from the
previous hard-coded 4096. That was probably done in the expectation that
even fairly small stripe units would still be useful units to do IO in.

You might want to try to be more sophisticated than just having a mount
option to override the reported block size: perhaps force the reported
size within some reasonable limits, but try to keep some relationship
between it and the stripe size? If someone deploys an erasure-coded pool
under CephFS, they definitely want to be doing IO in the stripe size if
possible, rather than in 4 or 8 KiB.
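
As a rough sketch of what I mean by keeping the reported size within limits
while still tying it to the stripe size: the function below is purely
hypothetical, and the 4 KiB and 1 MiB bounds are assumptions picked for
illustration, not values taken from any Ceph code. When the stripe unit is
too large it is halved until it fits, so the result still divides the
stripe unit evenly.

```python
def reported_block_size(stripe_unit, min_size=4096, max_size=1024 * 1024):
    """Hypothetical policy for the block size to report for a file.

    min_size and max_size are illustrative assumptions.
    """
    if stripe_unit <= 0:
        raise ValueError("stripe_unit must be positive")
    if stripe_unit < min_size:
        # Too small to be a useful I/O size: report the floor instead.
        return min_size
    if stripe_unit <= max_size:
        # Within limits: report the stripe unit unchanged.
        return stripe_unit
    # Too large: halve until it fits, so the reported size still
    # divides the stripe unit evenly.
    size = stripe_unit
    while size > max_size and size % 2 == 0:
        size //= 2
    # Fall back to the cap if halving could not get under it.
    return size if size <= max_size else max_size


if __name__ == "__main__":
    for unit in (1024, 65536, 4 * 1024 * 1024, 3 * 1024 * 1024):
        print(unit, "->", reported_block_size(unit))
```

A mount option could then set the bounds rather than the reported value
itself.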
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
