On Fri, Dec 14, 2018 at 6:44 PM Bryan Henderson <bry...@giraffe-data.com> wrote:
> > Going back through the logs though it looks like the main reason we do a
> > 4MiB block size is so that we have a chance of reporting actual cluster
> > sizes to 32-bit systems,
>
> I believe you're talking about a different block size (there are so many of
> them).
>
> The 'statvfs' system call (the essence of a 'df' command) can return its
> space sizes in any units it wants, and tells you that unit. The unit has
> variously been called block size and fragment size. In Cephfs, it is
> hardcoded as 4 MiB so that 32 bit fields can represent large storage sizes.
> I'm not aware that anyone attempts to use that value for anything but
> interpreting statvfs results. Not saying they don't, though.
>
> What I'm looking at, in contrast, is the block size returned by a 'stat'
> system call on a particular file. In Cephfs, it's the stripe unit size for
> the file, which is an aspect of the file's layout. In the default layout,
> stripe unit size is 4 MiB.

You are of course correct; sorry for the confusion.

It looks like this was introduced in (user space) commit
0457783f6eb0c41951b6d56a568eccaeccec8e6d, which swapped it from the
previous hard-coded 4096. That was probably done in the expectation that
even fairly small stripe units would still be a useful size to do IO in.

You might want to be more sophisticated than just having a mount option
to override the reported block size: perhaps force the reported size
within some reasonable limits, but try to keep some relationship between
it and the stripe size? If someone deploys an erasure-coded pool under
CephFS they definitely want to be doing IO in the stripe size if
possible, rather than 4 or 8KiB.
-Greg
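
To make the distinction in this thread concrete, here is a minimal Python sketch, not anything taken from the Ceph code. It reads the per-file block size from 'stat' and the space unit from 'statvfs', shows why a larger unit lets a 32-bit block count describe more storage, and illustrates the "keep it within limits but tied to the stripe size" idea. The function names and the 4 KiB / 4 MiB bounds are assumptions chosen for illustration only.

```python
import os


def block_sizes(path):
    """Return (st_blksize from stat, f_frsize from statvfs) for path.

    st_blksize is the per-file "preferred IO size"; f_frsize is the unit
    in which statvfs reports space (f_blocks, f_bfree, ...).
    """
    st = os.stat(path)
    vfs = os.statvfs(path)
    return st.st_blksize, vfs.f_frsize


def max_reportable_bytes(unit, bits=32):
    """Largest size expressible as a `bits`-wide block count of `unit` bytes."""
    return (2 ** bits - 1) * unit


def clamp_blksize(stripe_unit, lo=4096, hi=4 * 1024 * 1024):
    """Hypothetical policy: report the stripe unit, bounded to [lo, hi].

    The bounds are illustrative assumptions, not values used by CephFS.
    """
    return max(lo, min(stripe_unit, hi))


if __name__ == "__main__":
    blksize, frsize = block_sizes(".")
    print("stat st_blksize:", blksize)
    print("statvfs f_frsize:", frsize)
    print("32-bit limit at 4 KiB units:", max_reportable_bytes(4096))
    print("32-bit limit at 4 MiB units:", max_reportable_bytes(4 * 1024 * 1024))
    print("reported size for a 64 KiB stripe unit:", clamp_blksize(64 * 1024))
```

With 4 KiB units a 32-bit count tops out just under 16 TiB, while 4 MiB units reach just under 16 PiB, which is the motivation given above for the statvfs value. A simple clamp keeps the reported 'stat' size equal to the stripe unit whenever the stripe unit falls inside the chosen bounds.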
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com