On Fri, Dec 14, 2018 at 7:50 AM Bryan Henderson <[email protected]> wrote:
>
> I've searched the ceph-users archives and found no discussion to speak of of
> Cephfs block sizes, and I wonder how much people have thought about it.
>
> The POSIX 'stat' system call reports for each file a block size, which is
> usually defined vaguely as the smallest read or write size that is efficient.
> It usually takes into account that small writes may require a
> read-modify-write and there may be a minimum size on reads from backing
> storage.
>
> One thing that uses this information is the stream I/O implementation
> (fopen/fclose/fread/fwrite) in GNU libc. It always reads and usually writes
> full blocks, buffering as necessary.
>
I tested fread on Fedora 28. It issues 8k reads even when the reported block size is 4M.
> Most filesystems report this number as 4K.
>
NFS reports a 1M block size.
> Ceph reports the stripe unit (stripe column size), which is the maximum size
> of the RADOS objects that back the file. This is 4M by default.
>
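To compare what different mounts report, a small helper along these lines can print the value. This is only a sketch: stat_block_size is a name made up for illustration, and the value returned depends entirely on the filesystem the path lives on.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of any library): return the
 * st_blksize that stat(2) reports for path, or -1 on error.
 */
long stat_block_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}
```

Calling stat_block_size() on a file in a local filesystem, an NFS mount and a Cephfs mount should show the differences discussed here.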
> One result of this is that a program uses a thousand times more buffer space
> when running against a Ceph file as against a traditional filesystem.
>
> And a really pernicious result occurs when you have a special file in Cephfs.
> Block size doesn't make any sense at all for special files, and it's probably
> a bad idea to use stream I/O to read one, but I've seen it done. The Chrony
> clock synchronizer programs use fread to read random numbers from
> /dev/urandom. Should /dev/urandom be in a Cephfs filesystem, with defaults,
> it's going to generate 4M of random bits to satisfy a 4-byte request. On one
> of my computers, that takes 7 seconds - and wipes out the entropy pool.
>
This patch should address this issue.
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index c50501c6005a..7f82ceff510a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -907,6 +907,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page,
 	case S_IFBLK:
 	case S_IFCHR:
 	case S_IFSOCK:
+		inode->i_blkbits = PAGE_SHIFT;
 		init_special_inode(inode, inode->i_mode, inode->i_rdev);
 		inode->i_op = &ceph_file_iops;
 		break;
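On kernels without such a change, a program that must use stream I/O on a special file can probably sidestep the problem by turning off buffering before the first read. The following is a rough sketch, not what Chrony does: read_unbuffered is a hypothetical name, and the assumption is that with _IONBF the C library asks the kernel only for the bytes requested.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read exactly n bytes from a special file
 * such as /dev/urandom without a block-sized stdio buffer.
 * Returns 0 on success, -1 on failure.
 */
int read_unbuffered(const char *path, void *buf, size_t n)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (f == NULL)
        return -1;

    /* setvbuf must be called before any other operation on the stream. */
    if (setvbuf(f, NULL, _IONBF, 0) != 0) {
        fclose(f);
        return -1;
    }

    got = fread(buf, 1, n, f);
    fclose(f);
    return got == n ? 0 : -1;
}
```

Using open(2) and read(2) directly would avoid the stdio layer altogether and is likely the simpler choice for new code.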
>
> Has stat block size been discussed much? Is there a good reason that it's
> the RADOS object size?
>
> I'm thinking of modifying the cephfs filesystem driver to add a mount option
> to specify a fixed block size to be reported for all files, and using 4K or
> 64K. Would that break something?
A mount option should work.
Regards
Yan, Zheng
>
> --
> Bryan Henderson San Jose, California
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com