This patch addresses the tail allocation problem found at paper "Reducing File
System Tail Latencies with Chopper".
https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf . The
paper refers the tail allocation problem as the Special End problem.
Here is a description of the problem:
A tail extent is the last extent of a file. The last block of the tail extent
corresponds to the last logical block of the file. When a file is closed and
the tail extent is being allocated, ext4 marks the allocation request with the
hint EXT4_MB_HINT_NOPREALLOC. The intention is to avoid preallocation when we
know the file is final (it won't change until the next open.). But the
implementation leads to some problems. The following program attacks the
problem.
/****************************************/
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
int main(int argc, char **argv)
{
int fd;
int size, off;
char buf[4096];
size = 4096;
fd = open(argv[1], O_WRONLY|O_CREAT);
if ( fd == -1 ) {
perror("opening file");
exit(1);
}
off = 0;
pwrite(fd, buf, size, off);
printf("wrote at %d, size %d bytes\n", off, size);
fsync(fd);
off = off + size;
pwrite(fd, buf, size, off);
printf("wrote at %d, size %d bytes\n", off, size);
close(fd);
sync();
}
/****************************************/
Mount an ext4 on /mnt/ext4onloop and run the program by
$ ./a.out /mnt/ext4onloop/testfile
Check the extents by
$ filefrag -sv /mnt/ext4onloop/testfile
Filesystem type is: ef53
File size of /mnt/ext4onloop/testfile is 8192 (2 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 33280 1
1 1 33025 33281 1 eof
/mnt/ext4onloop/testfile: 2 extents found
The first 4KB of the file is forced to disk by fsync() and allocated from a
locality group preallocation. The second 4KB of the file is forced to disk by
sync() (If you don't call sync(), the effect would be the same as the write
back thread will eventually kick it and force the data to disk.). Since the
second 4KB is allocated when the file is closed and the extent is the tail
extent, it is allocated with the hint EXT4_MB_HINT_NOPREALLOC. This hint
prevents the second 4KB from being allocated from locality group preallocation.
So the second 4KB is not allocated next to the first 4KB.
The program above demonstrates the tail allocation problem. The Chopper paper
demonstrates that the problem could drag extents of the same file very far
(GBs) away from each other.
The EXT4_MB_HINT_NOPREALLOC hint currently means:
1. if there is an i-node preallocation for this file, use it;
2. do not use LG (locality group) preallocations, even they are available (I
think this is not intended);
3. do not create any new preallocations.
EXT4_MB_HINT_NOPREALLOC is not the right hint for tail allocation. What we
really want is:
1. if file is large and there is an i-node preallocation for this file, use it;
2. if file is large and there is no i-node preallocation for this file, do not
create new inode preallocation. Simply allocate as needed.
3. if file is small and there is LG preallocations, use it.
4. if file is small and there is no LG preallocations, create a LG
preallocation and allocate the tail extent from it. LG preallocations are
always useful because small files almost always come. In the case that a write
back thread writes a lot of new small files, this design will group the small
files together and reduce seeks. Also, allocating many small files from LG
preallocation space should be faster than allocating them from buddy cache
separately.
To get the correct behaviors:
EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we remove
current check for tail in ext4_mb_group_or_file() , 1, 3, 4 will be satisfied.
Now, we only need to handle the tail of large file (2) by checking tail when
normalizing. If it is a tail of large file, we simply do not normalize it.
The patch with Linux 4.0rc6 has been tested with kvm-xfstests. generic/256
failed. But I wouldn't worry about it because the vanilla kernel also failed...
Signed-off-by: Jun He <[email protected]>
---
fs/ext4/mballoc.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8d1e602..49c8547 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3004,7 +3004,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context
*ac,
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
int bsbits, max;
ext4_lblk_t end;
- loff_t size, start_off;
+ loff_t size, isize, start_off;
loff_t orig_size __maybe_unused;
ext4_lblk_t start;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
@@ -3019,8 +3019,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context
*ac,
if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
return;
- /* caller may indicate that preallocation isn't
- * required (it's a tail, for example) */
+ /* caller may indicate that preallocation isn't required */
if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
return;
@@ -3029,11 +3028,21 @@ ext4_mb_normalize_request(struct
ext4_allocation_context *ac,
return ;
}
- bsbits = ac->ac_sb->s_blocksize_bits;
-
/* first, let's learn actual file size
* given current request is allocated */
size = ac->ac_o_ex.fe_logical + EXT4_C2B(sbi, ac->ac_o_ex.fe_len);
+
+ bsbits = ac->ac_sb->s_blocksize_bits;
+ isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
+ >> bsbits;
+
+ /* don't normalize tail of large files */
+ if ((size == isize) &&
+ !ext4_fs_is_busy(sbi) &&
+ (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
+ return;
+ }
+
size = size << bsbits;
if (size < i_size_read(ac->ac_inode))
size = i_size_read(ac->ac_inode);
@@ -4107,13 +4116,6 @@ static void ext4_mb_group_or_file(struct
ext4_allocation_context *ac)
isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
>> bsbits;
- if ((size == isize) &&
- !ext4_fs_is_busy(sbi) &&
- (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
- ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
- return;
- }
-
if (sbi->s_mb_group_prealloc <= 0) {
ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
return;
--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/