Resending in plain text mode..

On Wed, Apr 8, 2015 at 11:24 AM, Jun He <[email protected]> wrote:
>
> Hi Ted,
>
> On Tue, Apr 7, 2015 at 9:44 PM, Theodore Ts'o <[email protected]> wrote:
>>
>> On Mon, Apr 06, 2015 at 01:31:41PM -0600, Jun He wrote:
>> > This patch addresses the tail allocation problem found at paper "Reducing 
>> > File System Tail Latencies with Chopper". 
>> > https://www.usenix.org/system/files/conference/fast15/fast15-paper-he.pdf 
>> > . The paper refers the tail allocation problem as the Special End problem.
>>
>> Hi Jun He,
>>
>> Kernel patches should have the commit body line-wrapped at 72 columns
>> (more or less).  The idea is that the commit should be easily readable
>> on an 80 column screen, with some extra space left for identation (git
>> log will ident the commit body by 4 characters).
>>
>> > A tail extent is the last extent of a file. The last block of the
>> > tail extent corresponds to the last logical block of the file. When
>> > a file is closed and the tail extent is being allocated, ext4 marks
>> > the allocation request with the hint EXT4_MB_HINT_NOPREALLOC. The
>> > intention is to avoid preallocation when we know the file is final
>> > (it won't change until the next open.). But the implementation leads
>> > to some problems. The following program attacks the problem.
>>
>> I think you mean "the following program _demonstrates_ the problem".
>>
>>
>> > EXT4_MB_HINT_NOPREALLOC is not the right hint for special end. If we
>> > remove current check for tail in ext4_mb_group_or_file() , 1, 3, 4
>> > will be satisfied. Now, we only need to handle the tail of large
>> > file (2) by checking tail when normalizing. If it is a tail of large
>> > file, we simply do not normalize it.
>>
>> The problem with skipping the normalization is this also by passes the
>> goal block optimizations based on pleft, pright, lleft, and lright.
>
>
> Actually bypassing this optimization is OK because similar goal has been set 
> by ext4_ext_find_goal() before. The goal optimization in 
> ext4_mb_normalize_request() is a re-alignment for normalized requests. If 
> ext4 doesn't normalize the request, it won't need to re-align.
>
>>
>> In the case of a small file, it's not always true that you want to
>> allocate from a LG preallocation.  If there is free space immediately
>> adjacent to file, we want to use that in preference to the LG
>> preallocation --- since that would result in a contiguous file.
>
>
> In the current code, the space that is immediately after a small file cannot 
> be used for the same file because EXT4_MB_HINT_NOPREALLOC prevents the tail 
> extent from using it (this is the Special End issues in the Chopper paper). 
> The hint actually makes the file discontiguous. The attached program shows 
> the problem. The patch fixes the problem.
>
>>
>> > +
>> > +     /* don't normalize tail of large files */
>> > +     if ((size == isize) &&
>> > +         !ext4_fs_is_busy(sbi) &&
>> > +         (atomic_read(&ac->ac_inode->i_writecount) == 0)) {
>> > +             return;
>> > +     }
>>
>> The comment is a bit misleading; it's not checking for large files at
>> all.  It also looks like that you copied the ext4_fs_is_busy(sbi) from
>> the original check.
>
>
> If the execution reaches here, the file must be large -- if it's small, it 
> would have gone out of the ext4_mb_normalize_request() function. It will be 
> clear if the comment states it though.
>
> I did just copy ext4_fs_is_busy()..
>
>>
>>
>> Perhaps it would be helpful to explain the design goal of the original
>> code.  As originally conceived, the problem we were trying to fix was
>> that in the case of unpacking a tar file with a large number of small
>> files, we wanted to make sure all of the files would be packed
>> togehter, without extra space reserved (caused by the normalization).
>> This was achieved by disabing the use of the locality group.  The
>> problem is that if there are a large number of processes writing into
>> the same block group, this would cause contention, and so in the case
>> where we had contention, we would use the locality group (since the
>> whole point was to eliminate the need to constantly take the block
>> group lock while allocaitng out of the block group; that's why the
>> locality groups are per-cpu).  That's what the ext4_fs_is_busy() is
>> all about.  We try to measure contention and disable the optimization
>> if that was the case.
>
>
> Now I understand ext4_fs_is_busy() is used to detect contention. If there is 
> contention, ext4 uses preallocation to avoid further contentions. If not, 
> ext4 allocates without preallocation to pack files together. But why not just 
> use LG preallocation? It can do both: packing small files together and 
> avoiding contention. The space in LG preallocation is not wasted. It can 
> always be used later. Besides, LG preallocations are created often anyway.
>
>>
>> I suspect we may want to take a step back and what's the best way of
>> handling all of the high level goals, and refomulate a better strategy:
>>
>> 1) If the file doesn't have any blocks allocated yet, and it is closed
>> by the time we need to allocate all of its blocks (thanks to delayed
>> allocation), we should allocate exactly the number of blocks needed,
>> and try to avoid fragmenting free space.
>>
>> 2) In the case where we are doing any tails at all, it's a different
>> optimization altogether, and the main goal should be to make sure the
>> tail is connected the previous blocks if at all possible.  To the
>> extent that we have per-CPU LG groups, this can be... problematic.
>>
> That's why I suggest always put small files to LG preallocations (and remove 
> scheduler dependency), which allows small files to be contiguous. For large 
> files, tail's goal should be set to the location after the existing extents.
>
>> 3) The primary goal of the per-group locality group is to avoid lock
>> contention when multiple CPU's are trying to access the blocks at the
>> same time.  Depending on why we are doing block allocation, this may
>> not be an issue.  We should probably into account why it is we re
>> doing block allocation at particular moment:
>>
>>    *) In the case of delayed allocation w/o fsync(2), the writeback
>>       will be taking place from the writeback daemon.  In that case,
>>       all of allocations will be issued from a writeback thread, so
>>       contention is much less of an issue.  In addition, in the case
>>       where the entire file has been written within 30 seconds, we
>>       will be allocating the entire file at once, as described in (1).
>>       This is a fairly common case, so we should optimize for it.
>
> I think currently ext4 handles this fine. One thing that I am worried about 
> is that the write back thread may sit on different CPU every 30 seconds, and 
> use different LG preallocations that may be far away from each other.
>>
>>
>>    *) In the case of block allocations resulting from an fallocate(2),
>>       we should assume userspace knew what it was doing, and we
>>       shouldn't try to do any kind of size normalization.  Also,
>>       fallocate() calls generally don't happen with high frequency, so
>>       worrying about CPU scalability is probalby not a major issue.
>
> I agree.
>>
>>
>>    *) In the case of nodelalloc (or ext3 compatibility mode).  In that
>>       case, we will be doing lots of singleton allocation, and it's in
>>       this case where normalization and preallocation is most important.
>
> And maybe O_DIRECT mode?
>>
>>
>>    *) The Lustre file calls ext4_map_blocks() directly.  We'll need to
>>       check with Andreas, but the user is writing a lot of small files
>>       over Lustre, it may be that lock contention will be an issue,
>>       and we might want to use LG preallocation groups.
>
> Again, I suggest small files always use LG preallocations. Files are packed 
> together and there's little contention.
>>
>>
>>    *) Writes caused by fsync(2).  (For example, from SQLite files.)
>>       In this case, lock contention might very well be an issue, and
>>       use of the LG preallocation group some kind of file
>>       normalization may very well be a good idea.  In fact, we might
>>       want to have a hueristic which notices files which are written
>>       using either a slow append or an append followed by an fsync
>>       workload, and make sure we use an inode perallocation group, and
>>       in the case of a slow append workload, we might want to hold the
>>       preallocation over a file close (with some way of releasing
>>       blocks in an ENOSPC situation).
>
> Maybe SQLite can pass a hint to block allocator. Maybe ext4 should predict by 
> previous pattern.
>>
>>
>> Some of the mballoc failure modes which you found in the Chopper paper
>> was specifically we weren't taking these various different scenarios
>> into account, and so we used an optimization that worked well when we
>> were allocating the whole file at a time (thanks to delalloc and
>> writebacks), and tht hueristic failed miserably in the case of large
>> files where we happened to be writing a tail.
>>
>> So before we start playing with changing hueristics, it would be a
>> good idea if we make sure the block allocator has enough context so it
>> understands how it is being called and by whom, and make sure we that
>> we don't use the wrong hueristic by accident.
>>
>> Does this make sense?
>
> Yes. I'll keep it in mind while doing my research.
>
> Thanks,
> Jun
>
>>
>>
>>                                         - Ted
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to