Re: Documenting VFS, the next installment (was: Tailmerging for Ext2)

2000-08-08 Thread Malcolm Beattie

[EMAIL PROTECTED] writes:
 Previously I was talking about three functions:
  
  1) generic_file_read
  2) generic_file_write
  3) generic_file_mmap

Here are some comments of mine, offered because I'd like to check my
own understanding of this sort of thing, and trying to explain it in
writing is a good way of doing so.

 and had this to say:
 
 "Each of these functions works its magic through a 'mapping' defined by
 an address_space struct.  In OO terminology this would be a memory
 mapping class with instances denoting the mapping between a section of
 memory and a particular kind of underlying secondary storage.  So what
 we have looks a lot like a generalization of a swap file - all (or the
 vast majority of) file I/O in Linux is now done by memory-mapping..."
 
 "The three functions in question make use of a 'page cache', which
 appears to be a set of memory page 'heads' (page head - my terminology,
 since I haven't seen any other so far - an instance of struct page)

The "struct page" represents (and is the unit of currency) of the
physical page. You mention the idea of a MemoryPage lower down in the
context of allocating memory but the struct page is even more basic.
Every single physical page of "real" memory accessible to the kernel
has/is a "struct page". Since there's no architecture-portable way of
tacking on extra bits of data to a physical page, Linux does it by
having a separate array (mem_map[]) indexed by page frame number and
mem_map[1000] is a structure containging various bits and pieces of
information associated with page frame number 1000. In passing, an
architecture like S390 *does* in fact store some extra fields for
each physical page: a key and some ref/access bits. So thinking of a
struct page as a "real physical page" along with some private
attributes (in an OO sense) is a good idea.
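To make the mem_map[] idea concrete, here's a toy model in C - the
field names only loosely follow 2.4's struct page and the machine size
is invented, so treat it as a sketch, not the real definition:

    #define NR_PAGES (1UL << 18)   /* toy machine: 256K frames = 1GB RAM */

    /* Loosely modelled on 2.4's struct page; the real one carries many
     * more fields (reference count, hash chain, LRU links, ...). */
    struct page {
        void *mapping;             /* which address_space caches us */
        unsigned long index;       /* offset within that mapping */
        unsigned long flags;       /* dirty, uptodate, locked, ... */
    };

    /* One descriptor per physical page frame, indexed by frame number,
     * because there is no portable way to tack attributes onto the
     * physical page itself. */
    static struct page mem_map[NR_PAGES];

    static struct page *pfn_to_page(unsigned long pfn)
    {
        return &mem_map[pfn];
    }

    static unsigned long page_to_pfn(struct page *page)
    {
        return (unsigned long)(page - mem_map);
    }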

A struct page represents a physical page of memory and has nothing to
do with addresses: one may not even be able to directly write to that
physical page from the kernel. For example, with PAE36 on x86, "high"
memory pages (roughly speaking those beyond about page number 256000,
or 1GB into physical memory) aren't mapped directly in the kernel's
address space. They are reached by mapping them temporarily with
calls that look roughly like this:

    addr = kmap(page);
    /* ... can now use address addr to read/write the page ... */
    kunmap(page);

or else they are mapped into a process address space with
remap_page_range. The "addr = kmap(); ...; kunmap()" way of
accessing memory reminds me of the PalmOS way of allocating and
reading/writing memory where you are only granted temporary
permission to write to memory, possibly via a different virtual
address each time. It's a nice way to separate physical and
virtual memory both as an API and while thinking about it.
Obviously, Linux uses various caches and cleverness behind the
scenes to avoid too much of a map/unmap/map/unmap performance hit.
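To make that pattern concrete, a minimal 2.4-style sketch
(copy_into_page() is a name I've invented for illustration):

    #include <linux/highmem.h>     /* kmap(), kunmap() */
    #include <linux/string.h>      /* memcpy() */

    /* Copy len bytes into a (possibly highmem) physical page. The page
     * is only guaranteed to be addressable between kmap() and kunmap(). */
    static void copy_into_page(struct page *page, const void *src,
                               size_t len)
    {
        char *addr = kmap(page);   /* temporary kernel virtual mapping */
        memcpy(addr, src, len);
        kunmap(page);              /* the virtual address may be reused */
    }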

 indexed by offset expressed in units of PAGE_SIZE (4096 on most
 architectures), munged together with the address of the mapping struct. 
 So we use the page offset within a file plus a unique characteristic of
 the file (its mapping struct), followed by a linear search to find a
 piece of memory that has already been set up to map the file/offset
 we're interested in."

The page cache is simply a fast hash lookup from
(address_space, offset) to the struct page holding that data. It used
to be (inode, offset) but, from an OO point of view, there's no
reason why the "lookup key" needs to be an inode, so inventing a
new type of "address_space" object (of which an inode is one example)
works nicely.
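
A toy model of that lookup, with invented names and a deliberately
simplistic hash function (the real thing lives in
include/linux/pagemap.h and mm/filemap.c):

    #define PAGE_HASH_BITS 10
    #define PAGE_HASH_SIZE (1UL << PAGE_HASH_BITS)

    struct address_space;                   /* the "lookup key" object */

    struct page {
        struct address_space *mapping;      /* which object we cache */
        unsigned long index;                /* page-sized offset in it */
        struct page *next_hash;             /* chain in one hash bucket */
    };

    static struct page *page_hash_table[PAGE_HASH_SIZE];

    static unsigned long page_hashfn(struct address_space *mapping,
                                     unsigned long index)
    {
        return ((unsigned long)mapping / sizeof(void *) + index)
               & (PAGE_HASH_SIZE - 1);
    }

    /* Walk the bucket for (mapping, index); NULL means "not cached",
     * and the caller must allocate a page and read it from disk. */
    static struct page *find_page(struct address_space *mapping,
                                  unsigned long index)
    {
        struct page *p = page_hash_table[page_hashfn(mapping, index)];
        for (; p; p = p->next_hash)
            if (p->mapping == mapping && p->index == index)
                return p;
        return NULL;
    }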

 Understanding the mapping between pages and buffers is crucial to
 understanding those three functions.  In (1) and (2) operations are
 performed directly on the pages and buffers, and these operations are
 quite easy to understand in spite of the fact that the function call
 chain wanders around and goes very deep at times.  At least it's all
 synchronous.

Being synchronous may make the coding simpler (and hence possibly
easier to understand looking through it) but I find the underlying
abstractions easier to understand without worrying about whether
it's asynchronous/synchronous. Basically, you can (a) initiate I/O
to write/read into a physical page, (b) ask whether a particular
page is uptodate/dirty, (c) do a blocking wait for a particular
page to become up to date and, sort of, (d) ask to get your own
function called back when an I/O you initiated is complete.
Thinking about it that way lets you separate the vm, filesystem,
and I/O request subsystems without them interfering with each
other. Again, nice from the OO point of view.
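
Here's a self-contained toy model of those four operations; all the
names are invented, though the real 2.4 spellings are along the lines
of ll_rw_block(), PageUptodate(), wait_on_page() and bh->b_end_io:

    #include <stdio.h>

    struct toy_page {
        int uptodate;
        void (*end_io)(struct toy_page *);  /* (d) completion callback */
    };

    /* (a) initiate I/O; a real kernel queues a request and returns */
    static void start_read(struct toy_page *p)
    {
        p->uptodate = 1;                    /* pretend the disk finished */
        if (p->end_io)
            p->end_io(p);                   /* (d) notify the initiator */
    }

    /* (b) ask about the page's state */
    static int page_uptodate(struct toy_page *p) { return p->uptodate; }

    /* (c) blocking wait; a real kernel sleeps on a wait queue here */
    static void wait_on_toy_page(struct toy_page *p)
    {
        while (!p->uptodate)
            ;
    }

    static void read_done(struct toy_page *p)
    {
        printf("read complete, uptodate=%d\n", p->uptodate);
    }

    int main(void)
    {
        struct toy_page p = { 0, read_done };
        if (!page_uptodate(&p)) {           /* (b) */
            start_read(&p);                 /* (a), fires (d) */
            wait_on_toy_page(&p);           /* (c) */
        }
        return 0;
    }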

 struct vm_operations_struct {
         void (*open)(struct vm_area_struct *area);
         void (*close)(struct vm_area_struct *area);
         void (*unmap)(struct vm_area_struct *area, unsigned long, size_t);
         void (*protect)(struct vm_area_struct *area, unsigned long, size_t,
                         unsigned int newprot);
         int (*sync)(struct vm_area_struct *area, unsigned long, size_t,
                     unsigned int flags);
         struct page * (*nopage)(struct vm_area_struct *area,
                                 unsigned long address, int write_access);
         struct page * (*wppage)(struct vm_area_struct *area,
                                 unsigned long address, struct page *page);
         int (*swapout)(struct page *, struct file *);
 };
 
 At this point 
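
For context on how that table gets used, 2.4's mm/filemap.c wires a
file mapping up to it roughly as follows (an abridged sketch from
memory, not the exact source):

    /* Faults on a generic file mapping are filled from the page cache. */
    static struct vm_operations_struct generic_file_vm_ops = {
            nopage:         filemap_nopage,
    };

    int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct inode *inode = file->f_dentry->d_inode;

            /* without a readpage method we cannot fault pages in */
            if (!inode->i_mapping->a_ops->readpage)
                    return -ENOEXEC;
            vma->vm_ops = &generic_file_vm_ops;
            return 0;
    }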

Re: Tailmerging for Ext2

2000-08-06 Thread Hans Reiser

Alexander Viro wrote:
 
 On Wed, 26 Jul 2000, Daniel Phillips wrote:
 
  Stephen asked me some sharp questions about how this would work,
  and after I answered them to his satisfaction he asked me if I
  would have time to implement this feature.  I said yes, and went
  on to write an initial design document describing the required
  modifications to Ext's handling of inodes, and a prototype
  algorithm for doing the tail merging.
 
 Here is one more for you:
 Suppose we grow the last fragment/tail/whatever. Do you copy the
 data out of that shared block? If so, how do you update buffer_heads in
 pages that cover the relocated data? (Same goes for reiserfs, if they are
 doing something similar). BTW, our implementation of UFS is fucked up in
 that respect, so a variant from there will not work.
Hmmm, this is the question I faced in 1993.  Balanced trees were the answer.

Hans



Re: Tailmerging for Ext2

2000-08-06 Thread Hans Reiser

Post your document on the reiserfs mailing list when you finish it; the
ReiserFS team will enjoy reading it.

Hans

Daniel Phillips wrote:
 
 Alexander Viro wrote:
  On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
   On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote:
  
Erm? Consider that: huge lseek() + write past the end of file. Woops - got
to unmerge the tail (it's an internal block now) and we've got no
knowledge of IO going on the page. Again, IO may be asynchronous - no
protection from i_sem for us. After that page becomes a regular one,
right? Looks like a change of state to me...
  
   Naturally, and that change of state must be made atomically by the
   filesystem.
 
  Yep. Which is the point - there _are_ dragons. I believe that it's doable,
  but I really want to repeat: Daniel, watch out for races at the moments
  when page state changes; it needs a more accurate approach than the usual
  pagecache-using fs. It can be done, but it will take some reading (and
  yes, Stephen, I know that _you_ know it ;-)
 
 That's apparent, and I feel that Stephen could probably implement the entire
 tail merge as described so far in a few days.  But that wouldn't be as useful as
 having me and perhaps some other interested observers go all the way through
 the exercise of figuring out the so-far unwritten rules of the
 buffercache/pagecache duo.
 
 The exact same accurate work is required for Tux2, which makes massive use of
 copy-on-write.  Right now, buffer issues are the main thing standing in the way
 of making a development code release for Tux2.  So there is no question in my
 mind about whether such issues have to be dealt with: they do.
 
 I dove into the 2.4.0 cache code for the first time last night (using lxr - try
 it, you'll like it) and I'm almost at the point where I have some relevant
 questions to ask.  I notice that buffer.c has increased in size by almost 50%
 and is far and away the largest module in the VFS.  Worse, buffer.c is massively
 cross-coupled to the mm subsystem and the page cache, as we know too well.
 Buffer.c is right at the core of the issues we're talking about.
 
 Bearing that in mind, instead of just jumping in and starting to code I'll try
 the methodical approach :-)  My immediate objective is to try to clarify a few
 things that aren't immediately obvious from the source, in the following areas:
 
   - States and transitions for the main objects:
       - Buffer heads
       - Buffer data
       - Page heads
       - Page data
       - Other?
 
   - Existing concurrency controls:
       - Semaphores/Spinlocks
       - Big kernel lock
       - Filesystem locks
       - Posix locks?
       - Other?
 
   - Planned additions/deletions of concurrency controls
 
 I will also try to make a list of the main internal functions in the VFS (and
 some related ones from the mm and drivers modules) and examine
 function-by-function what the intended usage is, what the issues/caveats are,
 and maybe even how we can expect them to evolve in the future.
 
 I think we need even more than this in terms of documentation in order to work
 effectively, but this at least will be a good start.  It will be more than what
 we have now.  If it gets to the point where we can actually answer questions
 about race conditions by consulting the docs then we really will have
 accomplished something.  Yes, I know that the code is going to keep evolving and
 sometimes will break the docs, but I also have confidence that the docs can keep
 up with such evolution given some interested volunteer doc maintainers willing
 to hang out on the devel list and keep asking questions.
 
 Even in 2.2.x I felt that there was a lot of understated elegance in Linux's
 buffer cache design.  In 2.4.0 it seems to be getting more elegant, although
 it's hard to say exactly, because of the sparse (read: nonexistent)
 documentation.  This is a problem that can be easily fixed.
 
 To get through this I will have to ask a lot of naive-sounding questions.
 Hopefully I'll have the first batch ready this afternoon (morning, your time).
 
 --
 Daniel



Re: Tailmerging for Ext2

2000-07-28 Thread Daniel Phillips

Alexander Viro wrote:
 
 On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
 
  For tail writes, I'd imagine we would just end up using the page cache
  as a virtual cache as NFS uses it, and doing plain copy into the
  buffer cache pages.
 
 Ouch. I _really_ don't like it - we end up with special behaviour on one
 page in the pagecache. And getting data migration from buffer cache to
 page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that
 tail is going to be, erm, merged? What will happen with the page then?

Notwithstanding the fact that I intend to try to figure this out for myself over
the weekend by reading the code, could I please have a simple explanation of the
relationship between the page cache and buffer cache?

Specific questions: What shall we call the struct that describes a page - a page
header?  Is this a descriptor for a virtual page or a physical page?  (I think:
physical page.)  What exactly is the function of the list of buffer pages
hanging off a page descriptor?  Do these hold the filesystem blocks that map the
page in virtual memory?

I'm sure all of this was discussed somewhere on the kernel list sometime in the
past, and I missed it.

-- 
Daniel



Re: Tailmerging for Ext2

2000-07-27 Thread Daniel Phillips

Alexander Viro wrote:
 On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:
  On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote:
 
   Erm? Consider that: huge lseek() + write past the end of file. Woops - got
   to unmerge the tail (it's an internal block now) and we've got no
   knowledge of IO going on the page. Again, IO may be asynchronous - no
   protection from i_sem for us. After that page becomes a regular one,
   right? Looks like a change of state to me...
 
  Naturally, and that change of state must be made atomically by the
  filesystem.
 
 Yep. Which is the point - there _are_ dragons. I believe that it's doable,
 but I really want to repeat: Daniel, watch out for races at the moments
 when page state changes; it needs a more accurate approach than the usual
 pagecache-using fs. It can be done, but it will take some reading (and
 yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire
tail merge as described so far in a few days.  But that wouldn't be as useful as
having me and perhaps some other interested observers go all the way through
the exercise of figuring out the so-far unwritten rules of the
buffercache/pagecache duo.

The exact same accurate work is required for Tux2, which makes massive use of
copy-on-write.  Right now, buffer issues are the main thing standing in the way
of making a development code release for Tux2.  So there is no question in my
mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try
it, you'll like it) and I'm almost at the point where I have some relevant
questions to ask.  I notice that buffer.c has increased in size by almost 50%
and is far and away the largest module in the VFS.  Worse, buffer.c is massively
cross-coupled to the mm subsystem and the page cache, as we know too well. 
Buffer.c is right at the core of the issues we're talking about.

Bearing that in mind, instead of just jumping in and starting to code I'll try
the methodical approach :-)  My immediate objective is to try to clarify a few
things that aren't immediately obvious from the source, in the following areas:

  - States and transitions for the main objects:
      - Buffer heads
      - Buffer data
      - Page heads
      - Page data
      - Other?

  - Existing concurrency controls:
      - Semaphores/Spinlocks
      - Big kernel lock
      - Filesystem locks
      - Posix locks?
      - Other?

  - Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and
some related ones from the mm and drivers modules) and examine
function-by-function what the intended usage is, what the issues/caveats are,
and maybe even how we can expect them to evolve in the future.

I think we need even more than this in terms of documentation in order to work
effectively, but this at least will be a good start.  It will be more than what
we have now.  If it gets to the point where we can actually answer questions
about race conditions by consulting the docs then we really will have
accomplished something.  Yes, I know that the code is going to keep evolving and
sometimes will break the docs, but I also have confidence that the docs can keep
up with such evolution given some interested volunteer doc maintainers willing
to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there was a lot of understated elegance in Linux's
buffer cache design.  In 2.4.0 it seems to be getting more elegant, although
it's hard to say exactly, because of the sparse (read: nonexistent)
documentation.  This is a problem that can be easily fixed.

To get through this I will have to ask a lot of naive-sounding questions. 
Hopefully I'll have the first batch ready this afternoon (morning, your time).

-- 
Daniel



Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Daniel Phillips wrote:

 Stephen asked me some sharp questions about how this would work,
 and after I answered them to his satisfaction he asked me if I 
 would have time to implement this feature.  I said yes, and went 
 on to write an initial design document describing the required
 modifications to Ext's handling of inodes, and a prototype
 algorithm for doing the tail merging.

Here is one more for you:
Suppose we grow the last fragment/tail/whatever. Do you copy the
data out of that shared block? If so, how do you update buffer_heads in
pages that cover the relocated data? (Same goes for reiserfs, if they are
doing something similar). BTW, our implementation of UFS is fucked up in
that respect, so a variant from there will not work.




Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote:
 
 Here is one more for you:
   Suppose we grow the last fragment/tail/whatever. Do you copy the
 data out of that shared block? If so, how do you update buffer_heads in
 pages that cover the relocated data? (Same goes for reiserfs, if they are
 doing something similar). BTW, our implementation of UFS is fucked up in
 that respect, so a variant from there will not work.

For tail writes, I'd imagine we would just end up using the page cache
as a virtual cache as NFS uses it, and doing plain copy into the
buffer cache pages.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

 Hi,
 
 On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote:
  
  Here is one more for you:
  Suppose we grow the last fragment/tail/whatever. Do you copy the
  data out of that shared block? If so, how do you update buffer_heads in
  pages that cover the relocated data? (Same goes for reiserfs, if they are
  doing something similar). BTW, our implementation of UFS is fucked up in
  that respect, so a variant from there will not work.
 
 For tail writes, I'd imagine we would just end up using the page cache
 as a virtual cache as NFS uses it, and doing plain copy into the
 buffer cache pages.

Ouch. I _really_ don't like it - we end up with special behaviour on one
page in the pagecache. And getting data migration from buffer cache to
page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that
tail is going to be, erm, merged? What will happen with the page then?




Re: Tailmerging for Ext2

2000-07-26 Thread Daniel Phillips

On Wed, 26 Jul 2000, Alexander Viro wrote:
 On Wed, 26 Jul 2000, Daniel Phillips wrote:
 
  Stephen asked me some sharp questions about how this would work,
  and after I answered them to his satisfaction he asked me if I 
  would have time to implement this feature.  I said yes, and went 
  on to write an initial design document describing the required
  modifications to Ext's handling of inodes, and a prototype
  algorithm for doing the tail merging.
 
 Here is one more for you:
   Suppose we grow the last fragment/tail/whatever. Do you copy the
 data out of that shared block? If so, how do you update buffer_heads in
 pages that cover the relocated data? (Same goes for reiserfs, if they are
 doing something similar). BTW, our implementation of UFS is fucked up in
  that respect, so a variant from there will not work.

Please bear in mind that I don't pretend to be an expert on the VFS, and
especially its latest incarnation in 2.4.0.  I'm coming to grips with it now. 
Notwithstanding that, I'll try to provide some insight anyway.

   Suppose we grow the last fragment/tail/whatever. Do you copy the
 data out of that shared block? 

Yes, except possibly in the case where the fragment grows by an amount that
will still fit in the shared block.  Even in that case, you might want to
ignore the possible optimization and copy it out mindlessly, on the assumption
that another write is coming soon.  My plan is to do the incremental merging at
file close time.

 If so, how do you update buffer_heads in
 pages that cover the relocated data? 

We have to be sure that if blocks are buffered then they are buffered in
exactly one place and you always access them through the buffer hash
table.  So far so good, but the picture gets murkier for me when you talk
about the page cache.  I'm not clear yet on the details of how the buffer cache
interacts with the page cache, and perhaps you can help shed some light on
that.  Until I am clear on it, I'll hold off commenting.
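
For reference, the classic way to reach a metadata block through the
buffer hash table is the bread()/brelse() pair; a 2.4-style fragment
with the error handling trimmed (the function name and arguments are
illustrative):

    #include <linux/fs.h>          /* bread(), brelse(), buffer_head */

    /* Look the block up in the buffer hash table, reading it from disk
     * only on a miss. */
    static int peek_meta_block(kdev_t dev, int block, int size)
    {
        struct buffer_head *bh = bread(dev, block, size);
        if (!bh)
            return -1;
        /* bh->b_data now holds the block's contents */
        brelse(bh);                /* drop our reference */
        return 0;
    }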

 (Same goes for reiserfs, if they are doing something similar).

I don't know exactly what ReiserFS does - I just heard Hans mention the term
'tail merging' and I could see that it was a good idea.

 BTW, our implementation of UFS is fucked up in that respect, so a variant
 from there will not work.

I'm not sure what you mean there...

-- 
Daniel



Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Daniel Phillips wrote:

  If so, how do you update buffer_heads in
  pages that cover the relocated data? 
 
 We have to be sure that if blocks are buffered then they are buffered in
 exactly one place and you always access them through the buffer hash
 table.  So far so good, but the picture gets murkier for me when you talk

No. Data normally is in the page. Buffer_heads are not included in the buffer
cache. They are referred to from the struct page and their ->b_data just
points to the appropriate pieces of the page. You cannot get them via bread().
At all. The buffer cache is only for metadata.
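
A toy sketch of that arrangement - static storage instead of the
kernel's real buffer_head allocation, and only the fields that matter
here:

    #define PAGE_SIZE 4096UL

    struct buffer_head {
        char *b_data;                    /* points INTO the page itself */
        unsigned long b_blocknr;         /* disk block this bh maps */
        struct buffer_head *b_this_page; /* circular ring around the page */
    };

    struct page_like {
        char data[PAGE_SIZE];
        struct buffer_head *buffers;     /* 2.4's page->buffers */
    };

    /* Carve a page into per-block buffer_heads, the way 2.4's
     * create_empty_buffers() does; bhs[] must hold PAGE_SIZE/blocksize
     * entries. */
    static void attach_buffers(struct page_like *page,
                               struct buffer_head *bhs,
                               unsigned long blocksize)
    {
        unsigned long i, nr = PAGE_SIZE / blocksize;

        for (i = 0; i < nr; i++) {
            bhs[i].b_data = page->data + i * blocksize;
            bhs[i].b_blocknr = 0;        /* nowhere on disk yet */
            bhs[i].b_this_page = &bhs[(i + 1) % nr];
        }
        page->buffers = &bhs[0];
    }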

  BTW, our implementation of UFS is fucked up in that respect, so a variant
  from there will not work.
 
 I'm not sure what you mean there...

I mean that UFS has the same problem (relocation of the last fragment) and
our implementation is fucked up (== does not deal with that properly and
eats data). So if you go looking for existing solutions, forget about the
UFS one; it isn't one. UFS will need fixing, but that's a separate story...




Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote:
 
 No. Data normally is in the page. Buffer_heads are not included in the buffer
 cache. They are referred to from the struct page and their ->b_data just
 points to the appropriate pieces of the page. You cannot get them via bread().
 At all. The buffer cache is only for metadata.

Only in the default usage.  There's no reason at all why we can't use
separate buffer and page cache aliases of the same data for tails as a
special case.

Cheers,
 Stephen



Re: Tailmerging for Ext2

2000-07-26 Thread Stephen C. Tweedie

Hi,

On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote:

  For tail writes, I'd imagine we would just end up using the page cache
  as a virtual cache as NFS uses it, and doing plain copy into the
  buffer cache pages.
 
 Ouch. I _really_ don't like it - we end up with special behaviour on one
 page in the pagecache.

Correct.  But it's all inside the filesystem, so there is zero VFS
impact.  And we're talking about non-block-aligned data for tails, so
we simply don't have a choice in this case.

 And getting data migration from buffer cache to
 page cache, which is Not Nice(tm).

Not preferred for bulk data, perhaps, but the VFS should cope just
fine.

 Yuck... Besides, when do we decide that
 tail is going to be, erm, merged? What will happen with the page then?

To the page?  Nothing.  To the buffer?  It gets updated with the new
contents of disk.  Page == virtual contents.  Buffer == physical
contents.  Plain and simple.

Cheers,
 Stephen




Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

 Hi,
 
 On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote:
  
  No. Data normally is in the page. Buffer_heads are not included in the buffer
  cache. They are referred to from the struct page and their ->b_data just
  points to the appropriate pieces of the page. You cannot get them via bread().
  At all. The buffer cache is only for metadata.
 
 Only in the default usage.  There's no reason at all why we can't use
 separate buffer and page cache aliases of the same data for tails as a
 special case.

In theory - yes, but doing that will require a _lot_ of accurate thinking
about possible races. IOW, I'm afraid that the tail-to-normal-block
transitions will be race-prone. Paint me over-cautious, but after
you-know-what... Oh, well... I'm not saying that it's impossible, but I
_really_ recommend taking a hard look at race scenarios - there is a
potential for plenty of them.




Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Stephen C. Tweedie wrote:

 Hi,
 
 On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote:
 
   For tail writes, I'd imagine we would just end up using the page cache
   as a virtual cache as NFS uses it, and doing plain copy into the
   buffer cache pages.
  
  Ouch. I _really_ don't like it - we end up with special behaviour on one
  page in the pagecache.
 
 Correct.  But it's all inside the filesystem, so there is zero VFS
 impact.  And we're talking about non-block-aligned data for tails, so
 we simply don't have a choice in this case.

<shrug> Sure, it's not a VFS problem (albeit it _will_ require accurate
playing with unmap_() in buffer.c), but ext2 problems are pretty
interesting too...

  And getting data migration from buffer cache to
  page cache, which is Not Nice(tm).
 
 Not preferred for bulk data, perhaps, but the VFS should cope just
 fine.
 
  Yuck... Besides, when do we decide that
  tail is going to be, erm, merged? What will happen with the page then?
 
 To the page?  Nothing.  To the buffer?  It gets updated with the new
 contents of disk.  Page == virtual contents.  Buffer == physical
 contents.  Plain and simple.

Erm? Consider that: huge lseek() + write past the end of file. Woops - got
to unmerge the tail (it's an internal block now) and we've got no
knowledge of IO going on the page. Again, IO may be asynchronous - no
protection from i_sem for us. After that page becomes a regular one,
right? Looks like a change of state to me...
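
For the record, the scenario is trivial to trigger from userspace; a
minimal sketch (the file name is invented):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("f", O_WRONLY | O_CREAT, 0644);

        write(fd, "tail", 4);              /* small file: eligible for
                                              tail merging */
        lseek(fd, 1024 * 1024, SEEK_SET);  /* huge seek past the end */
        write(fd, "x", 1);                 /* the old tail is now an
                                              internal block and must be
                                              unmerged, perhaps while
                                              async I/O is in flight */
        close(fd);
        return 0;
    }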




Re: Tailmerging for Ext2

2000-07-26 Thread Chris Mason



On Wed, 26 Jul 2000, Daniel Phillips wrote:

 I don't know exactly what ReiserFS does - I just heard Hans mention the term
 'tail merging' and I could see that it was a good idea.
 

I'll give the quick and dirty answer; if people want more details, let me
know.  In 2.2, reiserfs_file_write deals directly with tails.  It appends
to them if there is room in the packed block, or converts them if there
isn't.  If reiserfs_file_write is called with a buffer size > 512 bytes,
it tries to write into full blocks instead of tails.  This limits the
overhead when you cp/untar to create new files.

In both releases, there is locking on the tail to prevent races, and we
don't bother with tails on files > 16k (configurable).

For 2.4, the functions work like this:

reiserfs_get_block converts the tail into its own block and points
the buffer head at the new block.

reiserfs_readpage reads directly from the tail into the page, leaves the
buffer head mapped, and sets b_blocknr to 0.

reiserfs_writepage and reiserfs_prepare_write both check for mapped buffer
heads with a block number of 0 in the page.  If found, they are unmapped.
Then block_write_full_page or block_prepare_write is called.

reiserfs_truncate deals directly with the tail.  If the last block is
packed back into the tail, it is unmapped from the page cache.

reiserfs_file_release will check to see if the tail needs to be repacked,
and use truncate (without changing i_size) to pack the tail.
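
A hedged sketch of the writepage/prepare_write check described above,
using 2.4 buffer conventions (the function name is invented and the
real reiserfs code differs in detail):

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* Before block_write_full_page()/block_prepare_write(), unmap any
     * buffer that reiserfs_readpage left "mapped" with b_blocknr == 0,
     * i.e. data that was copied out of a packed tail. */
    static void unmap_tail_buffers(struct page *page)
    {
        struct buffer_head *bh, *head;

        if (!page->buffers)
            return;
        bh = head = page->buffers;
        do {
            if (buffer_mapped(bh) && bh->b_blocknr == 0)
                clear_bit(BH_Mapped, &bh->b_state);
            bh = bh->b_this_page;
        } while (bh != head);
    }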

-chris






Re: Tailmerging for Ext2

2000-07-26 Thread Alexander Viro



On Wed, 26 Jul 2000, Chris Mason wrote:

 In both releases, there is locking on the tail to prevent races, and we
 don't bother with tails on files > 16k (configurable).

What granularity do you have? (for tail size, that is).




Re: Tailmerging for Ext2

2000-07-26 Thread Chris Mason



On Wed, 26 Jul 2000, Alexander Viro wrote:

 
 
 On Wed, 26 Jul 2000, Chris Mason wrote:
 
  In both releases, there is locking on the tail to prevent races, and we
  don't bother with tails on files > 16k (configurable).
 
 What granularity do you have? (for tail size, that is).
 

From 1 byte to almost the blocksize (4k). But there is a macro for
deciding when to use a tail, which varies it based on the file size.  If
the file is > 12k, it won't have a tail bigger than 1k; an 8k file won't
have a tail bigger than 2k.

Of course, this is just a guess about the right balance between space and
performance...
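
In code terms, a toy version of such a policy might look like this -
the thresholds just echo the examples above, the real reiserfs macro
differs:

    /* Invented toy policy: larger files are allowed smaller tails. */
    static unsigned long max_tail(unsigned long file_size)
    {
        if (file_size > 12 * 1024)
            return 1024;           /* > 12k: tail capped at 1k */
        if (file_size >= 8 * 1024)
            return 2048;           /* 8k-12k: capped at 2k */
        return 4096;               /* small file: up to a full block */
    }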

-chris