Last week I was discussing some issues related to my Tux2 
filesystem project with Stephen Tweedie.  While we were 
discussing something completely unrelated I happened to mention 
an idea I had for reducing disk wastage in Ext2 for small files.
The basic idea is called 'tail merging'.  Tail merging already 
exists in ReiserFS and I thought that it would not be hard to 
add it to Ext2.  If tail merging becomes part of Ext2 then it 
indirectly becomes part of Tux2 and Ext3 as well, since both 
are upward-compatible with Ext2.

Tail merging would replace the 'fragments' feature that is
already implemented in the BSD Fast File System and was planned
for Ext2 but never implemented.  Data fields were reserved in
Ext2 inodes for the bookkeeping that fragments would have needed,
but those fields always contain zeros in current and past 
versions of Ext2.  I plan to use those fields in a different
way, to implement tail merging instead of fragments.

Stephen asked me some sharp questions about how this would work,
and after I answered them to his satisfaction he asked me if I 
would have time to implement this feature.  I said yes, and went 
on to write an initial design document describing the required
modifications to Ext2's handling of inodes, and a prototype
algorithm for doing the tail merging.

I think that tail merging can be implemented quite quickly, and
that the result will be a big improvement for Ext2.  I think 
tail merging will improve Ext2's average storage efficiency by 
somewhere between 15 and 30%.  Currently the worst case for 
Ext2's disk allocation is surprisingly bad: if every file in a 
filesystem is one byte long, you will waste 99.9% of the disk (!).
Tail merging as I defined it will improve this worst case to only 
50% wastage (this worst case occurs when every file on the 
filesystem is exactly blocksize/2+1 bytes long).

The tail merging scheme I have proposed will have little impact
on performance - in fact, it is likely to improve reading speed
somewhat because fewer blocks need to be read.  I don't think
implementing it will be very hard.  All these things taken
together suggest that this is a good idea to try.

I have collected all the emails between Stephen and me and attached
them as an HTML document.

-- 
Daniel
Subject: Re: Tux2 filesystem
Date: Fri, 14 Jul 2000 15:07:17 +0200
From: Daniel Phillips <[EMAIL PROTECTED]>

> > Since there are at minimum 8 inodes in an inode block, a single byte per inode
> > would be entirely adequate for both a full inode number and a 32 bit magic
> > number to help identify a block as an inode block.  I'm looking at the inode
> > layout now and I see the dtime field is always 0 until the inode has been
> > deleted, and you can tell when a node has been deleted.  So, for an active
> > inode, I think it would be a good idea to take a byte from the dtime field for
> > my own use.
>
> The dtime is used by e2fsck, though, to distinguish between deleted
> inodes and inodes which have just become unreferenced due to directory
> problems.

Yes, that is important, but it would be just as easy to zero some other key
field to make this unambiguous.  I'm still looking with greedy eyes at that
field :-)

> But yes, dtime can be used, and ext3 uses it too for a linked-list
> data structure for all inodes in the process of being deleted.

Ah, very good, there's a precedent.  Tux2 probably doesn't need such a wrinkle
because you have a little more control over what constitutes a filesystem
transaction.  So now there are just two things to wonder about: Is the high
byte or the low byte better?  And what to use the other 3 bytes for? :-)

> If you really want some bits right now, reuse the fragment fields,
> which are not and never (I expect!) will be used in ext2.

Yes, after studying the code I came to the conclusion that fragments
might well never be implemented.   On the other hand, I have a rather
interesting idea to propose about how to use the same 32 bits to implement a
"tail merge" feature similar to ReiserFS.

In brief: 16 bits is used to refer to another inode in the same group whose
tail is merged with this one.  The other 16 bits gives the offset of the
beginning of the tail fragment within the final block.  The final block of the
file is also indexed as the final block of the other file.  N files in the same
group can have their tails merged.  The inode tail references form a circular
list, and in fact, the structure is completely symmetric.

The savings from tail merges would be very significant - I haven't got exact
figures just now, but I think (for example) disk usage in my Mozilla source
directory would drop by about 30%, since it consists almost entirely of small
files with average 50% wastage in the final block.

--
Daniel
 
 

Subject: Re: Tux2 filesystem
Date: Sun, 16 Jul 2000 22:33:50 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
 

Hi,

On Fri, Jul 14, 2000 at 03:07:17PM +0200, Daniel Phillips wrote:

> Yes, after studying the code I came to the conclusion that fragments
> might well never be implemented.   On the other hand, I have a rather
> interesting idea to propose about how to use the same 32 bits to implement a
> "tail merge" feature similar to ReiserFS.
>
> In brief: 16 bits is used to refer to another inode in the same group whose
> tail is merged with this one.  The other 16 bits gives the offset of the
> beginning of the tail fragment within the final block.  The final block of the
> file is also indexed as the final block of the other file.  N files in the same
> group can have their tails merged.  The inode tail references form a circular
> list, and in fact, the structure is completely symmetric.

That's OK for storing the tails, but how do you store the free space
information in such a case?  How would you deal with the deletion of a
file which occupies the middle of a group tail block, for example?

Cheers,
Stephen
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tux2 filesystem
Date: Mon, 17 Jul 2000 08:49:46 +0200
Cc: "Me" <[EMAIL PROTECTED]>

On Sun, 16 Jul 2000, Stephen C. Tweedie wrote:
> On Fri, Jul 14, 2000 at 03:07:17PM +0200, Daniel Phillips wrote:
>
> > Yes, after studying the code I came to the conclusion that fragments
> > might well never be implemented.   On the other hand, I have a rather
> > interesting idea to propose about how to use the same 32 bits to implement a
> > "tail merge" feature similar to ReiserFS.
> >
> > In brief: 16 bits is used to refer to another inode in the same group whose
> > tail is merged with this one.  The other 16 bits gives the offset of the
> > beginning of the tail fragment within the final block.  The final block of the
> > file is also indexed as the final block of the other file.  N files in the same
> > group can have their tails merged.  The inode tail references form a circular
> > list, and in fact, the structure is completely symmetric.
>
> That's OK for storing the tails, but how do you store the free space
> information in such a case?  How would you deal with the deletion of a
> file which occupies the middle of a group tail block, for example?

You don't have to store the free space information in the disk image, but you
might want to cache it in memory.  I think you can get away with not even doing
that - see below.

To delete the file you just delete it, taking care not to free the tail block.
This leaves a gap somewhere in the tail block.  That's OK - this can be
squeezed out when it's necessary and convenient to do so.

The only extra thing you have to do in a delete is follow the list around to
find the prev inode and fix it up.

To get a map of the freespace in the shared block you have to look at all the
inodes in the circular list.  This map will only be needed when attempting to
merge tails, but not when reading, writing, rewriting or deleting.  The needed
inodes will often be in the buffer cache, especially if they happen to be on
the same block.  Things can be arranged so that this is likely.

The tail data is not pinned; it is referenced from only one place per file.
This makes compaction very easy.

Reading runs at full speed with this approach because everything you need to
know is in the inode.  Actually, reading will be accelerated because fewer
blocks have to be read.

Appending to a merged file will be slightly slower because you have to find
and update the previous pointer in the circular list (unmerging the file).

Writing speed in general might well stay the same or even improve because,
though you have to update pointers in inodes that are not being written, you
are also writing fewer data blocks.

Fancy things can be done to accelerate the allocation of space in the tail
block.  I think something fairly mindless would work pretty well too, for
example: just examine the next few inodes to see if tail merging can be done,
and compact the tail fragments if necessary.

--

Daniel
 
 

Subject: Re: Tux2 filesystem
Date: Mon, 17 Jul 2000 10:17:19 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
 

Hi,

On Mon, Jul 17, 2000 at 08:49:46AM +0200, Daniel Phillips wrote:
>
> You don't have to store the free space information in the disk image, but you
> might want to cache it in memory.  I think you can get away with not even doing
> that - see below.

Interesting idea.  Got time to implement it?  :-)

Cheers,
Stephen
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Mon, 17 Jul 2000 15:45:36 +0200
Cc: Me <[EMAIL PROTECTED]>

On Mon, 17 Jul 2000, Stephen C. Tweedie wrote:
> On Mon, Jul 17, 2000 at 08:49:46AM +0200, Daniel Phillips wrote:
> >
> > You don't have to store the free space information in the disk image, but you
> > might want to cache it in memory.  I think you can get away with not even doing
> > that - see below.
>
> Interesting idea.  Got time to implement it?  :-)

Yes, actually.  I'm lucky enough to have nothing other than Linux file system
hacking in my job description for the time being.  While I have to concentrate
on getting working code out the door for Tux2, I'm pretty sure I can find time to
try out the tail-merging idea.  As you can see, it doesn't rely on any special
properties of Tux2, and if it works well it will be completely applicable to Ext2
and probably Ext3 as well, barring journalling issues.

I've been hearing some comments about how Ext2, 3 et al are nearing the end of
their useful lives in Linux, mostly based on their not having tail merging,
extents, and B-tree directories.  But you are doing extents, and if I do tail
merging, that's two out of three.  There could be a lot more life left in the
good old filesystem than some people think.

--
Daniel
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Mon, 17 Jul 2000 16:12:34 +0200
Cc: Me <[EMAIL PROTECTED]>

On Mon, 17 Jul 2000, Stephen C. Tweedie wrote:
> Interesting idea.  Got time to implement it?  :-)

Gazing at the ext2_inode struct (once again) I see that 6 bytes are devoted to
fragments:

u32  i_faddr;    /* Fragment address */
u8   h_i_frag;   /* Fragment number */
u8   h_i_fsize;  /* Fragment size */

I only need 4 bytes for the tail merge if I make the merge list pointer
relative to the inode's group.  16 bits is certainly enough for the tail offset.
(Will anyone *ever* want a block size larger than 64K?)

I would break the i_faddr field into two u16's and free the remaining two
bytes for future use.  The result is:

u16  i_tail_share;   /* Tail block share list */
u16  i_tail_offset;  /* Tail block data offset */
u16  l_i_reserved3;  /* Were frag and fsize */

It is essential that i_tail_share==zero be interpreted as "no shared tail"
- otherwise it wouldn't be backward compatible.  A cute way to do this is to
interpret the share pointer as "relative to this inode, wrapping to the
beginning of the group according to inodes_per_group".  This implies that an
inode with a share entry of zero can be interpreted as a circular list of one
element, in other words, it shares its tail block only with itself.  A boundary
condition goes away.

Perhaps this is *too* cute?

---------

I have to confess to not having really understood how fragments were supposed
to work in the first place.  I got a book on xxxBSD and am now somewhat clearer
on the concept.  It seems to be a workable idea (in fact we know it works,
and efficiently) but I think it's fair to say it causes a lot of code
complexity.  It's also not as good at recovering wasted space as my idea,
because fragments have to have a fixed size and you will still waste on average
half that size.

Ext2 as it now stands has a worst-case disk utilization of 0.1 percent,
arising when every file is exactly one byte long.  That is for 1024 byte
blocks and it gets worse for larger block sizes.  Average utilization is 50% if
every file is less than a block long.

Fragments as implemented in BSD improve this somewhat, but only by the amount
that a fragment is smaller than a block.

Tail merging as I described it reduces worst-case wastage to about 50%.
This arises when every file is exactly blocksize/2 + 1 bytes long.  Average
case is very close to zero wastage, assuming the tail packer heuristics are
good.

--
Daniel
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Tue, 18 Jul 2000 10:39:10 +0200
Cc: Me <[EMAIL PROTECTED]>

Yesterday I said:
>
>I would break the i_faddr field into two u16's and free the remaining two
>bytes for future use.  The result is:
>
>        u16 i_tail_share; /* Tail block share list */
>        u16 i_tail_offset; /* Tail block data offset */
>        u16 l_i_reserved3; /* Were frag and fsize */
>
>It is essential that i_tail_share==zero be interpreted as "no shared tail"
>- otherwise it wouldn't be backward compatible.  A cute way to do this is to
>interpret the share pointer as "relative to this inode, wrapping to the
>beginning of the group according to inodes_per_group".  This implies that an
>inode with a share entry of zero can be interpreted as a circular list of one
>element, in other words, it shares its tail block only with itself.  A boundary
>condition goes away.

This has evolved somewhat since yesterday:

1) The tail merge pointer should be signed, 16 bits.  Then no modular
arithmetic is needed and the pointer can cross group boundaries if it
wants to.

2) The merge list should be double-linked.  This makes some operations
involving tail merging (append, delete and in Tux2, rewrite) significantly more
efficient - you don't have to follow the entire list to update a Next pointer.

The new arrangement uses 6 bytes in the on-disk inode, exactly the same as the
current fragment strategy has reserved.  How convenient.  Furthermore, the
current contents of those fields (all zeros) are interpreted by default in my
new scheme as "doubly linked circular list of one element - this inode - with
tail data offset of 0 bytes".  I think this is about as elegant as it gets.
Doubly linked circular lists that never become empty were made in heaven by a
high-ranking angel. :-)

I think I can now proceed to a detailed design and an implementation plan.  My
intuition is that there won't be too many difficulties.  The only slight
irritation I've encountered so far is that (I think) I can't use generic_read
any more.  (VFS doesn't have any concept of data offset within a block.)
The solution for now is to import a modified version of generic_read into
the hacked Ext2 code.  Hopefully something better could be done later if the
prototype works out well.

Now the changed inode struct fields are:

        s16 i_tail_next;   /* Was faddr */
        s16 i_tail_prev;   /* Was faddr */
        u16 i_tail_offset; /* Was frag+fsize */

--
Daniel
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: Stephen C. Tweedie <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Mon, 24 Jul 2000 12:42:42 +0200
Cc: [EMAIL PROTECTED]

A few days ago I said:
> I think I can now proceed to a detailed design and an implementation plan.  My
> intuition is that there won't be too many difficulties.  The only slight
> irritation I've encountered so far is that (I think) I can't use generic_read
> any more.  (VFS doesn't have any concept of data offset within a block.)
> The solution for now is to import a modified version of generic_read into
> the hacked Ext2 code.  Hopefully something better could be done later if the
> prototype works out well.

This turned out to be inaccurate - actually, after a closer look, the current
VFS turned out to be quite friendly to the required modifications.  BTW, do you
know where I can find some notes on the current VFS and cache structure,
especially the reworked page cache, new buffer cache operations and locking
rules for the VFS, buffer cache and page cache?

I copied Victor on this email because I know he's interested in this kind of
thing.

Here is a more-or-less complete description of an initial implementation of a
tail merging feature for Ext2.  It should apply equally well to Ext3 and Tux2.
For now a special merging pass will be performed at mount time, and any write
to a merged tail block will cause the block to be unmerged.  This is
admittedly a poor substitute for incremental merging; nonetheless it
accomplishes the following:

  - it's considerably simpler than an incremental algorithm
  - it's strictly compatible with an incremental version to be done later
  - it's already useful for such things as read-only filesystems
  - it works globally and probably achieves pretty good results
  - it will allow measurement of space savings and performance impact
  - it will allow experimentation with various merging heuristics
  - it will get us started

Unused inode fields totalling six bytes, formerly reserved for implementation
of a BSD-style fragment strategy, are to be "repurposed" for tail merging:

  tail_next (signed 16 bit offset to next inode in tail merge ring)
  tail_prev (signed 16 bit offset to prev inode in tail merge ring)
  tail_offset (unsigned 16 bit offset of tail data fragment in tail block)

It is convenient and important that the current contents of those fields, all
zeros, are consistent with the default state of the new tail merge data:  the
offset of the data in the tail block (the last data block assigned to the file)
is zero, and the next and prev inodes in the ring are the inode itself (pointer
offset of zero) forming a ring containing a single inode.

Signed offsets are used for the next/prev links to save space in the inode
(16 bit fields; we could even use 8 bit fields if necessary).  In any event, we
want to try to merge the tails only of inodes that are numerically close
together, so a limited-range offset is not an onerous limitation.

Implementation Strategy
-----------------------

Reading:

generic_read - needs a new read_actor that accounts for the necessary
data offset in the tail block of a tail-merged file.

Writing:

A write to a merged tail block will cause that inode's tail block to be unmerged:
  - If the ring has only one inode, optionally move the tail data to the beginning
    of the block and set the tail offset to zero, otherwise:
  - Allocate a new block and copy in the tail data.
  - Remove the inode from the merge ring.
  - Concurrency issues/race conditions?

Deleting:

In addition to the normal delete processing, a tail-merged inode needs to be
removed from its merge ring.  Deletion from a double-linked circular list is a
nice, efficient operation and nothing other than that needs to be done.

Truncating:

Truncating of a tail-merged file will cause the file to be un-merged if the
truncation point is before the beginning of the tail block.  If truncation to a
length past the end of a file is allowed (I don't know, is it?) then the
easiest (and probably most efficient) thing to do in this case is also to
unmerge the file.  If the truncation point lies within the tail fragment then
nothing special needs to be done.

Tail merge pass at mount time:

Here is a relatively simple, non-incremental algorithm.  It makes one pass
through all inodes at mount time, trying to merge as many tails as possible.
We keep a working list of merge rings that we have already seen and that have
more than some threshold amount of slack space available in the tail block.

Each entry in the working list has the following fields:

  next ring in the working list
  prev ring in the working list
  inode (any inode in the ring)
  lowest inode in ring
  highest inode in ring
  slack size (total free space in tail block)

The "center" of a ring is the number halfway between the lowest and highest
inode number.  The working list will be kept sorted by the distance of each
ring's center from the center of the "current" ring; thus when looking for a
merge candidate we will choose the "closest" ring that has enough tail slack.
A practical result of this is that sequences of many small files with adjacent
inode numbers will tend to be merged together into a single block, which is
exactly what we want.

Algorithm:

  For each (allocated) inode:

     (Each inode is a member of a tail merge ring consisting of one or more
     inodes.)  Traverse the current ring starting with this inode.  If any link
     points to an inode lower-numbered than the current one then we have
     already seen this ring and we are finished with this inode.  Otherwise,
     accumulate the ring statistics: total tail size; highest numbered inode.
     While doing this, make a map of space allocation in the tail block.
     Afterwards, use the map to check that the tail block is densely packed, and
     if it isn't, repack the tail block and adjust the tail offsets of the ring
     inodes accordingly.  (Now we can be sure that the tail block of any ring
     in the working list is densely packed and thereby avoid the need to map
     the tail block allocation again later.)

     (Try to merge the ring with a ring in the working list.)  Working through
     the rings in the working list in sorted order:

        If a ring in the working list has enough tail slack space, merge the
        current ring with it:

          Copy in the tail fragments from the tail block of the current inode.
          Free the tail block of the current ring and set the tail pointers to
          the tail block of the working list ring (not necessarily in that
          order).

          Then let:

             pred1 be the current inode
             succ1 be the successor inode in pred1's tail ring
             pred2 be an inode from the working list ring
             succ2 be the successor inode to pred2

          and set:

             pred1.next = succ2
             succ2.prev = pred1
             pred2.next = succ1
             succ1.prev = pred2

          Update the statistics for the working list ring (remaining tail slack;
          lowest and highest inode numbers) unless slack size has fallen to some
          threshold (possibly zero).  In that case, remove the ring from the
          working list; otherwise re-sort the list.

    If it was not possible to merge the current ring then add it to the
    working list in the proper sorted location, setting the tail slack and
    lowest/highest inode numbers.

    (drop distant rings from the working list)

This merge pass is intended to be run optionally at filesystem mount time.
This is not as good as an incremental merge but it will answer three important
questions: (1) How much space is saved by tail merging in real filesystems?
(2) Are filesystem operations still stable when merged tail blocks are
present?  (3) What is the performance impact?

Later, after things are stable, the one-time merge pass will be replaced by
an incremental merge performed at file close time.  The merge pass will still
remain useful as a means of checking that an incremental merging algorithm is
close to optimal, and also for one-time optimal packing of files in a
read-only file system.

--
Daniel
 
 

From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C.Tweedie" <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED]
Subject: Re: Tail Merging

Hi,

On Mon, Jul 24, 2000 at 12:42:42PM +0200, Daniel Phillips wrote:

> This turned out to be inaccurate - actually, after a closer look, the current
> VFS turned out to be quite friendly to the required modifications.  BTW, do you
> know where I can find some notes on the current VFS and cache structure,
> especially the reworked page cache, new buffer cache operations and locking
> rules for the VFS, buffer cache and page cache?

There aren't any, anywhere.  The closest to documentation is the
source of the existing NFS and ext2 filesystems --- those two
filesystems tend to be the simplest, most actively-maintained examples
of how to use the VFS layer properly.

> For now a special  merging pass will be performed at mount time, and any write
> to a merged tail block will  cause the block to be unmerged.  This is
> admittedly a poor substitute for incremental merging, nonetheless it
> accomplishes the following:
>
>   - it's considerably simpler than an incremental algorithm
>   - it's strictly compatible with an incremental version to be done later
>   - it's already useful for such things as read-only filesystems
>   - it works globally and probably achieves pretty good results
>   - it will allow measurement of space savings and performance impact
>   - it will allow experimentation with various merging heuristics
>   - it will get us started

But what about the speed of mount?  Especially with a transactional
filesystem, you really don't want the mount to become any slower.

> Reading:
>
> generic_read - needs a new read_actor that accounts for the necessary
> data offset in the tail block of a tail-merged file.

You shouldn't need a new read actor --- nor do you want one (you want
the sendfile actor to continue to work, for example).  You should be
able to get this to work just inside ext2_readpage, by reading the
data into the buffer cache and then copying to the page cache for the
special case of the last page.

> Later, after things are stable, the one-time merge pass will be replaced by
> an  incremental merge performed at file close time.

Oh, OK, that makes much more sense!

Cheers,
 Stephen
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:06:59 +0200

On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
> > For now a special  merging pass will be performed at mount time, and any write
> > to a merged tail block will  cause the block to be unmerged.  This is
> > admittedly a poor substitute for incremental merging, nonetheless it
> > accomplishes the following:
> >
> >   - it's considerably simpler than an incremental algorithm
> >   - it's strictly compatible with an incremental version to be done later
> >   - it's already useful for such things as read-only filesystems
> >   - it works globally and probably achieves pretty good results
> >   - it will allow measurement of space savings and performance impact
> >   - it will allow experimentation with various merging heuristics
> >   - it will get us started
>
> But what about the speed of mount?  Especially with a transactional
> filesystem, you really don't want the mount to become any slower.

Of course.  It's just a way to get the basics in place while I design a decent
incremental algorithm.

> > generic_read - needs a new read_actor that accounts for the necessary
> > data offset in the tail block of a tail-merged file.
>
> You shouldn't need a new read actor --- nor do you want one (you want
> the sendfile actor to continue to work, for example).  You should be
> able to get this to work just inside ext2_readpage, by reading the
> data into the buffer cache and then copying to the page cache for the
> special case of the last page.

OK, that sounds better.  Next question: are there any docs on the way the
buffer cache relates to the page cache?  I've begun to decode it from the
2.4.0-xxxx source, but it's slow going.

--
Daniel
 

From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging

Hi,

On Tue, Jul 25, 2000 at 07:06:59PM +0200, Daniel Phillips wrote:

> OK, that sounds better.  Next question: are there any docs on the way the
> buffer cache relates to the page cache?  I've begun to decode it from the
> 2.4.0-xxxx source, but it's slow going.

No, none.  Again, it's read-the-source.

Basically, though, they are independent, but they can share the same
page.  The only tricky bits involve cleaning up the buffers on a page
during truncate.

You don't _have_ to use the same memory for buffer and page cache if
you don't want to.  You can easily use the normal generic_read_page
stuff for most of the pages, but for the tail page just read into the
buffer cache and then copy to the page cache.

Cheers,
 Stephen
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:12:03 +0200

On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
>
> On Mon, Jul 24, 2000 at 12:42:42PM +0200, Daniel Phillips wrote:
>
> > BTW, do you
> > know where I can find some notes on the current VFS and cache structure,
> > especially the reworked page cache, new buffer cache operations and locking
> > rules for the VFS, buffer cache and page cache?
>
> There aren't any, anywhere.  The closest to documentation is the
> source of the existing NFS and ext2 filesystems --- those two
> filesystems tend to be the simplest, most actively-maintained examples
> of how to use the VFS layer properly.

I was afraid you'd say that.  Do you mind if I ask you a whole bunch of
questions about it then?  I have a volunteer who's indicated willingness to
collate the answers into a locking primer for the new VFS - does that sound
interesting?

And, who should the other victims of these questions be?  My guess is:

  - Ted Ts'o
  - Alex Viro
  - Alan Cox
  - Linus?
  - Others?

--
Daniel
 
 

From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <00072413184801.04767@gimli> <[EMAIL PROTECTED]> <0007251916270I.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <0007251916270I.04767@gimli>; from [EMAIL PROTECTED] on Tue, Jul 25, 2000 at 07:12:03PM +0200
Status: R
X-Status: N

Hi,

On Tue, Jul 25, 2000 at 07:12:03PM +0200, Daniel Phillips wrote:
>
> I was afraid you'd say that.  Do you mind if I ask you a whole bunch of
> questions about it then?  I have a volunteer who's indicated willingness to
> collate the answers into a locking primer for the new VFS - does that sound
> interesting?

Sure, but I'm only here for 2 days more before heading off on holiday
for 2 weeks.

> And, who should the other victims of these questions be?  My guess is:
>
>   - Ted Ts'o
>   - Alex Viro

Yes.

>   - Alan Cox
>   - Linus?

No.

>   - Others?

linux-fsdevel@vger?

Cheers,
 Stephen
 
 

From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:19:39 +0200
X-Mailer: KMail [version 1.0.28]
Content-Type: text/plain
References: <[EMAIL PROTECTED]> <0007251911570H.04767@gimli> <[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <0007251936340J.04767@gimli>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S

On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
> On Tue, Jul 25, 2000 at 07:06:59PM +0200, Daniel Phillips wrote:
>
> > OK, that sounds better.  Next question: are there any docs on the way the
> > buffer cache relates to the page cache?  I've begun to decode it from the
> > 2.4.0-xxxx source, but it's slow going.
>
> No, none.  Again, it's read-the-source.
>
> Basically, though, they are independent, but they can share the same
> page.

OK.

> The only tricky bits involve cleaning up the buffers on a page
> during truncate.

Ah, while you're talking about truncate - I'm not exactly clear on whether
a truncate is supposed to be able to work concurrently with a read and/or write
in 2.4.0, or whether each operation is expected to run to completion before
another can start.  It looks to me like all operations were supposed to work
concurrently in early versions of Ext2, but then some locking was added to
simplify things.

> You don't _have_ to use the same memory for buffer and page cache if
> you don't want to.  You can easily use the normal generic_read_page
> stuff for most of the pages, but for the tail page just read into the
> buffer cache and then copy to the page cache.

OK.

Should I be posting these questions on fsdevel instead of via private email?

--
Daniel
 
 

From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <0007251911570H.04767@gimli> <[EMAIL PROTECTED]> <0007251936340J.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <0007251936340J.04767@gimli>; from [EMAIL PROTECTED] on Tue, Jul 25, 2000 at 07:43:15PM +0200
Status: R
X-Status: N

Hi,

On Tue, Jul 25, 2000 at 07:43:15PM +0200, Daniel Phillips wrote:

> Ah, while you're talking about truncate - I'm not exactly clear on whether
> a truncate is supposed to be able to work concurrently with a read and/or write
> in 2.4.0, or whether each operation is expected to run to completion before
> another can start.  It looks to me like all operations were supposed to work
> concurrently in early versions of Ext2, but then some locking was added to
> simplify things.

Some has been added.  In particular, there are semaphores which
protect against things that change the i_size, such as writes and
truncates.  You can still write using mmap() during a truncate,
though.

> Should I be posting these questions on fsdevel instead of via private email?

Definitely!  CC: me if you want, though.

Cheers,
 Stephen
 
 
