Last week I was discussing some issues related to my Tux2 filesystem project with Stephen Tweedie. While we were discussing something completely unrelated, I happened to mention an idea I had for reducing disk wastage in Ext2 for small files. The basic idea is called 'tail merging'. Tail merging already exists in ReiserFS, and I thought that it would not be hard to add it to Ext2. If tail merging becomes part of Ext2 then it indirectly becomes part of Tux2 and Ext3 as well, since both are upward-compatible with Ext2.

Tail merging would replace the 'fragments' feature that is already implemented in the BSD Fast File System and was planned for Ext2 but never implemented. Data fields were reserved in Ext2 inodes for the bookkeeping that fragments would have needed, but those fields always contain zeros in current and past versions of Ext2. I plan to use those fields in a different way, to implement tail merging instead of fragments.

Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext2's handling of inodes, and a prototype algorithm for doing the tail merging.

I think that tail merging can be implemented quite quickly, and that the result will be a big improvement for Ext2. I think tail merging will improve Ext2's average storage efficiency by somewhere between 15 and 30%. Currently the worst case for Ext2's disk allocation is surprisingly bad: if every file in a filesystem is one byte long, you will waste 99.9% of the disk (!). Tail merging as I defined it will improve this worst case to only 50% wastage (this worst case occurs when every file on the filesystem is exactly blocksize/2+1 bytes long).
The tail merging scheme I have proposed will have little impact on efficiency - in fact, it is likely to improve reading speed somewhat because fewer blocks need to be read. I don't think implementing it will be very hard. All these things taken together suggest that this is a good idea to try. I have collected all the emails between me and Stephen and attached them as an html document.

-- Daniel

Subject: Re: Tux2 filesystem
Date: Fri, 14 Jul 2000 15:07:17 +0200
From: Daniel Phillips <[EMAIL PROTECTED]>
> > Since there are at minimum 8 inodes in an inode block, a single byte per inode
> > would be entirely adequate for both a full inode number and a 32 bit magic
> > number to help identify a block as an inode block. I'm looking at the inode
> > layout now and I see the dtime field is always 0 until the inode has been
> > deleted, and you can tell when a node has been deleted. So, for an active
> > inode, I think it would be a good idea to take a byte from the dtime field for
> > my own use.
>
> The dtime is used by e2fsck, though, to distinguish between deleted
> inodes and inodes which have just become unreferenced due to directory
> problems.
Yes, that is important, but it would be just as easy to zero some other key
field to make this unambiguous. I'm still looking with greedy eyes at that
field :-)
> But yes, dtime can be used, and ext3 uses it too for a linked-list
> data structure for all inodes in the process of being deleted.
Ah, very good, there's a precedent. Tux2 probably doesn't need such a wrinkle
because you have a little more control over what constitutes a filesystem
transaction. So now there are just two things to wonder about: Is the high
byte or the low byte better? And what to use the other 3 bytes for? :-)
> If you really want some bits right now, reuse the fragment fields,
> which are not and never (I expect!) will be used in ext2.
Yes, after studying the code I came to the conclusion that fragments might
well never be implemented. On the other hand, I have a rather interesting
idea to propose about how to use the same 32 bits to implement a
"tail merge" feature similar to ReiserFS.
In brief: 16 bits is used to refer to another inode in the same group whose
tail is merged with this one. The other 16 bits gives the offset of the
beginning of the tail fragment within the final block. The final block of the
file is also indexed as the final block of the other file. N files in the same
group can have their tails merged. The node tail references form a circular
list, and in fact, the structure is completely symmetric.
The savings from tail merges would be very significant - I haven't got exact
figures just now, but I think (for example) disk usage in my Mozilla source
directory would drop by about 30%, since it consists almost entirely of small
files with average 50% wastage in the final block.
--
Daniel
Subject: Re: Tux2 filesystem
Date: Sun, 16 Jul 2000 22:33:50 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Hi,
On Fri, Jul 14, 2000 at 03:07:17PM +0200, Daniel Phillips wrote:
> Yes, after studying the code I came to the conclusion that fragments might
> well never be implemented. On the other hand, I have a rather interesting
> idea to propose about how to use the same 32 bits to implement a
> "tail merge" feature similar to ReiserFS.
>
> In brief: 16 bits is used to refer to another inode in the same group whose
> tail is merged with this one. The other 16 bits gives the offset of the
> beginning of the tail fragment within the final block. The final block of the
> file is also indexed as the final block of the other file. N files in the same
> group can have their tails merged. The node tail references form a circular
> list, and in fact, the structure is completely symmetric.
That's OK for storing the tails, but how do you store the free space
information in such a case? How would you deal with the deletion of a
file which occupies the middle of a group tail block, for example?
Cheers,
Stephen
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tux2 filesystem
Date: Mon, 17 Jul 2000 08:49:46 +0200
X-Mailer: KMail [version 1.0.21]
Content-Type: text/plain
References: <[EMAIL PROTECTED]>
Cc: "Me" <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <00071709590800.00599@fronius>
Content-Transfer-Encoding: 8bit
Status: R
X-Status: N
On Sun, 16 Jul 2000, Stephen C. Tweedie wrote:
> On Fri, Jul 14, 2000 at 03:07:17PM +0200, Daniel Phillips wrote:
>
> > Yes, after studying the code I came to the conclusion that fragments might
> > well never be implemented. On the other hand, I have a rather interesting
> > idea to propose about how to use the same 32 bits to implement a
> > "tail merge" feature similar to ReiserFS.
> >
> > In brief: 16 bits is used to refer to another inode in the same group whose
> > tail is merged with this one. The other 16 bits gives the offset of the
> > beginning of the tail fragment within the final block. The final block of the
> > file is also indexed as the final block of the other file. N files in the same
> > group can have their tails merged. The node tail references form a circular
> > list, and in fact, the structure is completely symmetric.
>
> That's OK for storing the tails, but how do you store the free space
> information in such a case? How would you deal with the deletion of a
> file which occupies the middle of a group tail block, for example?
You don't have to store the free space information in the disk image, but you
might want to cache it in memory. I think you can get away with not even doing
that - see below.
To delete the file you just delete it, taking care not to free the tail
block. This leaves a gap somewhere in the tail block. That's OK - this can be
squeezed out when it's necessary and convenient to do so.

The only extra thing you have to do in a delete is follow the list around to
find the prev inode and fix it up.
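The delete path just described can be sketched as follows, assuming (as in the proposal at this stage) a singly linked ring; the `next` array is a toy stand-in for the per-inode share pointers:

```c
#include <assert.h>
#include <stdint.h>

/* Deleting inode d from its tail ring: follow the list around to find the
 * predecessor, relink past d, and leave the shared tail block untouched
 * (the gap left behind can be squeezed out later). */
static void ring_remove(uint16_t *next, uint16_t d)
{
    uint16_t p = d;
    while (next[p] != d)      /* walk the ring to find d's predecessor */
        p = next[p];
    next[p] = next[d];        /* unlink d */
    next[d] = d;              /* d now shares its tail only with itself */
}
```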
To get a map of the freespace in the shared block you have to look at all the
inodes in the circular list. This map will only be needed when attempting to
merge tails, but not when reading, writing, rewriting or deleting. The needed
inodes will often be in the buffer cache, especially if they happen to be on
the same block. Things can be arranged so that this is likely.

The tail data is not pinned; it is referenced from only one place per file.
This makes compaction very easy.
Reading runs at full speed with this approach because everything you need to
know is in the inode. Actually, reading will be accelerated because fewer
blocks have to be read.

Appending to a merged file will be slightly slower because you have to find
and update the previous pointer in the circular list (unmerging the file).
Writing speed in general might stay the same or even improve because, though
you have to update pointers in inodes that are not being written, you are also
writing fewer data blocks.
Fancy things can be done to accelerate the allocation of space in the tail
block. I think something fairly mindless would work pretty well too, for
example: just examine the next few inodes to see if tail merging can be done,
and compact the tail fragments if necessary.
--
Daniel
Subject: Re: Tux2 filesystem
Date: Mon, 17 Jul 2000 10:17:19 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Hi,
On Mon, Jul 17, 2000 at 08:49:46AM +0200, Daniel Phillips wrote:
>
> You don't have to store the free space information in the disk image, but you
> might want to cache it in memory. I think you can get away with not even doing
> that - see below.
Interesting idea. Got time to implement it? :-)
Cheers,
Stephen
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Mon, 17 Jul 2000 15:45:36 +0200
X-Mailer: KMail [version 1.0.21]
Content-Type: text/plain
References: <[EMAIL PROTECTED]>
Cc: Me <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <00071716065503.00599@fronius>
Content-Transfer-Encoding: 8bit
Status: R
X-Status: N
On Mon, 17 Jul 2000, Stephen C. Tweedie wrote:
> On Mon, Jul 17, 2000 at 08:49:46AM +0200, Daniel Phillips wrote:
> >
> > You don't have to store the free space information in the disk image, but you
> > might want to cache it in memory. I think you can get away with not even doing
> > that - see below.
>
> Interesting idea. Got time to implement it? :-)
Yes, actually. I'm lucky enough to have nothing other than Linux file system
hacking in my job description for the time being. While I have to concentrate
on getting working code out the door for Tux2, I'm pretty sure I can get time
to try out the tail-merging idea. As you can see, it doesn't rely on any
special properties of Tux2, and if it works well it will be completely
applicable to Ext2 and probably Ext3 as well, barring journalling issues.
I've been hearing some comments about how Ext2, 3 et al are nearing the end of
their useful lives in Linux, mostly based on not having tail merging, extents,
and B-tree directories. But you are doing extents, and if I do tail merging,
that's two out of three. There could be a lot more life left in the good old
filesystem than some people think.
--
Daniel
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Mon, 17 Jul 2000 16:12:34 +0200
X-Mailer: KMail [version 1.0.21]
Content-Type: text/plain
References: <[EMAIL PROTECTED]>
Cc: Me <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <00071716065503.00599@fronius>
Content-Transfer-Encoding: 8bit
Status: R
X-Status: N
On Mon, 17 Jul 2000, Stephen C. Tweedie wrote:
> Interesting idea. Got time to implement it? :-)
Gazing at the ext2_inode struct (once again) I see that 6 bytes are devoted to
fragments:

	u32  i_faddr;     /* Fragment address */
	u8   h_i_frag;    /* Fragment number */
	u8   h_i_fsize;   /* Fragment size */
I only need 4 bytes for the tail merge if I make the merge list pointer
relative to the inode's group. 16 bits is certainly enough for the tail
offset. (Will anyone *ever* want a block size larger than 64K?)
I would break the i_faddr field into two u16's and free the remaining two
bytes for future use. The result is:

	u16  i_tail_share;    /* Tail block share list */
	u16  i_tail_offset;   /* Tail block data offset */
	u16  l_i_reserved3;   /* Were frag and fsize */
It is essential that i_tail_share==zero be interpreted as "no shared tail"
- otherwise it wouldn't be backward compatible. A cute way to do this is to
interpret the share pointer as "relative to this inode, wrapping to the
beginning of the group according to inodes_per_group". This implies that an
inode with a share entry of zero can be interpreted as a circular list of one
element; in other words, it shares its tail block only with itself. A boundary
condition goes away.

Perhaps this is *too* cute?
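The wrapping interpretation above amounts to one line of arithmetic. A small sketch (nothing Ext2-specific; the function name is mine):

```c
#include <stdint.h>

/* A stored share value of s means "s inodes ahead of this one, wrapping at
 * the group boundary", so s == 0 decodes to the inode itself: a ring of one,
 * and freshly zeroed fragment fields stay backward compatible for free. */
static uint32_t share_target(uint32_t inode_in_group, uint16_t share,
                             uint32_t inodes_per_group)
{
    return (inode_in_group + share) % inodes_per_group;
}
```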
---------
I have to confess to not having really understood how fragments were supposed
to work in the first place. I got a book on xxxBSD and am now somewhat clearer
on the concept. It seems to be a workable idea (in fact we know it works, and
efficiently) but I think it's fair to say it causes a lot of code complexity.
It's also not as good at recovering wasted space as my idea, because fragments
have to have a fixed size and you will still waste on average half that size.
Ext2 as it now stands has a worst-case disk utilization of .1 percent, arising
when every file is exactly one byte long. That is for 1024 byte blocks, and it
gets worse for larger block sizes. Average utilization is 50% if every file is
less than a block long.

Fragments as implemented in BSD improve this somewhat, but only by the amount
that a fragment is smaller than a block.
Tail merging as I described it reduces worst-case wastage to about 50%. This
arises when every file is exactly blocksize/2 + 1 bytes long. Average case is
very close to zero wastage, assuming the tail packer heuristics are good.
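The figures above are easy to check with a few lines of arithmetic (my own sketch, assuming whole-block allocation and identical file sizes):

```c
/* Fraction of allocated space wasted when every file has the same length,
 * with whole-block allocation and no tail merging. */
static double wasted_fraction(int file_bytes, int block_bytes)
{
    int used = ((file_bytes + block_bytes - 1) / block_bytes) * block_bytes;
    return (double)(used - file_bytes) / used;
}
```

For one-byte files and 1024-byte blocks this gives 1023/1024, roughly 99.9% wastage; for files of blocksize/2 + 1 bytes it gives just under 50%, which is the worst case that survives tail merging, since two such tails cannot fit in one block.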
--
Daniel
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Tail Merging
Date: Tue, 18 Jul 2000 10:39:10 +0200
X-Mailer: KMail [version 1.0.21]
Content-Type: text/plain
References: <[EMAIL PROTECTED]>
Cc: Me <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <00071716065503.00599@fronius>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S
Yesterday I said:
>
>I would break the i_faddr field into two u16's and free the remaining two
>bytes for future use. The result is:
>
>	u16  i_tail_share;    /* Tail block share list */
>	u16  i_tail_offset;   /* Tail block data offset */
>	u16  l_i_reserved3;   /* Were frag and fsize */
>
>It is essential that i_tail_share==zero be interpreted as "no shared tail"
>- otherwise it wouldn't be backward compatible. A cute way to do this is to
>interpret the share pointer as "relative to this inode, wrapping to the
>beginning of the group according to inodes_per_group". This implies that an
>inode with a share entry of zero can be interpreted as a circular list of one
>element; in other words, it shares its tail block only with itself. A boundary
>condition goes away.
This has evolved somewhat since yesterday:

1) The tail merge pointer should be signed, 16 bits. Then no modular
arithmetic is needed and the pointer can cross group boundaries if it wants to.

2) The merge list should be double-linked. This makes some operations
involving tail merging (append, delete and in Tux2, rewrite) significantly more
efficient - you don't have to follow the entire list to update a Next pointer.
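With both links, unlinking an inode from its ring becomes constant time. A toy sketch (arrays stand in for the per-inode next/prev fields; names are mine):

```c
#include <stdint.h>

/* Constant-time unlink from a doubly linked ring: no walk around the ring
 * is needed to find the predecessor, unlike the singly linked version. */
static void ring_unlink(uint16_t *next, uint16_t *prev, uint16_t d)
{
    next[prev[d]] = next[d];
    prev[next[d]] = prev[d];
    next[d] = prev[d] = d;    /* d becomes a ring of one */
}
```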
The new arrangement uses 6 bytes in the on-disk inode, exactly the same as the
current fragment strategy has reserved. How convenient. Furthermore, the
current contents of those fields (all zeros) is interpreted by default in my
new scheme as "doubly linked circular list of one element - this inode - with
tail data offset of 0 bytes". I think this is about as elegant as it gets.
Doubly linked circular lists that never become empty were made in heaven by a
high-ranking angel. :-)
I think I can now proceed to a detailed design and an implementation plan. My
intuition is that there won't be too many difficulties. The only slight
irritation I've encountered so far is that (I think) I can't use generic_read
any more. (VFS doesn't have any concept of data offset within a block.) The
solution for now is to import a modified version of generic_read into the
hacked Ext2 code. Hopefully something better could be done later if the
prototype works out well.
Now the changed inode struct fields are:

	s16  i_tail_next;     /* Was faddr */
	s16  i_tail_prev;     /* Was faddr */
	u16  l_tail_offset;   /* Was frag+fsize */
--
Daniel
From: Daniel Phillips <[EMAIL PROTECTED]>
To: Stephen C. Tweedie <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Mon, 24 Jul 2000 12:42:42 +0200
X-Mailer: KMail [version 1.0.28]
Content-Type: text/plain
References: <[EMAIL PROTECTED]> <00071716065503.00599@fronius>
In-Reply-To: <00071716065503.00599@fronius>
Cc: [EMAIL PROTECTED]
MIME-Version: 1.0
Message-Id: <00072413184801.04767@gimli>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S
A few days ago I said:

> I think I can now proceed to a detailed design and an implementation plan. My
> intuition is that there won't be too many difficulties. The only slight
> irritation I've encountered so far is that (I think) I can't use generic_read
> any more. (VFS doesn't have any concept of data offset within a block.) The
> solution for now is to import a modified version of generic_read into the
> hacked Ext2 code. Hopefully something better could be done later if the
> prototype works out well.
This turned out to be inaccurate - actually, after a closer look, the current
VFS turned out to be quite friendly to the required modifications. BTW, do you
know where I can find some notes on the current VFS and cache structure,
especially the reworked page cache, new buffer cache operations and locking
rules for the VFS, buffer cache and page cache?

I copied Victor on this email because I know he's interested in this kind of
thing.
Here is a more-or-less complete description of an initial implementation of a
tail merging feature for Ext2. It should apply equally well to Ext3 and Tux2.

For now a special merging pass will be performed at mount time, and any write
to a merged tail block will cause the block to be unmerged. This is admittedly
a poor substitute for incremental merging, nonetheless it accomplishes the
following:

	- it's considerably simpler than an incremental algorithm
	- it's strictly compatible with an incremental version to be done later
	- it's already useful for such things as read-only filesystems
	- it works globally and probably achieves pretty good results
	- it will allow measurement of space savings and performance impact
	- it will allow experimentation with various merging heuristics
	- it will get us started
Unused inode fields totalling six bytes formerly reserved for implementation
of a BSD-style fragment strategy are to be "repurposed" for tail merging:

	tail_next    (signed 16 bit offset to next inode in tail merge ring)
	tail_prev    (signed 16 bit offset to prev inode in tail merge ring)
	tail_offset  (unsigned 16 bit offset of tail data fragment in tail block)
It is convenient and important that the current contents of those fields, all
zeros, are consistent with the default state of the new tail merge data: the
offset of the data in the tail block (the last data block assigned to the file)
is zero, and the next and prev inodes in the ring are the inode itself (pointer
offset of zero), forming a ring containing a single inode.

Signed offsets are used for the next/prev links to save space in the inode
(16 bit fields; we could even use 8 bit fields if necessary). In any event, we
want to try to merge the tails only of inodes that are numerically close
together, so a limited range offset is not an onerous limitation.
Implementation Strategy
-----------------------
Reading:

generic_read - needs a new read_actor that accounts for the necessary
data offset in the tail block of a tail-merged file.

Writing:

A write to a merged tail block will cause that inode's tail block to be
unmerged:

	- If the ring has only one inode, optionally move the tail data to the
	  beginning of the block and set the tail offset to zero, otherwise:
	- Allocate a new block and copy in the tail data.
	- Remove the inode from the merge ring.
	- Concurrency issues/race conditions?
Deleting:

In addition to the normal delete processing, a tail-merged inode needs to be
removed from its merge ring. Deletion from a double-linked circular list is a
nice, efficient operation and nothing other than that needs to be done.
Truncating:

Truncating of a tail-merged file will cause the file to be un-merged if the
truncation point is before the beginning of the tail block. If truncation to a
length past the end of a file is allowed (I don't know, is it?) then the
easiest (and probably most efficient) thing to do in this case is also to
unmerge the file. If the truncation point lies within the tail fragment then
nothing special needs to be done.
Tail merge pass at mount time:

Here is a relatively simple, non-incremental algorithm. It makes one pass
through all inodes at mount time, trying to merge as many tails as possible.

We keep a working list of merge rings that we have already seen and that have
more than some threshold amount of slack space available in the tail block.
Each entry in the working list has the following fields:

	next ring in the working list
	prev ring in the working list
	inode (any inode in the ring)
	lowest inode in ring
	highest inode in ring
	slack size (total free space in tail block)

The "center" of a ring is the number halfway between the lowest and highest
inode number. The working list will be kept sorted by the distance of ring
center from the center of the "current" ring, thus when looking for a merge
candidate we will choose the "closest" ring that has enough tail slack. A
practical result of this is that sequences of many small files with adjacent
inode numbers will tend to be merged together into a single block, which is
exactly what we want.
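A sketch of the working-list entry with the fields listed above, plus the center/distance arithmetic used for the sort (names are illustrative, not from any planned header):

```c
#include <stdint.h>

/* One working-list entry per merge ring already seen. */
struct ring_entry {
    struct ring_entry *next, *prev;  /* links within the working list */
    uint32_t any_inode;              /* any inode belonging to the ring */
    uint32_t lo_inode, hi_inode;     /* lowest and highest inode in the ring */
    uint32_t slack;                  /* free bytes left in the tail block */
};

/* "Center" of a ring: halfway between its lowest and highest inode number. */
static uint32_t ring_center(const struct ring_entry *r)
{
    return (r->lo_inode + r->hi_inode) / 2;
}

/* Sort key: distance between ring centers, so the "closest" ring with
 * enough slack is tried first. */
static uint32_t ring_distance(const struct ring_entry *a,
                              const struct ring_entry *b)
{
    uint32_t ca = ring_center(a), cb = ring_center(b);
    return ca > cb ? ca - cb : cb - ca;
}
```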
Algorithm:

For each (allocated) inode:

	(Each inode is a member of a tail merge ring consisting of one or more
	inodes.) Traverse the current ring starting with this inode. If any link
	points to an inode lower-numbered than the current one then we have
	already seen this ring and we are finished with this inode. Otherwise,
	accumulate the ring statistics: total tail size; highest numbered inode.
	While doing this, make a map of space allocation in the tail node.
	Afterwards, use the map to check that the tail node is densely packed,
	and if it isn't, repack the tail node and adjust the tail offsets of the
	ring inodes accordingly. (Now we can be sure that the tail node of any
	ring in the working list is densely packed and thereby avoid the need to
	map the tail node allocation again later.)

	(Try to merge the ring with a ring in the working list.) Working through
	the rings in the working list in sorted order: If a ring in the working
	list has enough tail slack space, merge the current ring with it:

		Copy in the tail fragments from the tail block of the current
		inode. Free the tail block of the current ring and set the tail
		pointers to the tail block of the working list ring (not
		necessarily in that order).

		Then let:
			pred1 be the current inode
			succ1 be the successor inode in pred1's tail ring
			pred2 be an inode from the working list ring
			succ2 be the successor inode to pred2
		and set:
			pred1.next = succ2
			succ2.prev = pred1
			pred2.next = succ1
			succ1.prev = pred2

		Update the statistics for the working list ring (remaining tail
		slack; lowest and highest inode numbers) unless slack size has
		fallen to some threshold (possibly zero). In that case, remove
		the ring from the working list, otherwise resort the list.

	If it was not possible to merge the current ring then add it to the
	working list in the proper sorted location, setting the tail slack and
	lowest/highest inode numbers.

	(Drop distant rings from the working list.)
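The pred/succ pointer exchange in the merge step above is the classic splice of two doubly linked rings into one. A toy version (arrays stand in for the per-inode next/prev fields):

```c
#include <stdint.h>

/* Splice the ring containing pred2 into the ring containing pred1 by
 * exchanging their next links and repairing the prev links, exactly the
 * four assignments in the algorithm above. */
static void ring_splice(uint16_t *next, uint16_t *prev,
                        uint16_t pred1, uint16_t pred2)
{
    uint16_t succ1 = next[pred1];
    uint16_t succ2 = next[pred2];
    next[pred1] = succ2;  prev[succ2] = pred1;
    next[pred2] = succ1;  prev[succ1] = pred2;
}
```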
This merge pass is intended to be run optionally at filesystem mount time.
This is not as good as an incremental merge but it will answer three important
questions: (1) How much space is saved by tail merging in real filesystems?
(2) Are filesystem operations still stable when merged tail blocks are
present? (3) What is the performance impact?
Later, after things are stable, the one-time merge pass will be replaced by
an incremental merge performed at file close time. The merge pass will still
remain useful as a means of checking that an incremental merging algorithm is
close to optimal, and also for one-time optimal packing of files in a
read-only file system.
--
Daniel
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C.Tweedie" <[EMAIL PROTECTED]>,
[EMAIL PROTECTED]
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <00071716065503.00599@fronius>
<00072413184801.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <00072413184801.04767@gimli>; from [EMAIL PROTECTED]
on Mon, Jul 24, 2000 at 12:42:42PM +0200
Sender: [EMAIL PROTECTED]
Status: R
X-Status: N
Hi,
On Mon, Jul 24, 2000 at 12:42:42PM +0200, Daniel Phillips wrote:
> This turned out to be inaccurate - actually, after a closer look, the current
> VFS turned out to be quite friendly to the required modifications. BTW, do you
> know where I can find some notes on the current VFS and cache structure,
> especially the reworked page cache, new buffer cache operations and locking
> rules for the VFS, buffer cache and page cache?
There aren't any, anywhere. The closest to documentation is the
source of the existing NFS and ext2 filesystems --- those two
filesystems tend to be the simplest, most actively-maintained examples
of how to use the VFS layer properly.
> For now a special merging pass will be performed at mount time, and any write
> to a merged tail block will cause the block to be unmerged. This is admittedly
> a poor substitute for incremental merging, nonetheless it accomplishes the
> following:
>
> 	- it's considerably simpler than an incremental algorithm
> 	- it's strictly compatible with an incremental version to be done later
> 	- it's already useful for such things as read-only filesystems
> 	- it works globally and probably achieves pretty good results
> 	- it will allow measurement of space savings and performance impact
> 	- it will allow experimentation with various merging heuristics
> 	- it will get us started
But what about the speed of mount? Especially with a transactional
filesystem, you really don't want the mount to become any slower.
> Reading:
>
> generic_read - needs a new read_actor that accounts for the necessary
> data offset in the tail block of a tail-merged file.

You shouldn't need a new read actor --- nor do you want one (you want
the sendfile actor to continue to work, for example). You should be
able to get this to work just inside ext2_readpage, by reading the
data into the buffer cache and then copying to the page cache for the
special case of the last page.
> Later, after things are stable, the one-time merge pass will be replaced by
> an incremental merge performed at file close time.
Oh, OK, that makes much more sense!
Cheers,
Stephen
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:06:59 +0200
X-Mailer: KMail [version 1.0.28]
Content-Type: text/plain
References: <[EMAIL PROTECTED]> <00072413184801.04767@gimli>
<[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <0007251911570H.04767@gimli>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S
On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
> > For now a special merging pass will be performed at mount time, and any write
> > to a merged tail block will cause the block to be unmerged. This is admittedly
> > a poor substitute for incremental merging, nonetheless it accomplishes the
> > following:
> >
> > 	- it's considerably simpler than an incremental algorithm
> > 	- it's strictly compatible with an incremental version to be done later
> > 	- it's already useful for such things as read-only filesystems
> > 	- it works globally and probably achieves pretty good results
> > 	- it will allow measurement of space savings and performance impact
> > 	- it will allow experimentation with various merging heuristics
> > 	- it will get us started
>
> But what about the speed of mount? Especially with a transactional
> filesystem, you really don't want the mount to become any slower.
Of course. It's just a way to get the basics in place while I design a decent
incremental algorithm.
> > generic_read - needs a new read_actor that accounts for the necessary
> > data offset in the tail block of a tail-merged file.
>
> You shouldn't need a new read actor --- nor do you want one (you want
> the sendfile actor to continue to work, for example). You should be
> able to get this to work just inside ext2_readpage, by reading the
> data into the buffer cache and then copying to the page cache for the
> special case of the last page.
OK, that sounds better. Next question: are there any docs on the way the
buffer cache relates to the page cache? I've begun to decode it from the
2.4.0-xxxx source, but it's slow going.
--
Daniel
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <00072413184801.04767@gimli>
<[EMAIL PROTECTED]> <0007251911570H.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <0007251911570H.04767@gimli>; from [EMAIL PROTECTED]
on Tue, Jul 25, 2000 at 07:06:59PM +0200
Status: R
X-Status: N
Hi,
On Tue, Jul 25, 2000 at 07:06:59PM +0200, Daniel Phillips wrote:
> OK, that sounds better. Next question: are there any docs on
the way the
> buffer cache relates to the page cache? I've begun to decode
if from the
> 2.4.0-xxxx source, but it's slow going.
No, none. Again, it's read-the-source.
Basically, though, they are independent, but they can share the same
page. The only tricky bits involve cleaning up the buffers on a page
during truncate.

You don't _have_ to use the same memory for buffer and page cache if
you don't want to. You can easily use the normal generic_read_page
stuff for most of the pages, but for the tail page just read into the
buffer cache and then copy to the page cache.
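The tail-page copy being suggested can be pictured in a userspace sketch; plain byte arrays stand in for the buffer-cache block and the page, and the function name is mine, not a kernel API:

```c
#include <string.h>

/* Fill the last page of a merged file by copying its fragment out of the
 * shared tail block at tail_offset. In the real kernel path the source
 * would come from the buffer cache and the destination from the page cache. */
static void fill_tail_page(char *page, const char *tail_block,
                           unsigned tail_offset, unsigned tail_len)
{
    memcpy(page, tail_block + tail_offset, tail_len);
}
```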
Cheers,
Stephen
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:12:03 +0200
X-Mailer: KMail [version 1.0.28]
Content-Type: text/plain
References: <[EMAIL PROTECTED]> <00072413184801.04767@gimli>
<[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <0007251916270I.04767@gimli>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S
On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
>
> On Mon, Jul 24, 2000 at 12:42:42PM +0200, Daniel Phillips wrote:
>
> > BTW, do you
> > know where I can find some notes on the current VFS and cache structure,
> > especially the reworked page cache, new buffer cache operations and locking
> > rules for the VFS, buffer cache and page cache?
>
> There aren't any, anywhere. The closest to documentation is the
> source of the existing NFS and ext2 filesystems --- those two
> filesystems tend to be the simplest, most actively-maintained examples
> of how to use the VFS layer properly.
I was afraid you'd say that. Do you mind if I ask you a whole bunch of
questions about it then? I have a volunteer who has indicated willingness to
collate the answers into a locking primer for the new VFS - does that sound
interesting?

And, who should the other victims of these questions be? My guess is:
- Ted T'so
- Alex Viro
- Alan Cox
- Linus?
- Others?
--
Daniel
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <00072413184801.04767@gimli>
<[EMAIL PROTECTED]> <0007251916270I.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <0007251916270I.04767@gimli>; from [EMAIL PROTECTED]
on Tue, Jul 25, 2000 at 07:12:03PM +0200
Status: R
X-Status: N
Hi,
On Tue, Jul 25, 2000 at 07:12:03PM +0200, Daniel Phillips wrote:
>
> I was afraid you'd say that. Do you mind if I ask you a whole bunch of
> questions about it then? I have a volunteer that's indicated willingness to
> collate the answers into a locking primer for the new VFS - does that sound
> interesting?
Sure, but I'm only here for 2 days more before heading off on holiday
for 2 weeks.
> And, who should the other victims of these questions be? My guess is:
>
> - Ted Ts'o
> - Alex Viro
Yes.
> - Alan Cox
> - Linus?
No.
> - Others?
linux-fsdevel@vger?
Cheers,
Stephen
From: Daniel Phillips <[EMAIL PROTECTED]>
To: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Date: Tue, 25 Jul 2000 19:19:39 +0200
X-Mailer: KMail [version 1.0.28]
Content-Type: text/plain
References: <[EMAIL PROTECTED]> <0007251911570H.04767@gimli>
<[EMAIL PROTECTED]>
In-Reply-To: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Message-Id: <0007251936340J.04767@gimli>
Content-Transfer-Encoding: 8bit
Status: RO
X-Status: S
On Tue, 25 Jul 2000, Stephen C. Tweedie wrote:
> On Tue, Jul 25, 2000 at 07:06:59PM +0200, Daniel Phillips wrote:
>
> > OK, that sounds better. Next question: are there any docs on the way the
> > buffer cache relates to the page cache? I've begun to decode it from the
> > 2.4.0-xxxx source, but it's slow going.
>
> No, none. Again, it's read-the-source.
>
> Basically, though, they are independent, but they can share the same
> page.
OK.
> The only tricky bits involve cleaning up the buffers on a page
> during truncate.
Ah, while you're talking about truncate - I'm not exactly clear on whether
a truncate is supposed to be able to work concurrently with a read and/or
write in 2.4.0, or is each operation expected to run to completion before
another can start. It looks to me like all operations were supposed to work
concurrently in early versions of Ext2, but then some locking was added to
simplify things.
> You don't _have_ to use the same memory for buffer and page cache if
> you don't want to. You can easily use the normal generic_read_page
> stuff for most of the pages, but for the tail page just read into the
> buffer cache and then copy to the page cache.
OK.
Should I be posting these questions on fsdevel instead of via private email?
--
Daniel
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
To: Daniel Phillips <[EMAIL PROTECTED]>
Cc: "Stephen C. Tweedie" <[EMAIL PROTECTED]>
Subject: Re: Tail Merging
Message-ID: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]> <0007251911570H.04767@gimli>
<[EMAIL PROTECTED]> <0007251936340J.04767@gimli>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2i
In-Reply-To: <0007251936340J.04767@gimli>; from [EMAIL PROTECTED]
on Tue, Jul 25, 2000 at 07:43:15PM +0200
Status: R
X-Status: N
Hi,
On Tue, Jul 25, 2000 at 07:43:15PM +0200, Daniel Phillips wrote:
> Ah, while you're talking about truncate - I'm not exactly clear on whether
> a truncate is supposed to be able to work concurrently with a read and/or
> write in 2.4.0, or is each operation expected to run to completion before
> another can start. It looks to me like all operations were supposed to work
> concurrently in early versions of Ext2, but then some locking was added to
> simplify things.
Some has been added. In particular, there are semaphores which
protect against things that change the i_size, such as writes and
truncates. You can still write using mmap() during a truncate, though.
> Should I be posting these questions on fsdevel instead of via private email?
Definitely! CC: me if you want, though.
Cheers,
Stephen