>>> On Wed, 26 Oct 2005 11:05:40 -0500, Dave Kleikamp
>>> <[EMAIL PROTECTED]> said:
BTW, in this discussion if we were in the same room and with a
blackboard for doing a couple of pictures I could get across my
guesses/points a lot easier and quicker and with less repetition.
This is such a narrow bandwidth medium... Oh well :-/.
[ ... ]
>> Yes, this is more or less what I was expecting (that is, it
>> obviously first creates 32KiB extents and then coalesces them
>> after writing them).
shaggy> It doesn't work that way today. The blocks are actually
shaggy> allocated one page at a time and the extent is grown
shaggy> with each new allocation. [ ... ]
Uh it is _really_ block-at-a-time. My previous understanding was
that it was a buddy allocator with a block-scoring bitmap, but
it seems it is instead really a bitmap allocator with a tree
index; or perhaps not...
[ ... ]
shaggy> Currently holes can exist either in the middle or at the
shaggy> end of a file. If there is no physical block mapped to
shaggy> a logical block of a file, it is read as zeros. I only
shaggy> suggested the ABNR extents as a way of preallocating
shaggy> contiguous space for the holes, since I thought that was
shaggy> what you were asking for.
Yes, but the unwritten region at the end does not need a special
extent type; it can be part of an existing extent, because its
being unwritten-but-zero is implied by its position, which is
not really the case for a hole in the middle of a file.
The scheme some people have been thinking of is to have _three_
''file sizes'', in order of increasing (or same) value:
* max bytes written; anything beyond this reads as zeroes.
* max bytes readable; anything beyond this is not readable.
* bytes actually allocated.
For an empty but preallocated file of size N, it could be a
single extent of size N. Then the three sizes would be initially
like 0:0:N. Then a seek to the end would make them 0:N:N, and a
write of 4KiB would make them 4096:N:N for example.
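The bookkeeping could be sketched like this (a toy model with made-up names, not JFS code; it just encodes the 0:0:N -> 0:N:N -> 4096:N:N transitions described above):

```c
#include <stdint.h>

/* A minimal sketch of the three ''file sizes'' discussed above. */
struct fsizes {
    uint64_t written;   /* max bytes written; beyond this reads as zeroes */
    uint64_t readable;  /* max bytes readable; beyond this is not readable */
    uint64_t allocated; /* bytes actually allocated */
};

/* In this scheme, a seek to the end of a preallocated file makes
 * the whole allocation readable: 0:0:N becomes 0:N:N. */
static void note_seek_end(struct fsizes *f)
{
    f->readable = f->allocated;
}

/* A write at [off, off+len) raises the watermarks in order;
 * a 4KiB write at offset 0 then gives 4096:N:N. */
static void note_write(struct fsizes *f, uint64_t off, uint64_t len)
{
    if (off + len > f->written)
        f->written = off + len;
    if (f->written > f->readable)
        f->readable = f->written;
    if (f->readable > f->allocated)
        f->allocated = f->readable;
}
```

The invariant is written <= readable <= allocated, matching the ''increasing (or same) value'' ordering above.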
shaggy> [ ... ] Would we need a mechanism to free unused
shaggy> preallocations?
>> [ ... ] Also, preallocations can be ephemeral (disappear on
>> close) or persistent. [ ... ]
shaggy> If preallocations are freed at file close, I'm not sure
shaggy> there's an advantage over the current behavior where jfs
shaggy> locks out other allocations in the AG until the file is
shaggy> closed.
Ahhhh, I can see some advantages, e.g. if the free list is
fragmented. In part because one no longer needs to scatter
parallel writes across different AGs; I'd like to have higher
chances of keeping ''related'' files near each other in the same
AG, not just blocks of the same file near each other.
But also because if the free list is fragmented, there will be
free blocks of potentially many different sizes in it, and
preallocating the final size from the beginning means that the
largest, or at least a large, contiguous block can be reserved
(but I see that your compromise below would achieve this, so fine).
>> - alternatively, a default minimum extent size? So that
>> the extents are initially allocated of that size, but
>> can be reduced by 'close'(2) or 'ftruncate'(2) to the
>> actual size of the file. [ ... ]
This is an alternative to whole-file preallocation: just
preallocate the minimum extent size whenever a write happens,
for example around the address of that write.
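Under the assumption that the minimum extent size is a power of two, the preallocated region around a write could be computed as below (an illustrative helper, not JFS code):

```c
#include <stdint.h>

/* Given a write at [off, off+len), compute the region to
 * preallocate when a minimum extent size min_ext (assumed to be
 * a power of two) is configured: round the write region out to
 * min_ext boundaries. Illustrative only. */
static void prealloc_range(uint64_t off, uint64_t len, uint64_t min_ext,
                           uint64_t *start, uint64_t *end)
{
    *start = off & ~(min_ext - 1);                       /* round down */
    *end   = (off + len + min_ext - 1) & ~(min_ext - 1); /* round up */
}
```

For example, a 4KiB write at offset 300000 with a 256KiB granule would preallocate the single 256KiB region [262144, 524288).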
[ ... ]
shaggy> I can see preallocation adding more complexity to the
shaggy> code. Suppose we preallocate 1M when we first write 4K
shaggy> at the beginning of a file. We then seek out to say 256K
shaggy> and write another 4K. [ ... alternatives ... ]
Well, in this case I'd just zero all the intervening blocks. It
would not be, I hope :-), really that hard to do the other two
things you mention (ABNR or extent splitting), but I guess that
if one sets an option to preallocate, one wants dense files.
I personally reckon that a second per-block bit (written or
not, not just allocated or not) might be useful, but it is
probably too late for that. However, whether a block is
allocated is conceivably recorded redundantly, both in its being
part of an extent and in the bitmap, so one could reinterpret
one of the two. Might be hairy though...
Unless the option is for a minimum extent size, in which case
one just wants chunky files (that is, files with holes, where
however allocated and unallocated regions come in chunks much
larger than a single block).
>> - a maximum extent size? For example to ensure that no extent
>> larger than 256KiB is ever allocated? [ ... ]
shaggy> I'm not sure what that would buy us.
>> Well, it would prevent really large extents from happening. In
>> a buddy system very large allocations (relative to the size
>> of the whole) cause trouble.
shaggy> I don't understand this.
Just speculating... In general theoretical terms, Knuth in
analyzing the buddy system (for RAM allocation though) says that
it has good performance, which is a surprise, as it has the two
problems of potentially lots of internal fragmentation because
of the power-of-two issue, and of free list fragmentation
because of the coalesce-only-buddies issue. The good performance
happens because in his tests most of the blocks are rather small
relative to the size of the arena.
So, well, perhaps I don't know how JFS really allocates stuff,
but suppose that one creates a 5GiB file, preallocated or
written, and suppose there is a free 8GiB block. 3GiB might get
wasted because of the power-of-two issue.
Now suppose the maximum extent size allowed is 1GiB. We get 5
1GiB extents allocated, with a 1GiB and a 2GiB free block left over. This
seems to me a better outcome than one 8GiB extent, and this is
a good thing that can't be done with a RAM buddy allocator.
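The arithmetic above can be sketched as follows (a toy model of a power-of-two buddy allocator with an extent-size cap, not actual JFS allocation code):

```c
#include <stdint.h>

/* Smallest power of two >= x (for x > 0). */
static uint64_t next_pow2(uint64_t x)
{
    uint64_t p = 1;
    while (p < x)
        p <<= 1;
    return p;
}

/* Total space handed out for `want` units when each extent is
 * rounded up to a power of two but capped at `cap` (itself a
 * power of two). With a large enough cap, a single rounded
 * extent is used, so the power-of-two waste shows up. */
static uint64_t buddy_total(uint64_t want, uint64_t cap)
{
    uint64_t total = 0;
    while (want > 0) {
        uint64_t chunk = next_pow2(want);
        if (chunk > cap)
            chunk = cap;
        total += chunk;
        want = (want > chunk) ? want - chunk : 0;
    }
    return total;
}
```

In GiB units, buddy_total(5, 8) is 8 (one 8GiB extent, 3GiB wasted), while buddy_total(5, 1) is 5 (five 1GiB extents, no waste). Note that for 11 units capped at 8, this greedy model gives the 8+4 variant rather than 8+2+1.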
There is also a bit of the BSD FFS/'ext3' logic by which, when
writing large files, they switch cylinder group every now and
then, so that both small and large files in, say, the same
directory can begin nearby in disc distance... This is not
per-file locality, but per-(sub)tree locality, which I think
often matters too.
>> So for example an 11MiB file should ideally be not in one
>> 16MiB extent, but in three ones, 8+2+1MiB (or arguably
>> 8+4MiB), and now I wonder what happens with JFS -- I shall
>> try.
shaggy> I can see that having 3 extents is no worse, but I can't
shaggy> see why you would want to avoid a larger extent.
Again, if it is no worse, why not give the option ''just-in-case''?
:-)
But more seriously, for example, because in this case one saves
5GiB, if the single larger extent must be 16GiB for an 11GiB
file. Those 5GiB of internal fragmentation not only waste space,
they also make the arm travel further over unused data.
[ ... ]
shaggy> I still don't get the problem of tying up the buddy
shaggy> tree. If an extent takes up an entire AG, that's great.
shaggy> We've got a better chance of finding contiguous free
shaggy> space in other AGs.
Yes, if all you care about is single-file performance, and
performance just after a fresh load. But suppose one cares also about
keeping files that are logically nearby (e.g. in the same
directory) nearby on the disc too, and what happens when the
free list becomes fragmented.
And as long as file bodies are _mostly_ contiguous, that's fine.
>> Conversely, when allocating smaller files, having a minimum
>> advisory extent size of say 256KiB can help ensuring
>> contiguity even in massive multithreaded writes or on
>> fragmented free lists.
shaggy> I guess you're not interested in efficient use of free
shaggy> space.
Well, if the user sets a minimum (fixed or default) extent size,
that's a tradeoff they make knowing what they are doing. But as
remarked below, I may not have made it clear enough that such
options for adjusting allocation/preallocation granularity would
default to 0 or infinity (for the min/max granule), so by
default allocation would be exactly as it is now.
[ ... ]
>> * In many important cases files are overwritten with contents
>> of much the same size as they were (e.g. recompiling a '.c'
>> into a '.o'). So one might as well, tentatively, preserve
>> the previously allocated size on a 'ftruncate', and then do
>> it for real on a 'close'.
shaggy> Interesting. We'd have to be careful about leaving
shaggy> stale data in pages that may not be written to. They
shaggy> would either have to be zero-filled, or have a hole
shaggy> punched into the file.
That's why ideally one has a ''max byte written so far'' high
water mark, not just the ''max readable'' and ''max allocated''
ones. My expectation is that seeking around while writing is
actually rather rare...
The other classic example is repeated package ('.rpm', '.deb')
upgrades. Almost always the upgraded package has the same files
with the same or much the same sizes, just different contents.
shaggy> (Does the compiler really truncate an existing file and
shaggy> re-write it, or does it completely replace the .o with a
shaggy> new file?)
Most such programs are stupid, unfortunately. But modifying them
is very easy (compiler, 'tar', 'cp', ...), and even easier and
probably almost as good is to modify what 'stdio' does for
example with a 'fopen(....,"w")': instead of that becoming an
'open'(2) with 'O_CREAT|O_TRUNC', which deallocates the existing
blocks, do it with 'O_CREAT|O_RDWR' (no 'O_TRUNC'), and then
'ftruncate' on 'fclose'(3).
One of the scandals of our modern times is that various libcs
and kernels don't take advantage of the useful implicit hints
in 'fopen'(3)/'open'(2) options, both as to allocation and
read/write clustering and access patterns.
One could also easily modify the 'open'(2) implementation to
make 'O_TRUNC' equivalent to 'O_RDWR' plus resetting the ''max
readable'' watermark to zero, and then auto-truncate on
'close'(2).
Either could be done by 'LD_PRELOAD' of a suitable set of
wrappers, at least initially.
[ ... ]
>> * Imagine a 2300GiB JFS filesystem, with a minimum extent
>> size of 1GiB and a maximum extent size of say 16GiB (never
>> mind the AG limits :->), mounted perhaps with '-o
>> nointegrity'.
shaggy> Uh, you'd be willing to lose everything if your system
shaggy> crashed or lost power? If not, you don't want nointegrity.
Ahhhh, but all that journaling does in JFS so far is to protect
_metadata_ transactions. Once the virtual volumes are created,
there are no further metadata updates, except perhaps for the
inode time fields. Admittedly by the same argument there is not
much point in disabling journaling for the ''pool'' JFS filesystem.
All the journaling that matters would happen _inside_ the files
(virtual volumes), and I would not disable that...
>> * Such a filesystem plus '-o loop' (built on 'md' if needed)
>> looks to me like a ''for free'' LVM, and with essentially
>> the same performance, and with no need for special
>> utilities or configuration.
shaggy> You're losing me here. I don't think we need a
shaggy> filesystem to replace lvm.
Yes, but if a filesystem can perform nearly as well as DM/LVM2,
having the option to use it like that seems to me to be rather
valuable, if only for the sake of minimizing entities.
For example, one of the major uses of DM/LVM2 is for Oracle
tablespaces, for two reasons:
* Tablespaces should be ideally contiguous and low overhead,
so partitions are often used (even if some people think this
is not necessary).
* Many Oracle databases have hundreds of them, and one can
only create so many real partitions, and managing them is a
pain regardless.
It is therefore in this case that DM/LVM2 is used as a crude
replacement for JFS.
Now consider: instead of creating hundreds of logical volumes
with DM/LVM2, just create hundreds of ordinary preallocated
files with JFS with high minimum extent size and somewhat higher
maximum extent size. Quick and easy and same performance.
And I am fairly sure of this, because this article:
http://WWW.Oracle.com/technology/oramag/webcolumns/2002/techarticles/scalzo_linux02.html
says that raw tablespaces (under DM/LVM2) are _slower_ than
using 'ext3' files (and JFS does not do too badly).
My guess is that this is because these are obviously (as in most
naive benchmarks) freshly loaded filesystems, and 'ext3'
achieves optimal layout on freshly loaded data (and JFS almost does).
So my further guess is that if the layout is good, a file system
can beat DM/LVM2 at its own game, because DM/LVM2 is in effect
a crude large-extent large-file filesystem.
>> My understanding of the current logic is to allocate small
>> extents (the size of a 'write'(2) I guess), use the hint to
>> put them near each other and then coalesce them if possible
>> (if the hints worked that well).
shaggy> Ideally, the size of the extents would be the size of
shaggy> the write. In most cases, we are doing allocation a page
shaggy> at a time. If there is space immediately after the
shaggy> previous extent, that extent is extended to contain the
shaggy> new blocks, so there is no coalescing going on.
Just nitpicking, but to illustrate my mental model of JFS: that
«is extended» is in effect coalescing, unless the buddy system
is really a fiction. Suppose that 2GiB have just been written in
the currently open extent, and that these 2GiB are all inside
the first of a pair of 2GiB buddies. Write another byte and
you need to allocate the second 2GiB buddy, thus coalescing it
with the first one; effectively the 2GiB+1 extent is now
contained in a 4GiB buddy.
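The nitpick can be put in arithmetic terms (a toy model, not JFS code): the buddy containing an extent that starts on a buddy boundary is the smallest power of two covering it, so growing past a power-of-two boundary doubles the containing buddy.

```c
#include <stdint.h>

/* Smallest power-of-two block containing an extent of `size`
 * bytes that starts on a buddy boundary. Growing an extent past
 * a power-of-two boundary implicitly coalesces two buddies. */
static uint64_t containing_buddy(uint64_t size)
{
    uint64_t p = 1;
    while (p < size)
        p <<= 1;
    return p;
}
```

A 2GiB extent fits in a 2GiB buddy; one more byte and the containing buddy is 4GiB, which is the ''coalescing'' in disguise.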
shaggy> It is not as complex as preallocation. Even if we know
shaggy> in advance the size of the file, we would have to make
shaggy> sure that unwritten pages are zeroed, either by
shaggy> physically writing zero to disk, or punching holes in
shaggy> the extent (either a real hole, or an ABNR extent).
Or use the written-readable-allocated ''sizes''. I would expect
most preallocations to be for sequentially written files, so I
would not worry much about holes in the middle.
[ ... top-down vs. bottom up allocation ... ]
shaggy> Hmm. Maybe there's a compromise. When doing allocations
shaggy> for file data, jfs could search the binary buddy tree
shaggy> for an extent of a certain size (say 1 MB), but continue
shaggy> to allocate as it does. That way a sequentially-written
shaggy> file would grow contiguously into that space.
Yes, that seems quite a good idea as it would most likely
achieve the same effect with minimal code disruption.
Now, this is equivalent to preallocation, given that the whole
AG is locked: in effect the 1MiB buddy is preallocated.
[ ... ]
>> The general case for preallocation is made for example in
>> this 'ext[23]' paper:
>> http://WWW.USENIX.org/events/usenix02/tech/freenix/full_papers/tso/tso_html/
shaggy> [ ... ] If we allow some explicit mechanism to
shaggy> preallocate a large file, I think we would have some
shaggy> options.
Yes, and there are two ways to do so, in-band and out-of-band;
in-band, which is what the 'ext3' guys are mostly thinking
about, may require changing APIs or adding extended attributes
to files. My ''on-the-cheap'' preference is for out-of-band
options, that is either global or per-filesystem.
shaggy> Maybe we could implement dense files and use ABNR
shaggy> extents in some explicit cases. Again, if we have some
shaggy> way to know to begin a file where there is a lot of free
shaggy> space, and can lock out other allocations, we should get
shaggy> the desired results.
Yes, that sounds reasonable. As indicated elsewhere, ABNR is
probably not that needed, because I expect most preallocation to
be sequential (no holes in the middle) or to be by large-granule
extents, with any holes coming in large chunks.
[ ... ]
shaggy> I guess I'm uncomfortable preallocating all the time,
shaggy> since it will lead to more fragmentation. If every
shaggy> small file begins at a 1 MB offset, we'll have lots of
shaggy> free space in between these small allocations. [ ... ]
This is a big misunderstanding, sorry; I did say that I would
like either a global or per-filesystem set of options, like:
* 'smallest-extent-size', [default 0]: no newly created extent
can be smaller than this.
* 'default-extent-size' [default 0]: extents are initially
created at this size, and the last one is truncated on close.
* 'largest-extent-size', [default 0 to mean infinity]: no newly
created extent can be larger than this.
These could be either in '/proc/fs/jfs/' (global), or as mount
options (per-filesystem).
Then ideally 'ftruncate' would also support the ''truncate to a
larger size than the current one'' semantics (obeying the above
options too).
>> Then there is the ''DM/LVM2'' replacement story...
shaggy> I don't want to go there. :^)
Oh no.... :-)
But again, suppose that you can create a 2300GiB JFS filesystem
over an MD RAID, and within this you can efficiently create
(which means, preallocated, mostly contiguous, unwritten)
50-500GiB files in say 1-10GiB extents, say for tablespaces, or
virtual machine discs, to be mounted '-o loop'...
Sure, one can do that now but there are a couple of annoyances:
* To ensure best contiguity all the volumes should be created
just after 'jfs_mkfs', when the free list is mostly contiguous
and the buddy system intact. And this may achieve too much
contiguity.
* The big files need to be actually _written to_ to achieve
allocation.
One would either preallocate the large files, or set a minimum
mandatory extent size of say 4GiB and not preallocate, and let
the filesystem be allocated in 4GiB chunks.
It sounds wonderful to me... MD/JFS/'loop' would then do at
least 90% of what DM/LVM2 do, for free.
It would be 100% if you implemented reverse-copy-on-write
(that is, snapshot) files :-).
BTW, the Oracle tablespace and VM virtual disc stories are part
of my interests, and these are about DM/LVM2 replacement.
But I am also interested in the upgrade-the-installed-packages
and the archive-of-DVD-images for a video-on-demand server
stories, in case that was not obvious.
All these would rather benefit from preallocation either
whole-file or chunky-extents...
_______________________________________________
Jfs-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jfs-discussion