>>> On Sun, 23 Oct 2005 22:53:02 -0500, Dave Kleikamp
>>> <[EMAIL PROTECTED]> said:

shaggy> On Mon, 2005-10-24 at 01:06 +0100, Peter Grandi wrote:

[ ... ]

>> * How to see how many extents of which size have been
>>   allocated to an inode from the command line?
>> * How to list the free list from the command line?

shaggy> There is no free list.  The block map is a binary-buddy
shaggy> tree, where the leaves contain bitmaps.

Ahhh, I mentioned that the free list is a buddy tree somewhere
else. I have done _some_ homework :-).

shaggy> [ ... 'jfs_debugs' and 'xtree' and 'dmap' ... ]
shaggy> Hopefully, it's not too hard to navigate around.

Well, I was angling for something neater than using
'jfs_debugfs', but perhaps I can wrap it up in some script...

>> * Suppose a filesystem is empty, and 'tar' extracts to it an
>>   8MiB file, writing 32KiB blocks. How many extents will it
>>   span? After it has been written, what will the free list
>>   look like?

shaggy> I just tried creating a file with 'dd bs=32768
shaggy> count=256' on a newly formatted jfs volume, and it went
shaggy> into one extent.

Yes, this is more or less what I was expecting (that is, it
obviously first creates 32KiB extents and then coalesces them
after writing them).

[ ... ]

shaggy> There's no code to do that now, but we could create an
shaggy> extent and mark it ABNR (allocated but not recorded).
shaggy> This is a holdover from OS/2, where the default behavior
shaggy> was to have dense files, rather than sparse ones.

Uhm, instead of having ABNR one could simply have the
convention that the whole area between the file size and its
allocated length is zero-on-demand, whether allocated or not.
ABNR seems meant for a hole ''in the middle'' of a file, and the
difference between that and a preallocated area beyond its end
is that while in the former case the preallocated area is
accessible, in the latter it is not; so if the former is
unwritten one must flag it specially or actually zero it, while
the latter needs neither (the flag is its position). However, a
'seek' beyond the current file end might require turning some of
that area into ABNR extents or zeroing it... Bah!

shaggy> [ ... ] Currently, space is allocated one page at a
shaggy> time. [ ... appending ... ] we generally end up with
shaggy> large extents.

Yes, but this results in rather hairy code to handle
after-the-fact coalescing of buddy extents (I guess the JFS
terminology is the ''backsplits'' mentioned in the comments),
and can result in other suboptimal behaviour; more on this
later.

[ ... ]

>> - a minimum extent size? for example to ensure that no extent
>>   smaller than 1MiB is allocated? The purpose is to ensure
>>   that files tend to be contiguous.

shaggy> Non-trivial, but it shouldn't be a show-stopper.
shaggy> jfs_fsck would have to be taught to not flag an error if
shaggy> blocks are allocated beyond the file's size.

I believe this would be a generally good idea, and modifying
'jfs_fsck' for this might be pretty easy. I might try...

shaggy> [ ... ] Would we need a mechanism to free unused
shaggy> preallocations?

For an ''unconditional'' minimum extent size that is a tunable,
I would not handle low space conditions at all.

And actually in general, for a few reasons:

* In any case it is a user tunable. The user can always leave
  the 1-block default unchanged.

* If you have a low space condition where one (or even more than
  one) 1MiB extent makes a difference on a, say, 40GiB
  filesystem, who cares...

* Several memory allocation studies (and we can sort of rely on
  them here too) show that in most cases it is pointless to
  handle low free memory conditions cleverly, because doing so
  just delays an almost inevitable overflow rather than avoiding
  it. Again, especially if the margin is of the order of 1MiB
  over a 40GiB arena.

Also, preallocations can be ephemeral (disappear on close) or
persistent. The former are useful in their own right and have
very little downside.

>> - alternatively, a default minimum extent size? So that the
>>   extents are initially allocated of that size, but can be
>>   reduced by 'close'(2) or 'ftruncate'(2) to the actual size
>>   of the file. [ ... ]

shaggy> A jfs volume is logically divided into a number of
shaggy> allocation groups (AGs).  While a file is opened, jfs
shaggy> will always try to put allocations to other files in a
shaggy> separate AG. This generally works pretty well, [ ... ]

Yes, this is like in BSD FFS and 'ext[23]' cylinder groups,
where it is also used to ensure all ''groups'' have some free
space.

As you say, it works well especially to prevent interleaving on
parallel writes. But I surmise that a preallocation logic
delivers the same result more reliably, with a few more
advantages.

>> - a maximum extent size? For example to ensure that no extent
>>   larger than 256KiB is ever allocated? The purpose is to
>>   minimize internal fragmentation by allocating only at the
>>   lower levels of the buddy system.

shaggy> I'm not sure what that would buy us.

Well, it would prevent really large extents from happening. In a
buddy system very large allocations (with respect to the size of
the whole arena) cause trouble.

>> I hope that the rationales are fairly clear;

Well, perhaps a bit more is needed, so bear with me if I repeat
a bit of the obvious here, to illustrate the rest.

A buddy system has two big problems for _memory_ allocation: it
does not coalesce adjacent free blocks that are not buddies in
the same branch (''buddy system'' in the language of the
comments in the JFS source), and it can waste a lot of space in
internal fragmentation because of the power-of-2 rounding.

These problems are big for memory allocation, especially if
allocations are not very small with respect to the size of the
arena. But they are essentially irrelevant for file allocation,
and thus the use of a buddy system in JFS is fairly brilliant,
because files, unlike memory blocks, do not need to be wholly
contiguous. It is only for performance reasons that they should
be mostly contiguous.
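To make the branch-only coalescing point concrete, here is a toy
sketch (my own illustration, not JFS code) of the buddy
computation:

```python
# Toy illustration of binary-buddy coalescing (not JFS code).
# A free block of size 2^k at offset `off` can only merge with
# its buddy at `off ^ 2**k`; two adjacent free blocks that are
# not buddies (different branches of the tree) stay separate.

def buddy(off, size):
    """Offset of the buddy of the block [off, off+size)."""
    return off ^ size

# Blocks at offsets 0 and 4 (size 4) are buddies: they can merge
# into one size-8 block rooted at offset 0.
assert buddy(0, 4) == 4

# Blocks at offsets 4 and 8 (size 4) are adjacent but NOT
# buddies: offset 4's buddy is 0 and offset 8's buddy is 12, so
# even though [4,8) and [8,12) are contiguous free space, the
# buddy tree cannot coalesce them.
assert buddy(4, 4) == 0
assert buddy(8, 4) == 12
```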

The size of an optimally contiguous extent should depend mostly
on the speed characteristics of the disc, which are fairly
constant. If one has discs with a seek time of 5-10ms, that do a
full rotation in 5-10ms, and have a transfer rate of 50-100MiB/s
(that is, 50-100KiB/ms), it is not that essential to have 100MiB
contiguous regions; a MiB at a time or so (that is, a few ms at
a time) is pretty reasonable.

So for example an 11MiB file should ideally be not in one 16MiB
extent, but in three of 8+2+1MiB (or arguably 8+4MiB), and now I
wonder what happens with JFS -- I shall try.
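The 8+2+1MiB split is just the binary decomposition of the file
size; a small sketch (my own illustration, not necessarily JFS's
actual policy):

```python
# Decompose a file size into power-of-two extents, largest
# first, as a buddy allocator naturally would (a sketch, not
# JFS's actual allocation policy).

def extents_for(size_mib):
    """Return power-of-two extent sizes summing to size_mib."""
    out = []
    bit = 1 << (size_mib.bit_length() - 1)
    while size_mib:
        if size_mib & bit:
            out.append(bit)
            size_mib -= bit
        bit >>= 1
    return out

print(extents_for(11))   # [8, 2, 1]: an 11MiB file as 8+2+1MiB
```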

At the same time truly large extents ''tie up'' many higher
levels of the buddy tree, and too much of an AG. So instead of
having a say 1GiB file as a single extent, having it as several
64MiB (or whatever) extents may be perfectly reasonable and give
some performance advantages. Even DM/LVM2 allocates logical
volumes with that kind of granularity.

Conversely, when allocating smaller files, having a minimum
advisory extent size of say 256KiB can help ensure contiguity
even in massive multithreaded writes or on fragmented free
lists.

Also, in general preallocation (that is, the ability to have
file length != size by semi-arbitrary amounts) can help
considerably in several special cases.

I have in mind in particular these scenarios:

* In many important cases the end size of a file is well known
  in advance when the file is created. So one might as well
  preallocate the whole file in advance.

* In many important cases files are overwritten with contents of
  much the same size as they were (e.g. recompiling a '.c' into
  a '.o'). So one might as well, tentatively, preserve the
  previously allocated size on a 'ftruncate', and then do it for
  real on a 'close'.
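The known-end-size case can already be expressed today through
the generic POSIX preallocation interface; a minimal sketch of
the ''preallocate up front, then shrink-wrap'' idea (using
'posix_fallocate', which is generic POSIX, not a JFS-specific
mechanism; it requires a platform that supports it, e.g. Linux):

```python
import os
import tempfile

# Sketch: the file's end size is known in advance, so reserve
# the blocks up front, then trim the unused preallocation down
# to the bytes actually written (the ''shrink-wrap on close''
# idea from the text).
final_size = 8 * 1024 * 1024            # 8MiB, known up front

fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, final_size)  # reserve blocks now
    written = os.write(fd, b"x" * 4096)    # write the actual data
    os.ftruncate(fd, written)              # trim the excess
    assert os.fstat(fd).st_size == written
finally:
    os.close(fd)
    os.unlink(path)
```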

The two above points (which would be greatly enhanced by trivial
changes to 'libc' and some common utilities) are relevant to me
because I care also about minimizing ''churn'' over the lifetime
of a filesystem, not just how well it performs freshly laid out,
where the free list starts as a single contiguous block and
remains totally contiguous.

  Note: indeed, considering that many filesystems are created
  from a 'tar x' (resulting in a ''perfect'' layout) and then
  updated, overwrites would help preserve the initially
  ''perfect'' layout.

I have another scenario in mind:

* DM is basically a simple-minded ''manual allocation of
  extents'' filesystem, and LVM2 is basically '-o loop' over it.

* Imagine a 2300GiB JFS filesystem, with a minimum extent size
  of 1GiB and a maximum extent size of say 16GiB (never mind the
  AG limits :->), mounted perhaps with '-o nointegrity'.

* Such a filesystem plus '-o loop' (built on 'md' if needed)
  looks to me like a ''for free'' LVM, and with essentially the
  same performance, and with no need for special utilities or
  configuration.

>> [ ... ] part of that is to short circuit when possible the
>> somewhat hairy ''hint'' related logic in 'jfs_dmap.c' and
>> that in 'jfs_open()' for example.

shaggy> I don't understand the problem with the "hint".  The
shaggy> hints are used to attempt to allocate file data near
shaggy> the inode, or to append onto existing extents when the
shaggy> following blocks are available.

Ah yes, sure, but they have a hairy logic and they have
conceivable performance limitations.

My understanding of the current logic is to allocate small
extents (the size of a 'write'(2) I guess), use the hint to put
them near each other and then coalesce them if possible (if the
hints worked that well). But this is complex and has some
limitations (which do not happen if one is doing an initial load
of a filesystem, but otherwise may bite). It is designed to make
the best of a bad lot, if one does not know in advance the end
size of a file.

Allocating larger extents to start with and then cutting them
down on 'close' or whatever seems to me it could be rather more
successful if the free list (the buddy tree) is already a bit
fragmented, and in any case it directly handles several common
cases where the end size is known in advance.

  Another way to put it: the difference between attempting to
  allocate directly a 1MiB extent and something like 64x16KiB
  ones is that the first does a top-down scan of the buddy tree,
  while the second does in effect a bottom-up one, trying to
  find a 1MiB free block ''after the fact''. Even with hints,
  the latter probably only works well if the free list is really
  almost unfragmented. Also, there may be several 1MiB free
  blocks in a somewhat fragmented free list, but it is random
  whether the 16KiB extents get allocated inside one, so
  coalescing them after the fact back into the original 1MiB one
  might not work that well.
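The alignment constraint behind this can be shown with a toy
check (my own illustration, not JFS code): in a buddy system a
block of size S can only sit at offsets that are multiples of S.

```python
# A run of small extents coalesces back into one large buddy
# block only if it happens to start on a boundary aligned to the
# large size; a top-down allocation of the large block gets that
# alignment for free. (Toy illustration, not JFS code.)

KIB = 1024
MIB = 1024 * KIB

def can_coalesce_to(start, count, ext_size, target):
    """True if `count` contiguous extents of `ext_size` starting
    at `start` exactly cover an aligned block of size `target`."""
    return count * ext_size == target and start % target == 0

# Aligned run of 64 x 16KiB: coalesceable into one 1MiB block.
assert can_coalesce_to(0, 64, 16 * KIB, MIB)

# Same run shifted by one 16KiB extent: still contiguous, but
# unaligned, so the buddy tree cannot merge it into one 1MiB
# block.
assert not can_coalesce_to(16 * KIB, 64, 16 * KIB, MIB)
```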

[ ... ]

shaggy> I think preallocation may be useful in some
shaggy> circumstances, i.e. when a file is created
shaggy> non-sequentially,

I think that, apart from non-sequential or parallel writes (and
the AG switch helps in the latter case), it is also useful when
the appends have already happened and the free list (buddy tree)
is already fragmented (where the AG switch does not help).

The general case for preallocation is made for example in this
'ext[23]' paper:

  http://WWW.USENIX.org/events/usenix02/tech/freenix/full_papers/tso/tso_html/

and later updates like this (even if some of the arguments do
not apply that much to JFS):

  http://ext2.SourceForge.net/2005-ols/paper-html/node6.html

But I think preallocation in the context of a buddy/extent based
allocator and free list manager makes even more sense.

shaggy> but I am concerned that leaving preallocated, but
shaggy> unused, blocks between actual file data

But the unused blocks would not necessarily be left there: they
would disappear on close, if any are left at all (that is, if
the preallocation was overestimated, which for things like files
written by 'tar' or 'gcc -c' would be impossible or rather
rare), unless one marked the file as ''persistently
preallocated'', in which case they would be kept on purpose,
e.g. for logs or virtual volume images a la DM/LVM2.

shaggy> will result in more fragmentation, or just wasted space
shaggy> on the disk.

As to fragmentation, I suspect less, because of the particular
aspects of a buddy system, which strongly favours keeping large
blocks together, so that when they are deallocated they form
whole related (thus splittable/coalesceable) subtrees. The
wasted space can be minuscule if it is ''shrink-wrap on close'',
or irrelevant if one sets something like a ''keep
preallocation'' flag.

Then there is the ''DM/LVM2'' replacement story...

[ ... ]



_______________________________________________
Jfs-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
