Fwd: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence

Daniel Phillips Mon, 12 Mar 2001 09:14:08 -0800


----------  Forwarded Message  ----------
Subject: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence
Date: Mon, 12 Mar 2001 13:45:32 +0100
From: Daniel Phillips <[EMAIL PROTECTED]>


On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> At 02:44 12/03/2001, Alexander Viro wrote:
> >On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> [snip]
> > > 1) Makes the ext2 filesystem independent of the kernel's BLOCK_SIZE by
> > > making use of the already defined but for some reason unused
> > > EXT2_MIN_BLOCK_SIZE. This makes ext2 work on kernels with BLOCK_SIZE !=
> > > 1024 and anyway, there is no good reason to depend on BLOCK_SIZE being any
> > > particular value.
> >
> >Why would you need to redefine it? Not that I had any objections on the
> >ext2 side, but... What's the point?
> 
> Point is to eventually have the kernel working in 512 byte blocks rather 
> than 1024, which makes a lot more sense IMHO, as most block devices (most 
> hard drives) use a sector size of 512 bytes and NTFS for example has a 
> granularity of 512 bytes at the lowest level.

What is sacred about 512 byte blocks?  Talking to hard disk engineers I
get the impression that many disks are using whatever blocking scheme
they want internally and doing a translation to create the appearance
of 512 byte blocks.  The only reason we see 512 byte sectors by
default is that Sony floppy disk controllers worked in 512 byte
sectors and DOS->VFAT has never been able to understand a hard disk as
anything other than a big floppy disk.  Right now, Linux is trying to
be part of the problem; the only difference is that we are trying to
impose an arbitrary 1K block measure on the world instead of the equally
arbitrary 512 byte measure.

Instead of continuing to be part of the problem, why don't we try to
make some forward progress?  It would be so much nicer if we measured
all device sizes and filesystem sizes in the kernel in units of the
device's own block size (1 << device->block_shift).  Then we could get
away completely from the awkward 1K *and* 512 byte magic numbers, not
to mention fixing at least one bug, losing some filesystem/file size
limits, making the code more efficient and improving readability.
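
To make the idea concrete, here is a minimal sketch of counting a device
in its own block unit.  The struct and both helpers are hypothetical
names invented for illustration; they are not existing kernel code, and
block_shift is assumed to be log2 of the block size in bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device descriptor for illustration; not a kernel struct. */
struct device {
    unsigned block_shift;   /* assumed: log2 of the block size in bytes */
    uint64_t size_bytes;    /* raw capacity of the device in bytes */
};

/* Number of whole blocks on the device, counted in the device's own unit. */
static uint64_t dev_blocks(const struct device *dev)
{
    assert(dev->block_shift < 32);
    return dev->size_bytes >> dev->block_shift;
}

/* Trailing bytes that no whole block covers, i.e. that become unreachable
 * when every access is expressed in blocks of 1 << block_shift bytes. */
static uint64_t dev_unreachable_bytes(const struct device *dev)
{
    return dev->size_bytes - (dev_blocks(dev) << dev->block_shift);
}
```

With an odd number of 512 byte sectors, a shift of 9 leaves nothing
unreachable, while a fixed shift of 10 strands the last sector.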

> My reasoning for making ext2 
> independent of BLOCK_SIZE is that Linus might not want to have the 
> BLOCK_SIZE changed in the kernel but he might accept to have it as a 
> configure option, but for such an option to be viable it means all the code 
> in the kernel has to become independent of BLOCK_SIZE or at least cope with 
> it being different value rather than assume that it equals 1024. I just 
> thought submitting my ext2 patch is as good a place to start as any.

I'd suggest leaving BLOCK_SIZE entirely alone - just rename it to
BLOCK_SIZE_1024 or similar - and get started on the process of
generalizing the 42 or so kernel bits one at a time, getting completely
away from the idea of a fixed BLOCK_SIZE in a series of nice, easy,
forward/backward-compatible steps.  My point is, it doesn't make
any sense at all to change the constant BLOCK_SIZE to a different
constant.  Block size is by nature variable so let's recognize that and
handle it.

> Note that changing BLOCK_SIZE to 512 automagically solves the current 
> problem of not being able to access the last odd sector on a disk/partition 
> for example and it solves software RAID devices on NTFS (NTFS quite happily 
> will split a cluster into two parts to use up the last not cluster size 
> granular number of sectors on a partition, so the cluster will be contained 
> part in one  partition and part in the next partition in RAID array! That 
> really kills us at the moment as software RAID uses BLOCK_SIZE blocks and 
> assumes nobody is crazy enough to split a block in two.)

I noticed that, but using device->block_shift is a better solution. 
If we go to BLOCK_SIZE=512 we get the 2TB limit on everything unless we
go to long long, which generates horrible code on ARCH=i386.  Using
device->block_shift is much better - it actually improves the 32
bit code in many places.
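
The arithmetic behind that limit can be sketched as follows, assuming
block numbers are held in an unsigned 32 bit integer.  The function name
is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Largest size in bytes that 2^32 block numbers can address when each
 * block is 1 << block_shift bytes.  Assumes unsigned 32 bit block numbers. */
static uint64_t max_bytes_32bit(unsigned block_shift)
{
    assert(block_shift < 32);
    return (uint64_t)1 << (32 + block_shift);
}
```

A shift of 9 (512 byte blocks) gives 2^41 bytes, the 2TB limit; a shift
of 10 gives 2^42 bytes; a device with 4K blocks (shift 12) reaches 2^44
bytes without needing 64 bit block numbers.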

> [snip]
> >I don't see the bug in #1. As a matter of taste - why not, but if you
> >really want to redefine the BLOCK_SIZE you'll most likely find that
> >places where you want a new value are _seriously_ outnumbered by
> >places where you'll need to preserve the old one. I.e. it's easier to
> >replace BLOCK_SIZE with something else in the places where you want a
> >new value.

Here's a pointer to all those references for your browsing enjoyment:

  http://innominate.org/~graichen/projects/lxr/ident?v=v2.4&i=BLOCK_SIZE
  (BLOCK_SIZE referenced in 42 files)

> Well, but then it would become confusing! BLOCK_SIZE is supposed to be the 
> default kernel block size. (I know we could rename it, but...) I realize 
> that it is not trivial.

It's trivial - BLOCK_SIZE -> BLOCK_SIZE_1024, and don't reuse the old
symbol.  Then at our leisure we can exterminate most of the
BLOCK_SIZE_1024's, wherever the underlying device blocksize
is a better measure (roughly speaking: everywhere but fs->stat, and even
there we can easily drop the number of refs to 1/function instead of
~5).

> I know as I have converted all subsystems that I 
> use personally and am running a BLOCK_SIZE = 512 kernel on my VMware setup 
> and have been doing for ages without problems (well, there are a few minor 
> things left to sort out but it does work and is stable, the only thing I 
> get are spurious messages from ll_rw_block about submitting wrong sized 
> requests, and that only when I run fdisk(!), but then again ll_rw_block 
> will be dropping that limitation in the future anyway).
> 
> I guess the alternative to changing the BLOCK_SIZE would be to fix 
> ll_rw_block and friends so they can return the last sector (by modifying 
> the out of bounds check, jumping out of the fast path and performing a 
> proper check perhaps) as well as software raid and that would suffice but 
> it's a hack.

*Ick*.  I guess that's the reaction you wanted?

I strongly support the idea that BLOCK_SIZE is broken and needs to be
fixed but I don't think you're going far enough.

-- 
Daniel
-------------------------------------------------------
