Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Andreas Dilger
On Feb 26, 2008  08:39 -0800, Eric Sandeen wrote:
> Takashi Sato wrote:
> 
> > o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
> >   As Andreas Dilger and Christoph Hellwig advised me, I have elevated
> >   them to include/linux/fs.h as below.
> > #define FIFREEZE_IOWR('X', 119, int)
> >    #define FITHAW  _IOWR('X', 120, int)
> >   The ioctl numbers used by XFS applications don't need to be changed.
> >   But my following ioctl for the freeze needs the parameter
> >   as the timeout period.  So if XFS applications don't want the timeout
> >   feature as the current implementation, the parameter needs to be
> >   changed 1 (level?) into 0.
> 
> So, existing xfs applications calling the xfs ioctl now will behave
> differently, right?  We can only keep the same ioctl number if the
> calling semantics are the same.  Keeping the same number but changing
> the semantics is harmful, IMHO

Do we know what this parameter was supposed to mean?

We could special case "1" if needed to keep compatibility (documenting
this clearly), either making it == 0, or some very long timeout (1h
or whatever).  A relatively minor wart I think.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version changes

2008-02-13 Thread Andreas Dilger
On Feb 13, 2008  09:07 -0500, Trond Myklebust wrote:
> On Wed, 2008-02-13 at 13:52 +0100, Christoph Hellwig wrote:
> > Btw, stupid question:  the commit message for the i_version changes
> > mentions this is to work around lack of granularity for ctime updates.
> > But all modern filesystems (and I includ ext4 in that here) have 64bit
> > timestamps already, so that should be enough.  It would certainly
> > avoid all this additional code, and especially the additional space
> > used in struct inode which can hurt quite a lot.
> 
> Support for 64-bit on-disk time stamps alone does not suffice. In order
> to label all file/directory changes unambiguously, you would also need
> nanosecond timekeeping support.
> 
> An example is XFS, which has had on-disk support for 64-bit time stamps
> since forever, but is in practice limited by the Linux default 250Hz
> internal clock. We've seen plenty of examples of NFS clients missing
> updates on the resulting filesystem due to the fact that they occurred
> within 1/250 sec of each other.

The other issue which unfortunately makes ctime a non-starter is the
ability of ctime to go backward due to clock changes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version changes

2008-02-13 Thread Andreas Dilger
On Feb 12, 2008  15:06 -0500, J. Bruce Fields wrote:
> On Sun, Feb 10, 2008 at 08:30:41AM +0100, Christoph Hellwig wrote:
> > Third using the MS_ flag but then actually having a filesystem
> > mount option to enable it is more than confusing.  After all MS_
> > options (at least the exported parts) are the mount ABI for common
> > options.  Also this option doesn't show up in ->show_options,
> > which is something Miklos will beat you up for :)
> > I'm also not convinced this should be option behaviour, either you
> > do update i_version for a given filesystem or you don't - having
> > an obscure mount option will only give you confusion.
> 
> That does sound likely to be confusing.  Any chance we could just make
> the new behavior mandatory?

One of the reasons NOT to make it mandatory is that it forces updates
of the inode after every write.  On ext3/ext4 this is expensive, as the
ext3_dirty_inode() packs the inode from memory into the buffer each time,
so that it can be journaled.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [sample] mem_notify v6: usage example

2008-02-11 Thread Andreas Dilger
On Feb 10, 2008  01:46 +0900, KOSAKI Motohiro wrote:
> > This really needs to be triggered via a generic kernel event in the
> > final version - I picture glibc having a reservation API and having
> > generic support for freeing such reservations.
> 
> to be honest, I doubt idea of generic reservation framework.
> 
> end up, we hope drop the application cache, not also dataless memory.
> but, automatically drop mechanism only able to drop dataless memory.
> 
> and, many application have own memory management subsystem.
> I afraid to nobody use too complex framework.

Having such notification handled by glibc to free up unused malloc (or
any heap allocations) would be very useful, because even if a program
does "free" there is no guarantee the memory is returned to the kernel.

I think that having a generic reservation framework is too complex, but
hiding the details of /dev/mem_notify from applications is desirable.
A simple wrapper (possibly part of glibc) to return the poll fd, or set
up the signal is enough.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3 freeze feature

2008-02-08 Thread Andreas Dilger
On Feb 08, 2008  19:48 +0900, Takashi Sato wrote:
> OK I would like to implement the freeze feature on VFS
> as the filesystem independent ioctl so that it can be
> available on filesystems that have already had write_super_lockfs()
> and unlockfs().
> The usage for the freeze ioctl is the following.
>  int ioctl(int fd, int FIFREEZE, long *timeval);
>fd:file descriptor of mountpoint
>FIFREEZE:request cord for freeze
>timeval:timeout period (second)
>
> And the unfreeze ioctl is the following.
>  int ioctl(int fd, int FITHAW, NULL);
>fd:file descriptor of mountpoint
>FITHAW:Request cord for unfreeze

You may as well make the common ioctl the same as the XFS version,
both by number and parameters, so that applications which already
understand the XFS ioctl will work on other filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-06 Thread Andreas Dilger
On Jan 26, 2008  08:27 +0300, Al Boldi wrote:
> Jan Kara wrote:
> > > data=ordered mode has proven reliable over the years, and it does this
> > > by ordering filedata flushes before metadata flushes.  But this
> > > sometimes causes contention in the order of a 10x slowdown for certain
> > > apps, either due to the misuse of fsync or due to inherent behaviour
> > > like db's, as well as inherent starvation issues exposed by the
> > > data=ordered mode.
> > >
> > > data=writeback mode alleviates data=order mode slowdowns, but only works
> > > per-mount and is too dangerous to run as a default mode.
> > >
> > > This RFC proposes to introduce a tunable which allows to disable fsync
> > > and changes ordered into writeback writeout on a per-process basis like
> > > this:
> > >
> > >   echo 1 > /proc/`pidof process`/softsync
> >
> >   I guess disabling fsync() was already commented on enough. Regarding
> > switching to writeback mode on per-process basis - not easily possible
> > because sometimes data is not written out by the process which stored
> > them (think of mmaped file).
> 
> Do you mean there is a locking problem?
> 
> > And in case of DB, they use direct-io
> > anyway most of the time so they don't care about journaling mode anyway.
> 
> Testing with sqlite3 and mysql4 shows that performance drastically improves 
> with writeback writeout.
> 
> >  But as Diego wrote, there is definitely some room for improvement in
> > current data=ordered mode so the difference shouldn't be as big in the
> > end.
> 
> Yes, it would be nice to get to the bottom of this starvation problem, but 
> even then, the proposed tunable remains useful for misbehaving apps.

Al, can you try a patch posted to linux-fsdevel and linux-ext4 from
Hisashi Hifumi <[EMAIL PROTECTED]> to see if this improves
your situation?  Dated Mon, 04 Feb 2008 19:15:25 +0900.

[PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting

It may be that we already have a solution in that patch for database
workloads where the pages are already allocated by avoiding the need
for ordered mode journal flushing in that case.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Andreas Dilger
On Wednesday 30 January 2008, Al Boldi wrote:
> And, a quick test of successive 1sec delayed syncs shows no hangs until
> about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
> hangs for minutes on end, and io-wait shows almost 100%.

How large is the journal in this filesystem?  You can check via
"debugfs -R 'stat <8>' /dev/XXX".  Is this affected by increasing
the journal size?  You can set the journal size via "mke2fs -J size=400" 
at format time, or on an unmounted filesystem by running
"tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".

I suspect that the stall is caused by the journal filling up, and then
waiting while the entire journal is checkpointed back to the filesystem
before the next transaction can start.

It is possible to improve this behaviour in JBD by reducing the amount
of space that is cleared if the journal becomes "full", and also doing
journal checkpointing before it becomes full.  While that may reduce
performance a small amount, it would help avoid such huge latency problems.
I believe we have such a patch in one of the Lustre branches already,
and while I'm not sure what kernel it is for the JBD code rarely changes
much

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  23:36 +0300, Al Boldi wrote:
> data=ordered mode has proven reliable over the years, and it does this by 
> ordering filedata flushes before metadata flushes.  But this sometimes 
> causes contention in the order of a 10x slowdown for certain apps, either 
> due to the misuse of fsync or due to inherent behaviour like db's, as well 
> as inherent starvation issues exposed by the data=ordered mode.
> 
> data=writeback mode alleviates data=order mode slowdowns, but only works 
> per-mount and is too dangerous to run as a default mode.
> 
> This RFC proposes to introduce a tunable which allows to disable fsync and 
> changes ordered into writeback writeout on a per-process basis like this:
> 
>   echo 1 > /proc/`pidof process`/softsync

If fsync performance is an issue for you, run the filesystem in data=journal
mode, put the journal on a separate disk and make it big enough that you
don't block on it to flush the data to the filesystem (but not so big that
it is consuming all of your RAM).

That keeps your data guarantees without hurting performance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
> Have y'all been following the /dev/mem_notify patches?
> http://article.gmane.org/gmane.linux.kernel/628653

Having the notification be via poll() is a very restrictive processing
model.  Having the notification be via a signal means that any kind of
process (and not just those that are event loop driven) can register
a callback at some arbitrary point in the code and be notified.  I
don't object to the poll() interface, but it would be good to have a
signal mechanism also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/26] mount options: fix ext4

2008-01-25 Thread Andreas Dilger
On Jan 24, 2008  20:33 +0100, Miklos Szeredi wrote:
> Add stripe= option to /proc/mounts for ext4 filesystems.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>

Acked-by: Andreas Dilger <[EMAIL PROTECTED]>

> Index: linux/fs/ext4/super.c
> ===
> --- linux.orig/fs/ext4/super.c2008-01-23 12:57:07.0 +0100
> +++ linux/fs/ext4/super.c 2008-01-23 21:43:51.0 +0100
> @@ -742,7 +742,8 @@ static int ext4_show_options(struct seq_
>   seq_puts(seq, ",nomballoc");
>   if (!test_opt(sb, DELALLOC))
>   seq_puts(seq, ",nodelalloc");
> -
> + if (sbi->s_stripe)
> + seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
>  
>   /*
>* journal mode get enabled in different ways

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Andreas Dilger
On Jan 24, 2008  18:32 +0100, Bodo Eggert wrote:
> I think a single, system-wide signal is the second-to worst solution: All
> applications (or the wrong one, if you select one) would free their caches
> and start to crawl, and either stay in this state or slowly increase their
> caches again until they get signaled again. And the signal would either
> come too early or too late. The userspace daemon could collect the weighted
> demand of memory from all applications and tell them how much to use.

Well, sending a few signals (maybe to the top 5 processes in the OOM killer
list) is still a LOT better than OOM-killing them without warning...  That
way important system processes could be taught to understand SIGDANGER and
maybe do something about it instead of being killed, and if Firefox and
other memory hungry processes flush some of their cache it is not fatal.

I wouldn't think that SIGDANGER means "free all of your cache", since the
memory usage clearly wasn't a problem a few seconds previously, so as
an application writer I'd code it as "flush the oldest 10% of my cache"
or similar, and the kernel could send SIGDANGER again (or kill the real
offender) if the memory usage again becomes an issue.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 22, 2008  14:38 +1100, David Chinner wrote:
> On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> > I discussed this with Ted at one point also.  This is a generic problem,
> > not just for readahead, because "fsck" can run multiple e2fsck in parallel
> > and in case of many large filesystems on a single node this can cause
> > memory usage problems also.
> > 
> > What I was proposing is that "fsck.{fstype}" be modified to return an
> > estimated minimum amount of memory needed, and some "desired" amount of
> > memory (i.e. readahead) to fsck the filesystem, using some parameter like
> > "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> > return the output in the expected format, or returns an error then fsck
> > will assume some amount of memory based on the device size and continue
> > as it does today.
> 
> And while fsck is running, some other program runs that uses
> memory and blows your carefully calculated paramters to smithereens?

Well, fsck has a rather restricted working environment, because it is
run before most other processes start (i.e. single-user mode).  For fsck
initiated by an admin in other runlevels the admin would need to specify
the upper limit of memory usage.  My proposal was only for the single-user
fsck at boot time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 21, 2008  23:17 -0500, [EMAIL PROTECTED] wrote:
> On Tue, 22 Jan 2008 14:38:30 +1100, David Chinner said:
> > Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
> > to a processes that aren't masking the signal followed by a short
> > grace period to allow the processes to free up some memory before
> > swapping out pages from that process?
> 
> AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
> the system was about to hit OOM, not when it was about to start swapping.

I'd tried to advocate SIGDANGER some years ago as well, but none of
the kernel maintainers were interested.  It definitely makes sense
to have some sort of mechanism like this.  At the time I first brought
it up it was in conjunction with Netscape using too much cache on some
system, but it would be just as useful for all kinds of other memory-
hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger
On Jan 16, 2008  13:30 -0800, Valerie Henson wrote:
> I have a partial solution that sort of blindly manages the buffer
> cache.  First, the user passes e2fsck a parameter saying how much
> memory is available as buffer cache.  The readahead thread reads
> things in and immediately throws them away so they are only in buffer
> cache (no double-caching).  Then readahead and e2fsck work together so
> that readahead only reads in new blocks when the main thread is done
> with earlier blocks.  The already-used blocks get kicked out of buffer
> cache to make room for the new ones.
>
> What would be nice is to take into account the current total memory
> usage of the whole fsck process and factor that in.  I don't think it
> would be hard to add to the existing cache management framework.
> Thoughts?

I discussed this with Ted at one point also.  This is a generic problem,
not just for readahead, because "fsck" can run multiple e2fsck in parallel
and in case of many large filesystems on a single node this can cause
memory usage problems also.

What I was proposing is that "fsck.{fstype}" be modified to return an
estimated minimum amount of memory needed, and some "desired" amount of
memory (i.e. readahead) to fsck the filesystem, using some parameter like
"fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
return the output in the expected format, or returns an error then fsck
will assume some amount of memory based on the device size and continue
as it does today.

If the fsck.{fstype} does understand this parameter, then fsck makes a
decision based on devices, parallelism, total RAM (less some amount to
avoid thrashing), then it can call the individual fsck commands with
"--maximum-memory MMM /dev/XXX" so each knows how much cache it can
allocate.  This parameter can also be specified by the user if running
e2fsck directly.

I haven't looked through your patch yet, but I hope to get to it soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Andreas Dilger
On Jan 15, 2008  22:05 -0500, Rik van Riel wrote:
> With a filesystem that is compartmentalized and checksums metadata,
> I believe that an online fsck is absolutely worth having.
> 
> Instead of the filesystem resorting to mounting the whole volume
> read-only on certain errors, part of the filesystem can be offlined
> while an fsck runs.  This could even be done automatically in many
> situations.

In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straight forward to store an "error" flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targetted at a specific
group.  I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-09 Thread Andreas Dilger
Andi Kleen wrote:
>> Theodore Tso <[EMAIL PROTECTED]> writes:
>> > Now, there are good reasons for doing periodic checks every N mounts
>> > and after M months.  And it has to do with PC class hardware.  (Ted's
>> > aphorism: "PC class hardware is cr*p").
>>
>> If these reasons are good ones (some skepticism here) then the correct
>> way to really handle this would be to do regular background scrubbing
>> during runtime; ideally with metadata checksums so that you can actually
>> detect all corruption.
>>
>> But since fsck is so slow and disks are so big this whole thing
>> is a ticking time bomb now. e.g. it is not uncommon to require tens
>> of minutes or even hours of fsck time and some server that reboots
>> only every few months will eat that when it happens to reboot.
>> This means you get a quite long downtime.
>
> Has there been some thought about an incremental fsck?

While an _incremental_ fsck isn't so easy for existing filesystem types,
what is pretty easy to automate is making a read-only snapshot of a
filesystem via LVM/DM and then running e2fsck against that.  The kernel
and filesystem have hooks to flush the changes from cache and make the
on-disk state consistent.

You can then set the the ext[234] superblock mount count and last check
time via tune2fs if all is well, or schedule an outage if there are
inconsistencies found.

There is a copy of this script at:
http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html

Note that it might need some tweaks to run with DM/LVM2 commands/output,
but is mostly what is needed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 29, 2007  00:48 +0100, Jörn Engel wrote:
> I didn't follow the discussion much, since it didn't appear to suit
> logfs too well.  In a nutshell, logfs is purely block-based, prepends
> every block with a header, may compress blocks and packs them as tightly
> as possible (byte alignment).

The FIEMAP interface in some ways well suited to your needs, because it
uses byte offsets instead of FIBMAP, which uses block offsets.  I now see
that one hole in the specification is that your physical extent is not the
same length as the logical extent that the data represents.

I'm not sure that is in itself a reason not to use FIEMAP.  There is already
a provision for logical extents that do not map directly to the physical
layout (i.e. FIEMAP_EXTENT_NO_DIRECT flag on the extent).  In the case of
logfs, the extent fe_length would represent the logical length of the data
that the extent covers, fe_offset is the start of the physical extent, and
FIEMAP_EXTENT_NO_DIRECT indicates that this extent cannot be directly mapped.
This is useful for applications like LILO that would otherwise assume the
physical offset can be used to access the data from the block device.

It would still provide very useful information about the layout of files on
disk for filefrag, and if tar or cp used FIEMAP they could know to read the
data from the start up to the end of an extent to avoid waiting for a seek
in the middle of the IO, and of course skipping holes during copy.  A very
smart tar might even FIEMAP a whole bunch of files and then read the extents
in physical block offset order to reduce seeking.

I don't think most applications will care about the actual physical location
of an extent, so much as the relative locations of extents within a file and
between files.

> Maybe the "MAP" part fooled me to believe FIEMAP would also expose
> physical location of extends on the medium.  But reading the proposal
> again, I am unsure about that part.  If physical locations are exposed,
> SEEK_HOLE/SEEK_DATA is significantly more elegant for logfs.  If not,
> FIEMAP could be useful.

SEEK_HOLE/SEEK_DATA only provides a fraction of the useful information
that FIEMAP does.  It won't give users or developers any information about
the on disk layout (which is quite important for knowing if allocation
algorithms are good).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 28, 2007  15:02 -0500, Josef Bacik wrote:
> +static loff_t generic_seek_hole_data(struct file *file, loff_t offset,
> +  int origin)
> +{
> + loff_t retval = -ENXIO;
> + struct inode *inode = file->f_mapping->host;
> + sector_t block, found_block;
> + sector_t last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
> + int want = (origin == SEEK_HOLE) ? 0 : 1;
> +
> + for (block = offset >> inode->i_blkbits; block <= last_block; block++) {
> + found_block = bmap(inode, block);
> +
> + if (!!found_block == want) {
> + retval = block << inode->i_blkbits;
> + break;
> + }
> + }

The problem with this is that it doesn't know anything about the filesystem,
and might spin for ages in a tight loop for a large sparse or dense file.

Almost any fs would have to implement ->seek_hole_data() for this to be
useful, at which point we might as well just report -EINVAL to the
application or glibc (since this needs to be handled anyways).

Also, what about network filesystems that don't support bmap?  They will
show up as entirely holes, since bmap() returns 0 if ->bmap is not
supported.  That could lead to serious data loss, if e.g. a filesystem
is being backed up with a SEEK_DATA-aware tool.

In light of this, it makes much more sense to only support this
functionality if the filesystem explicitly adds it, instead of pretending
that it works when there are sharp pointy things in the dark corners.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger
On Nov 28, 2007  14:56 -0800, Nicholas Miell wrote:
> I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
> 
> It abuses the seek operation to become a query operation, it requires a
> total number of system calls proportional to the number holes+data and
> it isn't general enough for other similar uses (e.g. total number of
> contiguous extents, compressed extents, offline extents, extents
> currently shared with other inodes, extents embedded in the inode
> (tails), etc.)
> 
> Something like the following would be much better:
> 
> int getfilextents(int fd, off_t offset, int type, size_t *length, struct
> extent *extents)
> 
> with
> 
> int fd: open file
> 
> offset: offset in file to start reporting extents
> 
> type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
> EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.

This is what FIEMAP is supposed to do.  We wrote a spec and implemented
a prototype for ext4, but haven't had time to make it generic to move
the large part of the code into the VFS.  If someone wanted to take that
up, it would be much appreciated.

See "[RFC] add FIEMAP ioctl to efficiently map file allocation" in
linux-fsdevel for details on this interface.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Beagle and logging inotify events

2007-11-14 Thread Andreas Dilger
On Nov 14, 2007  11:32 -0500, Chuck Lever wrote:
> I disagree: we don't need a "bullet-proof" log.  We can get a significant 
> performance improvement even with a permanent dnotify log implemented in 
> user-space.  We already have well-defined fallback behavior if such a log 
> is missing or incomplete.
>
> The problem with a permanent inotify log is that it can become unmanageably 
> enormous, and a performance problem to boot.  Recording at that level of 
> detail makes it more likely that the logger won't be able to keep up with 
> file system activity.
>
> A lightweight solution gets us most of the way there, is simple to 
> implement, and doesn't introduce many new issues.  As long as it can tell 
> us precisely where the holes are, it shouldn't be a problem.

Jan Kara is working on a patch for ext4 which would store a recursive
timestamp for each directory that gives the latest time that a file in
that directory was modified.  ZFS has a similar mechanism by virtue of
doing full-tree updates during COW of all the metadata blocks and storing
the most recent transaction number in each block.  I suspect btrfs could
do the same thing easily.

That would allow recursive-descent filesystem traversal to be much more
efficient because whole chunks of the filesystem tree can be ignored during
scans.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  17:11 -0700, Mark Fasheh wrote:
> On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
> > > Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> > > blocks.
> > 
> > Hmm, but part of the issue would be how to request the extra data, and
> > what offset it would be given?  One could, for example, use negative
> > offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
> > or similar, I hadn't given that much thought.
> 
> Well, fe_offset and fe_length are already expressed in bytes, so we could
> just put the byte offset to where the inline data starts in there. fe_length
> is just used as the length allocated for inline-data.
> 
> If fe_offset is required to be block aligned, then we could add a field to
> express an offset within the block where data would be found - say
> 'fe_data_start_offset'. In the non-inline case, we could guarantee that
> fe_data_start_offset is zero. That way software which doesn't want to care
> whether something is inline-data (for example, a backup program) or not
> could just blidly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking.  Mapping in-inode data is
just fine using the existing interface.  The byte offset of the data is
given, and the "FIEMAP_EXTENT_NO_DIRECT" flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
> On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
> > I'm a little bit confused by fe_offset. Is it a physical offset, or a
> > logical offset? The reason I ask is that your description above says "FIEMAP
> > ioctl will return the logical to physical mapping for the extent that
> > contains the specified logical byte address." Which seems to imply physical,
> > but your math to get to the next logical start in a very fragmented file,
> > implies that fe_offset is a logical offset:
> > 
> >fm_start = fm_extents[fm_extent_count - 1].fe_offset +
> >  fm_extents[fm_extent_count - 1].fe_length + 1; 
> 
> Note the distinction between "fe_offset" (which is a physical offset for
> a single extent) and "fm_offset" (which is a logical offset for that file).

Actually, that is completely bunk.  What it should say is something like:
"filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
>   Thanks for posting this. I believe that an interface such as FIEMAP
> would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

> On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
> > The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
> > ioctl block device ioctl used for mapping an individual logical block
> > address in a file to a physical block address in the block device. The
> > FIEMAP ioctl will return the logical to physical mapping for the extent
> > that contains the specified logical byte address.
> > 
> > struct fiemap_extent {
> > __u64 fe_offset;/* offset in bytes for the start of the extent */
> 
> I'm a little bit confused by fe_offset. Is it a physical offset, or a
> logical offset? The reason I ask is that your description above says "FIEMAP
> ioctl will return the logical to physical mapping for the extent that
> contains the specified logical byte address." Which seems to imply physical,
> but your math to get to the next logical start in a very fragmented file,
> implies that fe_offset is a logical offset:
> 
>fm_start = fm_extents[fm_extent_count - 1].fe_offset +
>  fm_extents[fm_extent_count - 1].fe_length + 1; 

Note the distinction between "fe_offset" (which is a physical offset for
a single extent) and "fm_offset" (which is a logical offset for that file).

> > We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
> > will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
> 
> I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

> > #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
> > #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF*/
> 
> Is "EOF" here considering "beyond i_size" or "beyond allocation"?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
EOF flag may be set on one or more earlier extents.

> > FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
> > encrypted, compressed, etc.)
> 
> Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
> Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
> blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given?  One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.  The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Andreas Dilger
On Oct 29, 2007  12:16 -0700, Mike Waychison wrote:
> Chris Mason wrote:
>> Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
>> was a way to indicate your-data-is-here-but-isn't-alone.  But that's
>> more of a feature for the FIEMAP stuff.
>
> I hadn't heard of FIEMAP, so I went back and read the thread from 
> April/May.  It seems that this is a much better approach than introducing a 
> FIBMAP64.
>
> What ever happened with this proposal?

We made a patch for ext4 that we need to update and push upstream.  I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger
By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
to linux-ext4 entitled "[PATCH] FIEMAP ioctl".   I've asked Kalpak to post
an updated version of that patch along with the changes to the "filefrag"
tool to use FIEMAP.

 FIEMAP_1.0.txt ==

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be "filefrag [options]
{filename ...}" and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

..

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
ioctl block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
__u64 fe_offset;/* offset in bytes for the start of the extent */
__u64 fe_length;/* length in bytes for the extent */
__u32 fe_flags; /* returned FIEMAP_EXTENT_* flags for the extent */
__u32 fe_lun;   /* logical device number for extent(starting at 0)*/
};



struct fiemap {
__u64 fm_start; /* logical byte offset (in/out) */
__u64 fm_length;/* logical length of map (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags (in/out) */
__u32 fm_extent_count;  /* extents in fm_extents (in/out) */
__u64 fm_unused;

struct fiemap_extent fm_extents[0];  
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file}
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from FIEMPA_FLAG_* array, if needed};

ioctl(fd, FIEMAP, &fiemap);
{verify fiemap flags are understood }

for (i = 0; i < fiemap.fm_extent_count; i++) {
{ process extent fiemap.fm_extents[i]};
}


The logic for the filefrag would be similar to above. The size of the
extent array will be extrapolated from the filesize and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x0100 to incidate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ0x0002 /* get data from HSM before map */
#define FIEMAP_

Re: batching support for transactions

2007-10-03 Thread Andreas Dilger
On Oct 03, 2007  06:42 -0400, Ric Wheeler wrote:
> >>With 2 threads writing to the same directory, we instantly drop down to 
> >>234 files/sec.
> >
> >Is this with HZ=250?
> 
> Yes - I assume that with HZ=1000 the batching would start to work again 
> since the penalty for batching would only be 1ms which would add a 0.3ms 
> overhead while waiting for some other thread to join.

This is probably the easiest solution, but at the same time using HZ=1000
adds overhead to the server because of extra interrupts, etc.

> >It would seem one of the problems is that we shouldn't really be
> >scheduling for a fixed 1 jiffie timeout, but rather only until the
> >other threads have a chance to run and join the existing transaction.
> 
> This is really very similar to the domain of the IO schedulers - when do 
> you hold off an IO and/or try to combine it.

I was thinking the same.

> >my guess would be that yield() doesn't block the first thread long enough
> >for the second one to get into the transaction (e.g. on an 2-CPU system
> >with 2 threads, yield() will likely do nothing).
> 
> Andy tried playing with yield() and it did not do well. Note this this 
> server is a dual CPU box, so your intuition is most likely correct.

How many threads did you try?

> >It makes sense to track not only the time to commit a single synchronous
> >transaction, but also the time between sync transactions to decide if
> >the initial transaction should be held to allow later ones.
> 
> Yes, that is what I was trying to suggest with the rate. Even if we are 
> relatively slow, if the IO's are being synched at a low rate, we are 
> effectively adding a potentially nasty latency for each IO.
> 
> That would give us two measurements to track per IO device - average 
> commit time and this average IO's/sec rate. That seems very doable.

Agreed.

> >Alternately, it might be possible to check if a new thread is trying to
> >start a sync handle when the previous one was also synchronous and had
> >only a single handle in it, then automatically enable the delay in that 
> >case.
> 
> I am not sure that this avoids the problem with the current defaults at 
> 250HZ where each wait is sufficient to do 3 fully independent 
> transactions ;-)

I was trying to think if there was some way to non-busy-wait that is
less than 1 jiffie.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: batching support for transactions

2007-10-03 Thread Andreas Dilger
On Oct 02, 2007  08:57 -0400, Ric Wheeler wrote:
> One thing that jumps out is that the way we currently batch synchronous 
> work loads into transactions does really horrible things to performance 
> for storage devices which have really low latency.
> 
> For example, one a mid-range clariion box, we can use a single thread to 
> write around 750 (10240 byte) files/sec to a single directory in ext3. 
> That gives us an average time around 1.3ms per file.
> 
> With 2 threads writing to the same directory, we instantly drop down to 
> 234 files/sec.

Is this with HZ=250?

> The culprit seems to be the assumptions in journal_stop() which throw in 
> a call to schedule_timeout_uninterruptible(1):
> 
> pid = current->pid;
> if (handle->h_sync && journal->j_last_sync_writer != pid) {
> journal->j_last_sync_writer = pid;
> do {
> old_handle_count = transaction->t_handle_count;
> schedule_timeout_uninterruptible(1);
> } while (old_handle_count != transaction->t_handle_count);
> }

It would seem one of the problems is that we shouldn't really be
scheduling for a fixed 1 jiffie timeout, but rather only until the
other threads have a chance to run and join the existing transaction.

> What seems to be needed here is either a static per file system/storage 
> device tunable to allow us to change this timeout (maybe with "0" 
> defaulting back to the old reiserfs trick of simply doing a yield()?)

Tunables are to be avoided if possible, since they will usually not be
set except by the .1% of people who actually understand them.  Using
yield() seems like the right thing, but Andrew Morton added this code and
my guess would be that yield() doesn't block the first thread long enough
for the second one to get into the transaction (e.g. on an 2-CPU system
with 2 threads, yield() will likely do nothing).

> or a more dynamic, per device way to keep track of the average time it 
> takes to commit a transaction to disk. Based on that rate, we could 
> dynamically adjust our logic to account for lower latency devices.

It makes sense to track not only the time to commit a single synchronous
transaction, but also the time between sync transactions to decide if
the initial transaction should be held to allow later ones.

Alternately, it might be possible to check if a new thread is trying to
start a sync handle when the previous one was also synchronous and had
only a single handle in it, then automatically enable the delay in that case.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Upgrading datastructures between different filesystem versions

2007-09-26 Thread Andreas Dilger
On Sep 25, 2007  23:40 -0600, Jim Cromie wrote:
> kernel learner wrote:
> >ext3 filesystem has 32-bit block address and ext4 filesystem has 
> >48-bit block address. If a user installs ext4, how will the file 
> >system handle already existing block with 32 bit values? 
>
> Why should it ? thats what ext3 is for.

Bzzt. Wrong answer.  The ext4 code will be able to read existing ext3
(and ext2) filesystems just fine.  Otherwise there wouldn't be much
of an upgrade path.

> Id expect ext4 drivers handling ext3 filesystems is a distant, secondary 
> goal to getting a fast, reliable, clean 48bit filesystem working.

Far from the truth.  One of the main goals of ext4 is that it is a drop-in
replacement for ext3.  The code is mostly incremental improvements over
ext3, and that IS one of the reasons that it is reliable.  We didn't throw
away 10 years of bug fixes in the ext2/ext3 code when adding the ext4
features.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/4] ext2: fix rec_len overflow for 64KB block size

2007-09-25 Thread Andreas Dilger
On Sep 25, 2007  16:30 -0700, Christoph Lameter wrote:
> [2/4]  ext2: fix rec_len overflow
>  - prevent rec_len from overflow with 64KB blocksize
> 
> Signed-off-by: Takashi Sato <[EMAIL PROTECTED]>
> Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Note that we just got a cleaner implemantation of this code on the ext4
mailing list from Jan Kara yesterday.  Please use that one instead, in
thread "Avoid rec_len overflow with 64KB block size" instead.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 4/5] VFS: allow filesystems to implement atomic open+truncate

2007-09-21 Thread Andreas Dilger
On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
> Add a new attribute flag ATTR_OPEN, with the meaning: "truncation was
> initiated by open() due to the O_TRUNC flag".
> 
> This way filesystems wanting to implement truncation within their
> ->open() method can ignore such truncate requests.

This is actually something we've needed to do in Lustre for a while also.
We called it ATTR_FROM_OPEN, but I don't really mind ATTR_OPEN either -
the less patching we need to do the better.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/5] VFS: pass open file to ->xattr()

2007-09-21 Thread Andreas Dilger
On Sep 21, 2007  16:59 +0200, Miklos Szeredi wrote:
> What I'm saying is that read and write are _no_more_ related to the
> file than fstat.  Read/write operate on inode data, fstat operates on
> inode metadata.

The read and write operations are DEFINITELY related to the file descriptor
because of f_pos.  Each process opening the same file can have a different
f_pos so read/write will work in different locations of the file.

In contrast getattr and getxattr operate on the single inode and you don't
get e.g. a different i_size or i_uid or i_gid depending on who opened a
file, nor is the xattr different.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/5] VFS: pass open file to ->xattr()

2007-09-21 Thread Andreas Dilger
On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
> @@ -1214,10 +1214,12 @@ struct inode_operations {
> + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int,
> +  struct file *);
> + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t,
> +  struct file *);
> + ssize_t (*listxattr) (struct dentry *, char *, size_t, struct file *);
> + int (*removexattr) (struct dentry *, const char *, struct file *);

Likewise - these are no longer inode operations if you need a file.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] VFS: pass open file to ->getattr()

2007-09-21 Thread Andreas Dilger
On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
> @@ -1212,7 +1212,8 @@ struct inode_operations {
> - int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
> + int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *,
> + struct file *file);

It's not much of an inode operation anymore if you need to pass a file
to it...  Since the attributes are really part of the inode and not
the file, this seems like a bit of a hack.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Andreas Dilger
On Sep 19, 2007  12:22 -0700, Mingming Cao wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
> 
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);

Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"?  We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL.  I don't think it will
inject any defect into the code, but I don't think it is needed either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Andreas Dilger
On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
> @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
>  
>  alloc_transaction:
>   if (!journal->j_running_transaction) {
> - new_transaction = kmalloc(sizeof(*new_transaction),
> - GFP_NOFS|__GFP_NOFAIL);
> + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage. Move away from char device ioctls.

2007-09-15 Thread Andreas Dilger
On Sep 15, 2007  12:20 -0400, Robin Humble wrote:
> On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote:
> >Lustre is tilted far too much towards high-priced storage,
> 
> many (most?) Lustre deployments are with SATA and md raid5 and GigE -
> can't get much cheaper than that.

I have to agree - while Lustre CAN scale up to huge servers and fat pipes,
it can definitely also scale down (which is a LOT easier to do :-).  I can
run a client + MDS + 5 OSTs in a single UML instance using loop devices
for testing w/o problems.

> interestingly, one of the ways to provide dual-attached storage behind
> a failover pair of lustre servers (apart from buying SAS) would be via
> a networked-raid-1 device like Evgeniy's, so I don't see distributed
> block devices and distributed filesystems as being mutually exclusive.

That is definitely true, and there are a number of users who run in
this mode.  We're also working to make Lustre handle the replication
internally (RAID5/6+ at the OST level) so you wouldn't need any kind of
block-level redundancy at all.  I suspect some sites may still use RAID5/6
back-ends anyways to avoid performance loss from taking out a whole OST
due to a single disk failure, but that would definitely not be required.

> >and needs improvement before it could be considered for mainline.

It's definitely true, and we are always working at improving it.  It
used to be in the past that one of the reasons we DIDN'T want to go
into mainline was because this would restrict our ability to make
network protocol changes.  Because our install base is large enough
and many of the large sites with mutliple supercomputers mounting
multiple global filesystems we aren't at liberty to change the network
protocol at will anymore.  That said, we also have network protocol
versioning that is akin to the ext3 COMPAT/INCOMPAT feature flags, so
we are able to add/change features without breaking old clients

> from what I understand (hopefully I am mistaken) they consider a merge
> task to be too daunting as the number of kernel subsystems that any
> scalable distributed filesystem touches is necessarily large.

That's partly true - Lustre has its own RDMA RPC mechanism, but it does
not need kernel patches anymore (we removed the zero-copy callback and
do this at the protocol level because there was too much resistance to it).
We are now also able to run a client filesystem that doesn't require any
kernel patches, since we've given up on trying to get the intents and
raw operations into the VFS, and have worked out other ways to improve
the performance to compensate.  Likewise with parallel directory operations.

It's a bit sad, in a way, because these are features that other filesystems
(especially network fs) could have benefitted from also.

> roadmaps indicate that parts of lustre are likely to move to userspace
> (partly to ease solaris and ZFS ports) so perhaps those performance
> critical parts that remain kernel space will be easier to merge.

This is also true - when that is done the only parts that will remain
in the kernel are the network drivers.  With some network stacks there
is even direct userspace acceleration.  We'll use RDMA and direct IO to
avoid doing any user<->kernel data copies.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage. Move away from char device ioctls.

2007-09-15 Thread Andreas Dilger
On Sep 15, 2007  16:29 +0400, Evgeniy Polyakov wrote:
> Yes, block device itself is not able to scale well, but it is the place
> for redundancy, since filesystem will just fail if underlying device
> does not work correctly and FS actually does not know about where it
> should place redundancy bits - it might happen to be the same broken 
> disk, so I created a low-level device which distribute requests itself.

I actually think there is a place for this - and improvements are
definitely welcome.  Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at
first glance, if you allow writeback caches at the clients and servers).

> When Chris Mason announced btrfs, I found that quite a few new ideas 
> are already implemented there, so I postponed project (although
> direction of the developement of the btrfs seems to move to the zfs side
> with some questionable imho points, so I think I can jump to the wagon
> of new filesystems right now). 

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have.  Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP?  If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4).  It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case).  Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] System calls for online defrag

2007-09-04 Thread Andreas Dilger
On Sep 03, 2007  20:03 +0200, Jan Kara wrote:
>   I've finally got to writing up some proposal how could look system calls
> allowing for online filesystem defragmentation and generally moving file
> blocks around for improving performance. Comments are welcome.
> 
> int sys_movedata(int datafd, int spacefd, loff_t from, size_t len)
>The call takes blocks used to carry data starting at offset @from of length
> @len in @spacefd and places them instead of corresponding blocks in @datafd.

Calling these "@spacefd" and "@datafd" is a bit confusing.  How about "@srcfd"
and "@tgtfd" instead?  For defragmentation, are you planning to have @datafd
be the "real" inode and "@spacefd" be the temporary inode with defragged data,
or the reverse?  It isn't really clear.

> Data is copied from @datafd to newly spliced data blocks. If @spacefd contains
> a hole in the specified interval, a hole is created also in @datafd in the
> corresponding place. A data block from @spacefd and also replace a hole in
> @datafd - zeros are copied to such data block. @from and @len should be
> multiples of filesystem block size (otherwise EINVAL is returned). Data blocks
> from @datafd in the interval are released, a hole is created in @spacefd.

This is mostly clear except the last sentence.  I would think that the data
blocks in @datafd are kept, getting a copy of the data, while those in
@spacefd are released?

>   Another possibility would be to just replace data blocks without any copying
> of data (that would have to be done by the caller to before calling
> sys_movedata()). The problem here is how to avoid data loss if someone writes
> to the file after userspace has copied the data and before sys_movedata() is
> called.

Isn't that true in any case?

> ssize_t sys_allocate(int fd, int mode, loff_t goal, ssize_t len)
>   Allocate new space to file @fd at offset defined by file position.  Both 
> file
> offset and @len should be a multiple of filesystem block size. The whole
> interval must not contain any allocated blocks. If the interval extends past
> EOF, the file size is changed accordingly.  @mode defines a way the filesystem
> will search for blocks. @mode is a bitwise OR of the following flags:
>   ALLOC_FIXED_START - allocation must start at @goal; if not specified, @goal
> is just a hint where to start an allocation
>   ALLOC_FIXED_LEN - allocate exactly space for @len; if not specified, upto
> @len bytes may be allocated.
>   ALLOC_CONTINGUOUS - allocation must be one continguous run of blocks

How is this much different than sys_fallocate()?

> int sys_get_free_blocks(const char *fs, loff_t start, loff_t end, int count,
>   struct alloc_extent *space)

One alternate possibility is to call the proposed FIEMAP on the block device,
to return lists of free/used extents?  We have a version of that patch for
ext4 and integration into filefrag, so it would be nice to avoid making up
yet another API/tool if that one is sufficient.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 12/26] ext2 white-out support

2007-07-31 Thread Andreas Dilger
On Jul 31, 2007  09:44 +0200, Jan Blunck wrote:
> Ok, this is pretty similar to the way I implemented this for tmpfs. The
> problem is that the union mount code is explicitly checking if the filesystem
> is supporting whiteout. I used to use a new filesystem flag (FS_WHITEOUT) for
> this but thought that disk filesystem like ext2/3/4 will have problem with
> that if you mount an old image. So I guess I still need a feature flag.

You also need whiteout support for extents.  This could be done with
unwritten extents potentially, or as I previously proposed (RFC) in
linux-ext4.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Andreas Dilger
On Jul 28, 2007  20:51 +0100, Christoph Hellwig wrote:
> That doesn't mean I want to arge against Alex's code although I'd of
> course be more happy if we could actually shared code between multiple
> filesystems.
> 
> Of ourse the code in it's current form should not go into mpage.c but
> rather into ext4 so that it doesn't bloat the kernel for everyone.

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
was rejected because "that functionality should go into the VFS".
Since the performance improvement of delalloc is quite large, we'd
like to get this into the kernel one way or another.  Can we make a
decision if the ext4-specific delalloc is acceptable?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5][TAKE8] manpage for fallocate

2007-07-19 Thread Andreas Dilger
On Jul 18, 2007  20:41 -0700, Mark Fasheh wrote:
> On Sat, Jul 14, 2007 at 12:16:25AM +0530, Amit K. Arora wrote:
> > After a successful call, subsequent writes are guaranteed not to fail
> > because of lack of disk space.
>   
> If a write to an unwritten region requires a node split, that could result
> in the allocation of new meta data which obviously could fail if the disk is
> truly full.
> 
> Granted that's unlikely to happen but maybe we should be conservative and
> say something like:
> 
> "After a successful call, subsequent writes are guaranteed to never require
> allocation of file data." ?
> --Mark

In the worst case, the unwritten extent could be zero-filled before the write
is done, so no exent split is needed.  We discussed this recently for the
ext4 fallocate, but didn't consider it important enough to hold the code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-16 Thread Andreas Dilger
On Jul 16, 2007  16:52 -0700, Mingming Cao wrote:
> I am not sure why we need GFP_KERNEL flag here. I think we should use
> GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as
> fixing memory leak issue introduced by the ext4 expand inode extra isize
> patch.
> 
> Fixing memory allocation issue with expand inode extra isize patch.
> 
> - use GFP_NOFS instead of GFP_KERNEL flag for memory allocation
> - use kzalloc instead of kmalloc

This doesn't need kzalloc() for buffer and b_entry_name, since they are
immediately overwritten by memcpy().

> - fix memory leak in the success case, at the end of while loop.
>   goto cleanup;
> @@ -1302,7 +1302,15 @@ retry:
>   error = ext4_xattr_block_set(handle, inode, &i, bs);
>   if (error)
>   goto cleanup;
> + kfree(b_entry_name);
> + kfree(buffer);
> + brelse(is->iloc.bh);
> + kfree(is);
> + kfree(bs);
> + brelse(bh);
>   }
> + up_write(&EXT4_I(inode)->xattr_sem);
> +return 0;
>  
>  cleanup:
>   kfree(b_entry_name);

I don't think you should have brelse(bh) inside the loop, since it is
allocated before the loop starts.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andreas Dilger
On Jul 13, 2007  02:05 -0700, Andrew Morton wrote:
> On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
> > > + brelse(bh);
> > > + up_write(&EXT4_I(inode)->xattr_sem);
> > > + return error;
> > > +}
> > > +
> > 
> > We're doing GFP_KERNEL memory allocations while holding xattr_sem.  This
> > can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
> > i_truncate_sem and/or journal_start() (I forget whether this still
> > happens).  Have we checked whether this can occur and if so, whether we are
> > OK from a lock ranking POV?  Bear in mind that journalled-data mode is more
> > complex in this regard.
> 
> I notice that everyone carefully avoided addressing this ;)
> 
> Oh well, hopefully people are testing with lockdep enabled.  As long
> as the fs is put under extreme memory pressure, most bugs should be reported.

I have no objection to changing these to GFP_NOFS or GFP_ATOMIC, because
the number of times this function is called is really quite small (only
for existing inodes when the size of the fixed fields in the inode is
increasing) and the buffers are freed immediately so this won't put any
undue strain on the atomic memory pools.

That said, there is also a GFP_KERNEL allocations in ext3_xattr_block_set()
under xattr_sem, so the same problem would exist there.

I also just noticed that "buffer" and "b_entry_name" are leaked in
ext4_expand_extra_isize() if the while loop is run more than one time
(again a relatively rare event).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andreas Dilger
On Jul 13, 2007  15:33 +0200, Peter Zijlstra wrote:
> On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
> Or can journal_stop() be done by a different task than the one that did
> journal_start()? - in which case nothing much can be done :-/

The call to journal_stop() has to be in the same process, since the
journal handle is also held in current->journal_info so the handle
does not need to be passed as an argument all over the VFS.

> This seems to boot... albeit I did not push it hard.

Can you please also make a patch for jbd2.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Andreas Dilger
On Jul 12, 2007  13:56 +0530, Amit K. Arora wrote:
> As you suggest, let us just have two modes for the time being:
> 
> #define FALLOC_ALLOCATE   0x1
> #define FALLOC_ALLOCATE_KEEP_SIZE 0x2
> 
> As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
> will result in file size not being changed even if the preallocation is
> beyond EOF.

What does FALLOC_ALLOCATE mean vs. not passing this flag?  I have no
objection to this as long as the code remains with these as "flags"
instead of "modes"...  Essentially just dropping the FALLOC_FL_DEALLOCATE
and FALLOC_FL_DEL_DATA from the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  16:04 -0400, J. Bruce Fields wrote:
> A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
> That's not a problem in itself--the problem would only arise if two
> subsequent client queries of the change attribute happened a multiple of
> 2^32 i_version increments apart.
> 
> This is more likely than the previous scenario, but still very unlikely.
> I would have guessed that even in situations with a very high rate of
> updates and a low rate of client revalidations, the chance of two
> revalidations happening exactly 2^32 updates apart would still be no
> more than 1 in 2^32.  (Could odd characteristics of the workloads (like
> updates that tend to happen in power-of-2 groups?) make it any more
> likely?)
> 
> I'd be happier if ext4 at least allowed the possibility of 64 bits in
> the future.  And there's always the chance someone would find a use for
> an i_version that was nondecreasing, even if nfs didn't care.

This would indeed be the case for ext3 filesystems updated to ext4.
They will only be able to store the low 32 bits of the version, which
is itself normally enough for NFSv4 because it only uses the inequality
check.  Having the full 64 bits available eliminates the risk of
collisions, and given that the spec mandates a 64-bit version I'm sure
someone will take full advantage of it in NFS at some point.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  17:16 +0530, Girish Shilamkar wrote:
> > > + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
> > > + jbd2_journal_set_features(sbi->s_journal,
> > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
> > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> > > + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
> > > + jbd2_journal_set_features(sbi->s_journal,
> > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
> > > + jbd2_journal_clear_features(sbi->s_journal, 0, 0,
> > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> > > + } else {
> > > + jbd2_journal_clear_features(sbi->s_journal,
> > > + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
> > > + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
> > > + }
> > 
> > Some discussion of the forward- and backward- compatibility design would be
> > appropriate.  Also a description of whether and how fsck can be used to fix
> > up these feature flags.

It is forward & backward compatible to enable COMPAT_CHECKSUM.  That just
means the commit blocks will have checksums in them, but older kernels
will just ignore them.  Hmm, I suppose there might be an issue with "upgrade,
downgrade, upgrade" in that the commit blocks would not have checksums
even though the superblock says they will...

Does that mean we should accept a checksum == 0 as being valid (which
is not very nice, given that 0 is an oft-hit bad value), or that we need
a flag in every commit block which indicates if it actually has a checksum?

The INCOMPAT_ASYNC_COMMIT can't be handled safely by older kernels, since
they would assume "commit block == complete transaction", which isn't
true if the commit block didn't wait for the rest of the blocks to make
it to the disk.

I don't think e2fsck can be used to individually clean up the feature flags,
but it is always possible to remove and recreate the journal...

> > > - /* AKPM: buglet - add `i' to tmp! */
> > 
> > Damn.  After, what, seven years, someone actually fixed it?
> > 
> > >   for (i = 0; i < bh->b_size; i += 512) {
> > > - journal_header_t *tmp = (journal_header_t*)bh->b_data;
> > > + struct commit_header *tmp =
> > > + (struct commit_header *)(bh->b_data + i);
> > >   tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> > >   tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
> > >   tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
> > > +
> > > + if (JBD2_HAS_COMPAT_FEATURE(journal,
> > > + JBD2_FEATURE_COMPAT_CHECKSUM)) {
> > > + tmp->h_chksum_type  = JBD2_CRC32_CHKSUM;
> > > + tmp->h_chksum_size  = JBD2_CRC32_CHKSUM_SIZE;
> > > + tmp->h_chksum[0]= cpu_to_be32(crc32_sum);
> > > + }
> > >   }
> > 
> > And in doing so, changed the on-disk format of the journal commit blocks.
> > 
> > Surely this was worth a mention in the changelog, if not a standalone patch?
> > 
> > I don't think this is worth doing, really.  Why not just leave the format
> > as it was, take the loop out and run this code once rather than eight
> > times?

Well, we aren't using the rest of the commit block in any case.  I think
the original intention was that we'd get 8 copies of the commit block so
we would be sure to get a good one.

I don't know whether we'd rather have 8 copies of the commit block, or
more potential to expand the commit block?  I don't personally have any
preference, since the checksum should be a more robust way of checking
validity than having multiple copies, so we may as well remove the loop
and stick with a single copy for now.

> > > @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
> > >   unsigned intsequence;
> > >   int blocktype;
> > >   int tag_bytes = journal_tag_bytes(journal);
> > > + __u32   crc32_sum = ~0; /* Transactional Checksums */
> > 
> > We normally use __u32 for visible-to-userspace stuff.  Kernel code would
> > use plain old u32.
> Ok.

Since the checksum is saved to disk, it seems more appropriate to use __u32
or maybe even __be32, though I'm not sure if the crc32 functions do that
correctly or not.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  22:40 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao <[EMAIL PROTECTED]> wrote:
> > A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
> > the subdir count for any directory crosses 65000.
> 
> Would I be correct in assuming that a later fsck will clear
> EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any >65000 subdir
> directories?

Correct.

> If so, that is worth a mention in the changelog, perhaps?
> 
> Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
> old ext4, ext3 and ext2 can only mount this fs read-only, yes?

Also correct.  The COMPAT flag behaviour is described in detail in
Documentation/filesystems/ext[234].txt

> > +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
> > +{
> > +   inc_nlink(inode);
> > +   if (is_dx(inode) && inode->i_nlink > 1) {
> > +   /* limit is 16-bit i_links_count */
> > +   if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
> > +   inode->i_nlink = 1;
> > +   EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
> > +   }
> > +   }
> > +}
> 
> Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

Because that means it was previously 1 (inc_nlink() was already called).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andreas Dilger
> > +   "inode %lu: bad block %llu", inode->i_ino,
> > +   EXT4_I(inode)->i_file_acl);
> > +   error = -EIO;
> > +   goto cleanup;
> > +   }
> > +   base = BHDR(bh);
> > +   first = BFIRST(bh);
> > +   end = bh->b_data + bh->b_size;
> > +   min_offs = end - base;
> > +   free = ext4_xattr_free_space(first, &min_offs, base,
> > +&total_blk);
> > +   if (free < new_extra_isize) {
> > +   if (!tried_min_extra_isize && s_min_extra_isize) {
> > +   tried_min_extra_isize++;
> > +   new_extra_isize = s_min_extra_isize;
> 
> Aren't we missing a brelse(bh) here?

Seems likely, yes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:31 -0700, Andrew Morton wrote:
> On Sun, 01 Jul 2007 03:37:53 -0400
> Mingming Cao <[EMAIL PROTECTED]> wrote:
> > Add a "noversion" mount option to disable inode version updates.
> 
> Why is this option being offered to our users?  To reduce disk traffic,
> like noatime?
> 
> If so, what are the implications of this?  What would the user lose?

Ah, this is the patch to disable i_version updates for Lustre.  I don't
think any normal user would use this mount option, so I don't know if
there is a need to document it.  There are no performance implications,
unless we end up changing the mtime granularity JUST to update i_version,
in which case we can avoid some overhead if not exporting with NFSv4.

If we want to go in the direction of forcing extra inode updates just
for this, then we might even consider making i_version updates on disk
default to OFF unless NFSv4 has exported the filesystem at least once,
and then it should set a persistent flag in the superblock indicating
that i_version updates are needed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Andreas Dilger
On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
> On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
> > This patch is on top of i_version_update_vfs.
> > The i_version field of the inode is set on inode creation and incremented
> > when the inode is being modified.
> 
> Which is not what i_version is supposed to do.  It'll get you tons of misses
> for NFSv3 filehandles that rely on the generation staying the same for the
> same file.  Please add a new field for the NFSv4 sequence counter instead
> of making i_version unuseable.

You are confusing i_generation (the instance of this inode number) with
i_version (whether this file has been modified)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:31 -0700, Andrew Morton wrote:
> > This patch adds 64-bit inode version support to ext4. The lower 32 bits
> > are stored in the osd1.linux1.l_i_version field while the high 32 bits
> > are stored in the i_version_hi field newly created in the ext4_inode.
> 
> So reading the code here does serve to answer the question I raised against
> the earlier patch.  A bit.
> 
> I'd have thought that this patch and the one which adds i_version_hi should
> be folded into a single diff?

It could be - the original patch to reserve i_version_hi was submitted to
before the patches were ready to avoid that space being used by something
else.

> > +   if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
> > +   if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
> > +   inode->i_version |=
> > +   (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
> 
> 
> 
> 
> 
> > +   }
> 
> I don't quite see how the above two tests are sufficient to unambiguously
> determine that the i_version_hi field is present on-disk.
> 
> I guess we're implicitly assuming that if the on-disk inode is big enough
> then it _must_ have i_version_hi in there?  If so, why is the comparison
> with EXT4_GOOD_OLD_INODE_SIZE needed?

The GOOT_OLD_INODE_SIZE check is needed to know if i_extra_isize is even
present or valid in the on-disk inode.

> > @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t
> > +   raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
> > +   if (ei->i_extra_isize) {
> > +   if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {
> 
> There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here...

Because this is the in-memory version and it is always valid (set to zero
if there is extra space in the on-disk inode).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  16:30 -0700, Andrew Morton wrote:
> > Signed-off-by: Mingming Cao <[EMAIL PROTECTED]>
> > Signed-off-by: Andreas Dilger <[EMAIL PROTECTED]>
> > Signed-off-by: Kalpak Shah <[EMAIL PROTECTED]>
> > ---
> > Index: linux-2.6.21/include/linux/ext4_fs.h
> > ===
> > --- linux-2.6.21.orig/include/linux/ext4_fs.h
> > +++ linux-2.6.21/include/linux/ext4_fs.h
> > @@ -342,6 +342,7 @@ struct ext4_inode {
> > __le32  i_atime_extra;  /* extra Access time  (nsec << 2 | epoch) */
> > __le32  i_crtime;   /* File Creation time */
> > __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> > +   __le32  i_version_hi;   /* high 32 bits for 64-bit version */
> >  };
> 
> Aren't there forward- backward-compatibility issues here?  How does the
> filesystem driver work out whether this field is present and valid?

This uses the same EXT4_FITS_IN_INODE() check as any other, so the
compatibility issues are handled.  NFSv4 could live with 32-bit versions
with only a small danger of overflow, so we can still export ext3
filesystems with 128-byte inodes that have been updated to ext4.  For
Lustre (which requires 64-bit versions), we will enforce that space is
available with s_min_extra_isize and RO_COMPAT_EXTRA_ISIZE.

In the case where an older ext3/ext4 filesystem with large inodes does
not have enough space for i_version_hi the EAs that follow i_extra_isize
will be shifted to make room for it (if possible, which is likely).  There
are no critical fields inside i_extra_isize so in the rare case of a
failure to enlarge the i_extra_isize is not a cause for alarm.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  23:34 -0400, Trond Myklebust wrote:
> On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
> > So my vote is to increment i_version in common code every time any
> > change is made to the file, and alloc_inode should initialise it to
> > current time, which might be changed by the filesystem before it calls
> > unlock_new_inode. 
> > ... but doesn't lustre want to control its i_version... so maybe not :-(
> 
> If lustre wants to be exportable via pNFS (as Peter Braam has suggested
> it should), then it had better be able to return a change attribute that
> is compatible with the NFSv4.1 spec...

The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre
version can be used to compare the updates of two inodes.  It is set
to be the Lustre transaction number (which is sequentially incremented
on a per filesystem basis), so that we can use the per-inode version
to do replay of client operations even if they have been disconnected for
a long time, which is why we want to be able to control the actual value.
We don't want the version to be updated for e.g. file defragmentation
or other similar internal-only changes which need ext4_mark_inode_dirty().

We had a patch to disable ext4 inode versioning by a flag the superblock,
but we dropped it at the last minute because it needed some updates and
we didn't want to wait on that for submitting these changes upstream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger
gt; comment over EXT4_FITS_IN_INODE() would be a suitable place.

Hmm, I'm sure there were emails on the topic, but they aren't attached to
the patch.  s_want_extra_isize is just an override for sizeof(ext4_inode)
in case the sysadmin wants to reserve more fields in new inodes.  There is
also s_min_extra_isize which is what the kernel and e2fsck guarantee that
will be available in all in-use inodes, if RO_COMPAT_EXTRA_ISIZE is set
(ro-compat so that older kernels can't create inodes with a smaller
extra_isize).  That feature is only enabled if requested by the sysadmin.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger
On Jul 10, 2007  22:00 -0400, Mingming Cao wrote:
> On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
> > On Sun, 01 Jul 2007 03:36:56 -0400
> > Mingming Cao <[EMAIL PROTECTED]> wrote:
> > 
> > > This patch is a spinoff of the old nanosecond patches.
> > 
> > I don't know what the "old nanosecond patches" are.  A link to a suitable
> > changlog for those patches would do in a pinch.  Preferable would be to
> > write a proper changelog for this patch.
> > 
> I found the original patch
> http://marc.info/?l=linux-ext4&m=115091699809181&w=2
> 
> Andreas or Kalpak, is changelog from the original patch is accurate to
> apply here?

Mostly, yes, but the name of the feature flag has changed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-06 Thread Andreas Dilger
On Jul 06, 2007  09:51 -0400, J. Bruce Fields wrote:
> The use of a mount option means the change attribute could be
> inconsistent across mounts.  If we really need this, wouldn't it make
> more sense for it to be a persistent feature of the filesystem, set at
> mkfs time?

Yes, having it stored into the superblock in s_flags is probably a good
idea.  Kalpak, do you think you could get a patch that adds e.g.
EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Andreas Dilger
On Jul 05, 2007  12:41 -0400, Mike Frysinger wrote:
> On Wednesday 04 July 2007, Christoph Hellwig wrote:
> > Sorry, but it's really annoying to pull in a filesystem-specific devel
> > package for that.  Having a library is fine, but please move the library
> > into util-linux so it's always available without another dependency.
> 
> ugh, moving libraries which are already actively maintained by other core 
> projects into util-linux is so not a good idea (ignoring the fact that it'd 
> easily be a pita/waste for distro maintainers)

Some distros (Debian and SuSE I think) split the e2fsprogs libraries
into separate packages so that you are not depending on "e2fsprogs",
but rather "libuuid" and/or "libblkid".

> > That way xfsprogs could for example drop it's own detection library aswell.
> 
> i dont really think this is dependent on util-linux at all.  nothing is 
> stopping xfsprogs from depending on udev or e2fsprogs now.

In fact, Eric Sandeen and I discussed splitting the xfsprogs "libdisk"
(or similar, it detects RAID geometry for DM/MD/etc) into a standalone
library so that e2fsprogs could use it.  The only issue is the increased
maintenance and packaging of separate libraries.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-04 Thread Andreas Dilger
On Jul 04, 2007  12:06 +0530, Aneesh Kumar K.V wrote:
> Mingming Cao wrote:
> >On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote:
> >>On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
> >>>+
> >>>+#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)  \
> >>>+do {   \
> >>>+  (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);
> >>>\
> >>>+  if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra))  
> >>>\
> >>>+  ext4_decode_extra_time(&(inode)->xtime, 
> >>>\
> >>>+ raw_inode->xtime ## _extra); 
> >>>\
> >>>+} while (0)
> >>>+
> >>>+#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)\
> >>>+do {   \
> >>>+  if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))   
> >>>\
> >>>+  (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);   
> >>>\
> >>>+  if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra)) 
> >>>\
> >>>+  ext4_decode_extra_time(&(einode)->xtime,
> >>>\
> >>>+ raw_inode->xtime ## _extra); 
> >>>\
> >>>+} while (0)
> >>>+
> >>This nanosecond patch seems to be missing the fix below which is 
> >>required for http://bugzilla.kernel.org/show_bug.cgi?id=5079
> >>
> >>If the timestamp is set to before epoch i.e. a negative timestamp then 
> >>the file may have its date set into the future on 64-bit systems. So 
> >>when the timestamp is read it must be cast as signed.
> >
> >Missed this one.
> >Thanks. Will update ext4 patch queue tonight with this fix.
> 
> IIRC in the conference call it was decided to not to apply this patch. 
> Andreas may be able to update better.

I wasn't on the most recent concall, and I've forgotten the details of
any discussion on a previous concall.

Care really needs to be taken here that negative timestamps are handled
properly.  We can take the sign bit from the inode i_*time, but then we
need to change the load/save of the extra time to use a shift of 31
instead of 32.  If we overflow the epoch we have to ensure that the high
bits of the seconds is handled correctly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-03 Thread Andreas Dilger
On Jul 03, 2007  18:15 -0400, J. Bruce Fields wrote:
> How will nfsd tell whether it can really on a given filesystem's
> i_version, or whether it should fall back on ctime?

Good question.

> > As to performance concerns that raise before the inode version counter
> > (at least for ext4) is done inside ext4_mark_inode_dirty), so there is
> > no extra IO work to store this counter to disk.
> 
> So what's the motivation for the "noversion" mount option?

Lustre needs to be able to control the version number directly (version
number needs to be ordered between all inodes, is set by Lustre to be a
transaction number).  Instead of trying to incorporate this unused code
into ext4 we just turn off the ext4 version code and let Lustre control
this directly.  It may even be that NFSv4 will need to control the version
numbers itself...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-03 Thread Andreas Dilger
On Jul 03, 2007  10:24 -0400, Trond Myklebust wrote:
> It looks OK to me, but you might want to strip out the now redundant
> i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

Agreed, and I thought we discussed that already on the ext4 list.

> I also have some questions about how this will affect the readdir code:
> unless I missed something, the filp->f_version is still unsigned long,
> so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
> no longer make sense.

I don't see them as any worse than existing checks.  For 32-bit systems
we only ever had a 32-bit in-memory version anyway so using only the
low 32 bits of i_version in f_version is no more racy than in the past.
For 64-bit systems using the full on-disk i_version is possible.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-03 Thread Andreas Dilger
On Jul 03, 2007  12:19 +0530, Aneesh Kumar K.V wrote:
> Mingming Cao wrote:
> >Index: linux-2.6.22-rc4/fs/ext4/super.c
> >===
> >--- linux-2.6.22-rc4.orig/fs/ext4/super.c2007-06-13 
> >17:19:11.0 -0700
> >+++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-13 17:24:45.0 -0700
> >@@ -2846,8 +2846,8 @@ out:
> > i_size_write(inode, off+len-towrite);
> > EXT4_I(inode)->i_disksize = inode->i_size;
> > }
> >-inode->i_version++;
> > inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> >+inode->i_version = 1;
> > ext4_mark_inode_dirty(handle, inode);
> > mutex_unlock(&inode->i_mutex);
> > return len - towrite;
> 
> 
> Is this correct ? . Why do we set the qutoa file inodes version to 1  
> during write ?

Hmm, I thought we had previously fixed this?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-01 Thread Andreas Dilger
On Jun 30, 2007  11:21 +0100, Christoph Hellwig wrote:
> On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
> 
> I can't find anything in the specification of posix_fallocate
> (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> that tells what should happen to allocate blocks on error.
> 
> But common sense would be to not leak disk space on failure of this
> syscall, and this definitively should not be left up to the filesystem,
> either we always leak it or always free it, and I'd strongly favour
> the latter variant.

I definitely agree that the behaviour should be specified part of
the interface.  The current behaviour of both ext4 and XFS is that the
successful part of the unallocated extent is left in place when returning
ENOSPC so we considered this the "consistent" behaviour.  This is the same
as e.g. sys_write() which does not remove the part of the write that was
successful if ENOSPC is hit.  I think this also makes sense for some usa
cases, because application like PVR may want to preallocate approximately
30min of space, but if it gets only 25min worth then it can at least start
using this while it also begins looking for and/or freeing old files.

If the space is always freed on ENOSPC, then there may be a significant
amount of work done and undone while the application is iterating over
possible sizes until one works.   It is easy for the application to
use fstat() to see the blocks/size actually preallocated on failure, and
explicitly request unallocation of this space if the outcome is undesirable.

If you think that applications have a strong preference for both kinds
of behaviour (e.g. database which requires the full allocation to succeed,
unlike PVR application above) then this could be encoded into a @mode flag.

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > > sure if this makes sense at all.
> 
> This is the xfs unwritten extent behaviour.  But anyway, the important bit
> is uninitialized blocks should never ever leak to userspace, so there is
> not need for the flag.

I agree that we shouldn't need FA_ZERO_SPACE.  If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
or similar.

> The more I think about it the more I'd prefer we would just put a simple
> syscall in that implements nothing but the posix_fallocate(3) semantics
> as defined in SuS, and then go on to brainstorm about advanced
> preallocation / layout hint semantics.

I don't think the current @mode flags introduce any significant complexity
in the implementation, and in fact one of the reasons these came up in the
first place was because David pointed out the XFS behaviour did NOT match
with posix_fallocate() and we started getting strange semantics enforced
by monolithic modes.  IMHO, coding for and understanding the semantics of
the monolithic modes is much more complex and less useful than the explicit
flags.

The @mode flags that are currently under consideration are (AFAIK):

FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */

Your concern about leaking space would imply:

FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */

The other possible flags that were proposed, to avoid confusing backup and
HSM applications when preallocated space is added or removed from a file
(you don't want a backup app to re-backup a file that was migrated via HSM):

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andreas Dilger
On Jun 28, 2007  23:27 +0530, Amit K. Arora wrote:
> On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
> > Are we all supposed to re-review the entire patchset (or at least #4 and
> > #7) again?
> 
> As I mentioned in the note above, only patches #4 and #7 were new and
> thus these needed to be reviewed. Other patches are _not_ replacements
> of any of the patches which are already part of -mm and/or in Ted's
> patch queue. They were posted again as just "placeholders" so that the
> two new patches (#4 & #7) could be reviewed. Sorry for any confusion.

The new patches are definitely a big improvement over the previous API,
and need to go in before fallocate() goes into mainline.  This last set
of changes allows the behaviour of these syscalls to accomodate the various
different semantics desired by XFS in a sensible manner instead of tying
all of the individual behaviours (time update, size update, alloc/free, etc)
into monolithic modes that will never make everyone happy.

My understanding is that you only need to grab #4 and #7 to get your tree
into get fallocate in sync with the ext4 patch queue (i.e. they are
incremental over the previous set).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [AppArmor 00/44] AppArmor security module overview

2007-06-27 Thread Andreas Dilger
Any chance you can remove linux-fsdevel from the CC list?  I don't think this
has anything to do with filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger
On Jun 27, 2007  09:14 +1000, David Chinner wrote:
> Someone on the XFs list had an interesting request - preallocated
> swap files. You can't use unwritten extents for this because
> of sys_swapon()s use of bmap() (XFS returns holes for reading
> unwritten extents), so we need a method of preallocating that does
> not zero or mark the extent unread. i.e. FA_MKSWAP.

Is there a reason why unwritten extents return 0 to bmap()?  This
would seem to be the only impediment from using fallocated files
for swap files.  Maybe if FIEMAP was used by mkswap to get an
"UNWRITTEN" flag back instead of "HOLE" it wouldn't be a problem.

> That way we can allocate large swap files that don't need zeroing
> in a single, fast operation, and hence potentially bring new
> swap space online without needed very much memory at all (i.e.
> should succeed in most near-OOM conditions).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patent or not patent a new idea

2007-06-26 Thread Andreas Dilger
On Jun 26, 2007  09:53 -0700, Bryan Henderson wrote:
> >md/raid already works happily with different sized drives from
> >different manufacturers ...
> 
> >So I still cannot see anything particularly new.
> 
> As compared to md of conventional disk partitions, it brings the ability 
> to create and delete arrays without shutting down all use of the physical 
> disks (to update the partition tables).  (LVM gives you that too).  It 
> also makes managing space much easier because the component devices don't 
> have to be carved from contiguous space on the physical disks.
> 
> Neither of those benefits is specific to RAID, but you could probably say 
> that RAID multiplies the problems they address.

That's one of the reasons why I liked AIX doing RAID1 on top of LVM.  It
was possible to safely migrate LEs across disks by virtue of creating a
RAID 1 mirror for that LE and then deleting the old copy.  It wouldn't
be too hard to extend the same notion to RAID 5 or RAID 6 on LVM.  You
could even change the RAID level and number of constituent LEs in a RAID
set on an ad-hoc basis because the LVM mappings would allow this to be
efficient.

Unfortunately, Linux still has distinct DM and MD layers and there doesn't
seem to be any work to combine the two into a more powerful single layer.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-26 Thread Andreas Dilger
On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
> Hmm.. I am thinking of a scenario when the file system supports some
> individual flags, but does not support a particular combination of them.
> Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
> file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
> FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
> for some reason). This means that although we support FA_FL_DEALLOC,
> FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
> combination of all these flags (which is nothing but FA_UNRESV_SPACE).

That is up to the filesystem to determine then.  I just thought it should
be clear to return an error for flags (or as you say combinations thereof)
that the filesystem doesn't understand.

That said, I'd think in most cases the flags are orthogonal, so if you
support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
then you will also support other combinations of those flags just from
the way it is coded.

> > I also thought another proposed flag was to determine whether mtime (and
> > maybe ctime) is changed when doing prealloc/dealloc space?  Default should
> > probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
> > should decide if we want to allow changing the file w/o changing ctime, if
> > that is required even though the file is not visibly changing.  Maybe the
> > ctime update should be implicit if the size or mtime are changing?
> 
> Is it really required ? I mean, why should we allow users not to update
> ctime/mtime even if the file metadata/data gets updated ? It sounds
> a bit "unnatural" to me.
> Is there any application scenario in your mind, when you suggest of
> giving this flexibility to userspace ?

One reason is that XFS does NOT update the mtime/ctime when doing the
XFS_IOC_* allocation ioctls.

> I think, modifying ctime/mtime should be dependent on the other flags.
> E.g., if we do not zero out data blocks on allocation/deallocation,
> update only ctime. Otherwise, update ctime and mtime both.

I'm only being the advocate for requirements David Chinner has put
forward due to existing behaviour in XFS.  This is one of the reasons
why I think the "flags" mechanism we now have - we can encode the
various different behaviours in any way we want and leave it to the
caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger
On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
> > In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
> > For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
> > each extent.  For some workloads this would be much faster than truncate
> > and reallocate of all the blocks in a file.
> 
> In ext4, we already mark each extent having preallocated blocks as
> uninitialized. This is done as part of following code (which is part of
> patch 5/7) in ext4_ext_get_blocks() :  

What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.

So, if you had a sparse file with some data in it:

 A BB

fallocate() would allocate the holes:

0A0BB

XFS_IOC_ALLOCSP would overwrite everything:

0

In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation).  This is farily easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger
On Jun 26, 2007  16:02 +0530, Amit K. Arora wrote:
> On Mon, Jun 25, 2007 at 03:46:26PM -0600, Andreas Dilger wrote:
> > Can you clarify - what is the current behaviour when ENOSPC (or some other
> > error) is hit?  Does it keep the current fallocate() or does it free it?
> 
> Currently it is left on the file system implementation. In ext4, we do
> not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> end up with partial (pre)allocation. This is inline with dd and
> posix_fallocate, which also do not free the partially allocated space.

Since I believe the XFS allocation ioctls do it the opposite way (free
preallocated space on error) this should be encoded into the flags.
Having it "filesystem dependent" just means that nobody will be happy.

> > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > don't want to expose uninitialized disk blocks to userspace.  I'm not
> > sure if this makes sense at all.
> 
> I don't think we need to make it default - atleast for filesystems which
> have a mechanism to distinguish preallocated blocks from "regular" ones.

What I mean is that any data read from the file should have the "appearance"
of being zeroed (whether zeroes are actually written to disk or not).  What
I _think_ David is proposing is to allow fallocate() to return without
marking the blocks even "uninitialized" and subsequent reads would return
the old data from the disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vm/fs meetup in september?

2007-06-25 Thread Andreas Dilger
On Jun 26, 2007  12:35 +1000, Nick Piggin wrote:
> Leaving my opinion of higher order pagecache aside, this _may_ be an
> example of something that doesn't need a lot of attention, because it
> should be fairly uncontroversial from a filesystem's POV? (eg. it is
> more a relevant item to memory management and possibly block layer).
> OTOH if it is discussed in the context of "large blocks in the buffer
> layer is crap because we can do it with higher order pagecache", then
> that might be interesting :)

FWIW, being able to have large (8-64kB) blocksize would be great for
ext2/3/4.  We'd sort of been betting on this by limiting the on-disk
extent format to 48-bit physical block numbers, and to have 2 patches
to implement this in as many weeks is excellent.

To me the mechanism doesn't matter, whether through fsblock or high-order
PAGE_SIZE.  I'll let the rest of you duke it out as long as at least one
of them makes it into the kernel.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-25 Thread Andreas Dilger
On Jun 25, 2007  19:20 +0530, Amit K. Arora wrote:
> @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
>* currently supporting (pre)allocate mode for extent-based
>* files _only_
>*/
> - if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
> + !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
>   return -EOPNOTSUPP;

This should probably just check for the individual flags it can support
(e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).

I also thought another proposed flag was to determine whether mtime (and
maybe ctime) is changed when doing prealloc/dealloc space?  Default should
probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
should decide if we want to allow changing the file w/o changing ctime, if
that is required even though the file is not visibly changing.  Maybe the
ctime update should be implicit if the size or mtime are changing?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger
On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
> +#define FA_FL_DEALLOC0x01 /* default is allocate */
> +#define FA_FL_KEEP_SIZE  0x02 /* default is extend/shrink size */
> +#define FA_FL_DEL_DATA   0x04 /* default is keep written data on DEALLOC 
> */

In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
each extent.  For some workloads this would be much faster than truncate
and reallocate of all the blocks in a file.

In that light, please change the comment to /* default is keep existing data */
so that it doesn't imply this is only for DEALLOC.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger
On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
> I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
> *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
> If it is decided that these flags are also needed, I will update this
> patch. Thanks!

Can you clarify - what is the current behaviour when ENOSPC (or some other
error) is hit?  Does it keep the current fallocate() or does it free it?

For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
don't want to expose uninitialized disk blocks to userspace.  I'm not
sure if this makes sense at all.

> On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
> > Implement new flags and values for mode argument.
> > 
> > This patch implements the new flags and values for the "mode" argument
> > of the fallocate system call. It is based on the discussion between
> > Andreas Dilger and David Chinner on the man page proposed (by the later)
> > on fallocate.
> > 
> > Signed-off-by: Amit Arora <[EMAIL PROTECTED]>
> > 
> > Index: linux-2.6.22-rc4/include/linux/fs.h
> > ===
> > --- linux-2.6.22-rc4.orig/include/linux/fs.h
> > +++ linux-2.6.22-rc4/include/linux/fs.h
> > @@ -267,15 +267,16 @@ extern int dir_notify_enable;
> >  #define SYNC_FILE_RANGE_WAIT_AFTER 4
> > 
> >  /*
> > - * sys_fallocate modes
> > - * Currently sys_fallocate supports two modes:
> > - * FA_ALLOCATE  : This is the preallocate mode, using which an 
> > application/user
> > - *   may request (pre)allocation of blocks.
> > - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
> > - *   the preallocated blocks.
> > + * sys_fallocate mode flags and values
> >   */
> > -#define FA_ALLOCATE0x1
> > -#define FA_DEALLOCATE  0x2
> > +#define FA_FL_DEALLOC  0x01 /* default is allocate */
> > +#define FA_FL_KEEP_SIZE0x02 /* default is extend/shrink size */
> > +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC 
> > */
> > +
> > +#define FA_ALLOCATE0
> > +#define FA_DEALLOCATE  FA_FL_DEALLOC
> > +#define FA_RESV_SPACE  FA_FL_KEEP_SIZE
> > +#define FA_UNRESV_SPACE(FA_FL_DEALLOC | FA_FL_KEEP_SIZE | 
> > FA_FL_DEL_DATA)
> > 
> >  #ifdef __KERNEL__
> > 
> > Index: linux-2.6.22-rc4/fs/open.c
> > ===
> > --- linux-2.6.22-rc4.orig/fs/open.c
> > +++ linux-2.6.22-rc4/fs/open.c
> > @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
> >   * sys_fallocate - preallocate blocks or free preallocated blocks
> >   * @fd: the file descriptor
> >   * @mode: mode specifies if fallocate should preallocate blocks OR free
> > - *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
> > - *   FA_DEALLOCATE modes are supported.
> > + *   (unallocate) preallocated blocks.
> >   * @offset: The offset within file, from where (un)allocation is being
> >   * requested. It should not have a negative value.
> >   * @len: The amount (in bytes) of space to be (un)allocated, from the 
> > offset.
> >   *
> >   * This system call, depending on the mode, preallocates or unallocates 
> > blocks
> >   * for a file. The range of blocks depends on the value of offset and len
> > - * arguments provided by the user/application. For FA_ALLOCATE mode, if 
> > this
> > + * arguments provided by the user/application. For FA_ALLOCATE and
> > + * FA_RESV_SPACE modes, if the sys_fallocate()
> >   * system call succeeds, subsequent writes to the file in the given range
> >   * (specified by offset & len) should not fail - even if the file system
> >   * later becomes full. Hence the preallocation done is persistent (valid
> > - * even after reopen of the file and remount/reboot).
> > + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
> > + * is passed, the file size will not be changed even if the preallocation
> > + * is beyond EOF.
> >   *
> >   * It is expected that the ->fallocate() inode operation implemented by the
> >   * individual file systems will update the file size and/or ctime/mtime
> > - * depending on the mode and also on the success of the operation.
> > + * depending on the mode (change is visible to user or not - say file size)
> > + * and obviously, on the success of the operation.
> >   *
> >   * Note: Incase the file system does not support preallocation,
> >   * posix_fal

Re: [36/37] Large blocksize support for ext2

2007-06-20 Thread Andreas Dilger
On Jun 20, 2007  14:27 -0700, Christoph Lameter wrote:
> > > Hmmm... Actually there is nothing additional to be done after the earlier
> > > cleanup of the macros. So just modify copyright.
> > 
> > It is NOT possible to have 64kB blocksize on ext2/3/4 without some small
> > changes to the directory handling code.  The reason is that an empty 64kB
> > directory block would have a rec_len == (__u16)2^16 == 0, and this would
> > cause an error to be hit in the filesystem.  What is needed is to put
> > 2 empty records in such a directory, or to special-case an impossible
> > value like rec_len = 0x to handle this.
> > 
> > There was a patch to fix the 64kB blocksize directory problem, but it
> > hasn't been merged anywhere yet seeing as there wasn't previously a
> > patch to allow larger blocksize...
> 
> mke2fs allows to specify a 64kb blocksize and IA64 can run with 64kb 
> PAGE_SIZE. So this is a bug in ext2fs that needs to be fixed regardless.

True.  I had increased the e2fsprogs blocksize to 16kB after testing it,
and after that it seems Ted increased it to 64kB after that.  The 64kB
directory problem only came out recently.

> > Having 32kB blocksize has no problems that I'm aware of.  Also, I'm not
> > sure how it happened, but ext2 SHOULD have an explicit check (as
> > ext3/4 does) limiting it to EXT2_MAX_BLOCK_SIZE.  Otherwise it appears
> > that there would be no error reported if the superblock reports e.g. 16MB
> > blocksize, and all kinds of things would break.
> 
> mke2fs fails for blocksizes > 64k so you are safe there. I'd like to see 
> that limit lifted?

I don't think extN can go to past 64kB blocksize in any case.

> > There shouldn't be a problem with increasing EXT{2,3,4}_MAX_BLOCK_SIZE to
> > 32kB (AFAIK), but I haven't looked into this in a while.
> 
> I'd love to see such a patch. That is also useful for arches that have 
> PAGE_SIZE > 4kb without this patchset.

Definitely, which is why we had been working on this originally.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [36/37] Large blocksize support for ext2

2007-06-20 Thread Andreas Dilger
On Jun 20, 2007  11:29 -0700, [EMAIL PROTECTED] wrote:
> This adds support for a block size of up to 64k on any platform.
> It enables the mounting filesystems that have a larger blocksize
> than the page size.

Might have been good to CC the ext2/3/4 maintainers here?  I definitely
have been waiting for a patch like this for ages (so definitely no
objection from me), but there are a few caveats before this will work
on ext2/3/4.

> Hmmm... Actually there is nothing additional to be done after the earlier
> cleanup of the macros. So just modify copyright.

It is NOT possible to have 64kB blocksize on ext2/3/4 without some small
changes to the directory handling code.  The reason is that an empty 64kB
directory block would have a rec_len == (__u16)2^16 == 0, and this would
cause an error to be hit in the filesystem.  What is needed is to put
2 empty records in such a directory, or to special-case an impossible
value like rec_len = 0x to handle this.

There was a patch to fix the 64kB blocksize directory problem, but it
hasn't been merged anywhere yet seeing as there wasn't previously a
patch to allow larger blocksize...

Having 32kB blocksize has no problems that I'm aware of.  Also, I'm not
sure how it happened, but ext2 SHOULD have an explicit check (as
ext3/4 does) limiting it to EXT2_MAX_BLOCK_SIZE.  Otherwise it appears
that there would be no error reported if the superblock reports e.g. 16MB
blocksize, and all kinds of things would break.

There shouldn't be a problem with increasing EXT{2,3,4}_MAX_BLOCK_SIZE to
32kB (AFAIK), but I haven't looked into this in a while.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [34/37] Large blocksize support in ramfs

2007-06-20 Thread Andreas Dilger
On Jun 20, 2007  11:29 -0700, [EMAIL PROTECTED] wrote:
> If you apply this patch and then you can f.e. try this:
> 
>   mount -tramfs -o10 none /media

> @@ -164,10 +165,15 @@ static int ramfs_fill_super(struct super
> + if (options && *options)
> + order = simple_strtoul(options, NULL, 10);

This is probably a bad name for a mount option.  What about "order=10"?
Otherwise you prevent any other option from being used in the future.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Versioning file system

2007-06-18 Thread Andreas Dilger
On Jun 16, 2007  16:53 +0200, Jörn Engel wrote:
> On Fri, 15 June 2007 15:51:07 -0700, alan wrote:
> > >Thus, in the end it turns out that this stuff is better handled by
> > >explicit version-control systems (which require explicit operations to
> > >manage revisions) and atomic snapshots (for backup.)
> > 
> > ZFS is the cool new thing in that space.  Too bad the license makes it 
> > hard to incorporate it into the kernel.
> 
> It may be the coolest, but there are others as well.  Btrfs looks good,
> nilfs finally has a cleaner and may be worth a try, logfs will get
> snapshots sooner or later.  Heck, even my crusty old cowlinks can be
> viewed as snapshots.
> 
> If one has spare cycles to waste, working on one of those makes more
> sense than implementing file versioning.

Too bad everyone is spending time on 10 similar-but-slightly-different
filesystems.  This will likely end up with a bunch of filesystems that
implement some easy subset of features, but will not get polished for
users or have a full set of features implemented (e.g. ACL, quota, fsck,
etc).  While I don't think there is a single answer to every question,
it does seem that the number of filesystem projects has climbed lately.

Maybe there should be a BOF at OLS to merge these filesystem projects
(btrfs, chunkfs, tilefs, logfs, etc) into a single project with multiple
people working on getting it solid, scalable (parallel readers/writers on
lots of CPUs), robust (checksums, failure localization), recoverable, etc.
I thought Val's FS summits were designed to get developers to collaborate,
but it seems everyone has gone back to their corners to work on their own
filesystem?

Working on getting hooks into DM/MD so that the filesystem and RAID layers
can move beyond "ignorance is bliss" when talking to each other would be
great.  Not rebuilding empty parts of the fs, limit parity resync to parts
of the fs that were in the previous transaction, use fs-supplied checksums
to verify on-disk data is correct, use RAID geometry when doing allocations,
etc.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  22:04 +1000, David Chinner wrote:
> On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
> > > B FA_DEALLOCATE
> > > removes the underlying disk space with the given range. The disk space
> > > shall be removed regardless of it's contents so both allocated space
> > > from
> > > B FA_ALLOCATE
> > > and
> > > B FA_PREALLOCATE
> > > as well as from
> > > B write(3)
> > > will be removed.
> > > B FA_DEALLOCATE
> > > shall never remove disk blocks outside the range specified.
> > 
> > So this is essentially the same as "punch".
> 
> Depends on your definition of "punch".
> 
> > There doesn't seem to be
> > a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
> > end.
> 
> ftruncate()

No, that will delete written data also.  What I'm thinking is in cases
where an application does fallocate() to reserve a lot of space, and
when the application is finished it wants to unreserve any unused space.

> > > B FA_DEALLOCATE
> > > shall never change the file size. If changing the file size
> > > is required when deallocating blocks from an offset to end
> > > of file (or beyond end of file) is required,
> > > B ftuncate64(3)
> > > should be used.
> > 
> > This also seems to be a bit of a wart, since it isn't a natural converse
> > of either of the above functions.  How about having two modes,
> > similar to FA_ALLOCATE and FA_PREALLOCATE?
> 
> 
> 
> whatever.
> 
> > Say, FA_PUNCH (which
> > would be as you describe here - deletes all data in the specified
> > range changing the file size if it overlaps EOF,
> 
> Punch means different things to different people. To me (and probably
> most XFS aware ppl) punch implies no change to the file size.

If "punch" does not change the file size, how is it possible to determine
the end of the actual written data?  Say you have a file with records
in it, and these records are cancelled as they are processed (e.g. a
journal of sorts).  One usage model for punch() that we had in the past
is to punch out each record after it finishes processing, so that it will
not be re-processed after a crash.  If the file size doesn't change with
punch then there is no way to know when the last record is hit and the
rest of the file needs to be scanned.

> i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching
> holes to leave the file size unchanged. This is the behaviour I
> described for FA_DEALLOCATE.
> 
> > and FA_DEALLOCATE,
> > which only deallocates unused FA_{PRE,}ALLOCATE space?
> 
> That's an "unwritten-to-hole" extent conversion. Is that really
> useful for anything? That's easily implemented with FIEMAP
> and FA_DEALLOCATE.

But why force the application to do this instead of making the
fallocate API sensible and allowing it to be done directly?

> Anyway, because we can't agree on a single pair of flags:
> 
>   FA_ALLOCATE== posix_fallocate()
>   FA_DEALLOCATE  == unwritten-to-hole ???

I'd think this makes sense, being natural opposites of each other.
FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE
shouldn't erase existing data.  If FA_ALLOCATE extends the file size,
then FA_DEALLOCATE should shrink it if there is no data at the end.

>   FA_RESV_SPACE  == XFS_IOC_RESVSP64
>   FA_UNRESV_SPACE== XFS_IOC_UNRESVSP64

> > We might also consider making @mode be a mask instead of an enumeration:
> > 
> > FA_FL_DEALLOC   0x01 (default allocate)
> > FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
> > FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)
> 
> #define FA_ALLOCATE   0
> #define FA_DEALLOCATE FA_FL_DEALLOC
> #define FA_RESV_SPACE FA_FL_KEEP_SIZE
> #define FA_UNRESV_SPACE   FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA

OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64
clear at least.  The benefit is that it would also be possible (I'm
not necessarily advocating this as a flag, just an example) to have
semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while
preallocating) with:

#define FA_ZERO_SPACEFA_DEL_DATA

or whatever semantics the caller actually wants, instead of restricting
them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE
(whatever it is we decide on in the end).

> > > B ENOSPC
> > > There is not enough space left on the device containing the file
> > > referred to by
> > > IR fd.
> > 
> > Should probably say whether space is removed on failure or n

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread Andreas Dilger
On Jun 14, 2007  09:52 +1000, David Chinner wrote:
> B FA_PREALLOCATE
> provides the same functionality as
> B FA_ALLOCATE
> except it does not ever change the file size. This allows allocation
> of zero blocks beyond the end of file and is useful for optimising
> append workloads.
> TP
> B FA_DEALLOCATE
> removes the underlying disk space with the given range. The disk space
> shall be removed regardless of it's contents so both allocated space
> from
> B FA_ALLOCATE
> and
> B FA_PREALLOCATE
> as well as from
> B write(3)
> will be removed.
> B FA_DEALLOCATE
> shall never remove disk blocks outside the range specified.

So this is essentially the same as "punch".  There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
end.

> B FA_DEALLOCATE
> shall never change the file size. If changing the file size
> is required when deallocating blocks from an offset to end
> of file (or beyond end of file) is required,
> B ftuncate64(3)
> should be used.

This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions.  How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE?  Say, FA_PUNCH (which
would be as you describe here - deletes all data in the specified
range changing the file size if it overlaps EOF, and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?

We might also consider making @mode be a mask instead of an enumeration:

FA_FL_DEALLOC   0x01 (default allocate)
FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
FA_FL_DEL_DATA  0x04 (default keep written data on DEALLOC)

We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.

I suppose it might be a bit late in the game to add a "goal"
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS?  The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP).  I guess the one potential limitation here is the
number of function parameters on some architectures.

> B ENOSPC
> There is not enough space left on the device containing the file
> referred to by
> IR fd.

Should probably say whether space is removed on failure or not.  In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Read/write counts

2007-06-04 Thread Andreas Dilger
On Jun 04, 2007  06:20 -0400, David H. Lynch Jr. wrote:
> The net result is that implimentation would be simpler if I could
> just read/write, the amount of data that can be done with the least
> amount of work, even if that is less than was requested.
> 
> If I receive a request to read 512 bytes, and I return that I have read
> 486, is either the OS, libc, or something else going to treat that as an
> error, or are they coming back for the rest in a subsequent call ?
> 
> I though I recalled that read()/write() returning a cound less than
> requested is not an error.

It is not strictly an error to read/write less than the requested amount,
but you will find that a lot of applications don't handle this correctly.
They will assume that if the amount read/written is != amount requested
that this is an error.  Of course the opposite is also true - some
applications assume that the amount requested == amount read/written and
don't even check whether that is actually the case or not.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Andreas Dilger
On May 31, 2007  17:11 -0700, H. Peter Anvin wrote:
> NFS takes a binary option block anyway.  However, that's the exception,
> not the rule.

There was recently a patch submitted to linux-fsdevel to change NFS to
use text option parsing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/2] i_version update - ext4 part

2007-05-29 Thread Andreas Dilger
On May 29, 2007  12:44 -0700, Mingming Cao wrote:
> I am a little bit confused about the two patches. 
> 
> It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a
> new 64 bit i_fs_version field is added to ext4 inode structure for inode
> versioning support. read/store of this counter are properly handled, but
> missing the inode versioning counter update.

For the Lustre use of the inode version we don't care about the VFS changes
to i_version.  In fact - we want to be able to control the changes to
inode version ourselves so that e.g. file defragmenting or atime updates
don't change the inode version, and that recovery can restore the version
to a known state along with the rest of the metadata.

That said, since Lustre isn't in the kernel and we patch our version of
ext3 anyways it doesn't really matter what is done for NFS.  We will just
patch in our own behaviour if the final ext4 code isn't suitable in all
of the details.  Having 99% of the code the same at least makes this a
lot less work.

> But later in the second patch by Jean Noel, he re-used the VFS inode-
> >i_version for ext4 inode versioning, the counter is being updated every
> time the file is being changed. 

I don't know what the NFS requirements for the version are.  There may
also be some complaints from others if the i_version is 64 bits because
this contributes to generic inode growth and isn't used for other
filesystems.

> To me, i_fs_version and inode_version are the same thing, right?
> Shouldn't we choose one(I assume inode i_version?), and combine these
> two patch together? How about split the inode versioning part from the
> ext4_expand_inode_extra_isize patch(it does multiple things, and
> i_versioning doesn't longs there) and put it together with the rest of
> i_version update patches?

I don't have an objection to that, but I don't think it is required.

> BTW, how could NFS/user space to access the inode version counter?

If the Bull patch uses i_version then knfsd can just access it directly.
I don't think there is any API to access it from userspace.  One option
is to add a virtual EA like user.inode_version and have the kernel fill
this in from i_version.

Lustre will manipulate the ei->i_fs_version directly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Andreas Dilger
On May 25, 2007  17:58 +1000, Neil Brown wrote:
>These devices would find it very hard to support BIO_RW_BARRIER.
>Doing this would require keeping track of all in-flight requests
>(which some, possibly all, of the above don't) and then:
>  When a BIO_RW_BARRIER request arrives:
> wait for all pending writes to complete
> call blkdev_issue_flush on all devices
> issue the barrier write to the target device(s)
>as BIO_RW_BARRIER,
> if that is -EOPNOTSUP, re-issue, wait, flush.

We noticed when testing the SLES10 kernel (which has barriers enabled
by default) that ext3 write throughput went from about 170MB/s to about
130MB/s (on high-end RAID storage using no-op scheduler).

The reason (as far as we could tell) is that the barriers are implemented
by flushing and waiting for all previosly submitted IOs to finish, but
all that ext3/jbd really care about is that the journal blocks are safely
on disk.

Since the journal blocks are only a small fraction of the total IO in
flight, the barrier + write cache ends up being a lot worse than just
doing synchronous IO with the write cache disabled because no new IO can
be submitted past the barrier, and since that IO is large and contiguous
it might complete much faster than the scattered metadata updates that are
also being checkpointed to disk from the previous transactions.  With jbd
there can be both a running and a committing transaction, and multiple
checkpointing transactions, and the use of barriers breaks this important
optimization.

If ext3 used an external journal this problem would be avoided,
but then there isn't really a need for barriers in the first place, since
the jbd code already will handle the wait for the commit block itself.

We've got a pretty-much complete version of the ext3 journal checksumming
patch that avoids the need to do the pre-commit barrier, since the checksum
can verify at recovery time whether all of the transaction's blocks made
it to disk or not (which is what the commit block is all about in the end).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 4/5] inode reservation v0.1 (benchmark result)

2007-05-24 Thread Andreas Dilger
On May 24, 2007  02:08 +0800, coly wrote:
> Due to the bad design of magic inode and the on-disk layout of magic
> inode. When 30 files created alternatively in each directory, no
> performance advantage exists. When 50 files created alternatively in
> each directory, the patched ext4 will use double time on removing all
> the files and directories.

I don't think the use of magic inodes is the right approach.  One possibility
to avoid changing the on-disk format at all is to only do the reservation in
memory, scaling the reservation with the size of the directory.

The only issue that arises is how to regenerate the same reservation
after a remount.  This might be possible to do by looking into the leaf
block at create time to see which inode numbers are already in use for
that leaf and checking whether there are free inodes in each group.

One way to get the "best" mapping is possibly checking groups in order of
decreasing number of inodes for that leaf in each group and once a suitable
group has been found doing a few name->hash->inode numbers to get the old
mapping back.  Once this leaf->group mapping has been established it
can be re-used for a given leaf block until that window is full.

Since you need to scan all of a leaf block's dir entries in a hash block
at insert time to look for duplicate names, and the inode numbers are
in the dir entries, this shouldn't introduce any additional disk IO.

Also, regardless of what the mapping turns out to be - the goal is to place
inodes with a similar hash into nearby inodes, and this heuristic works
relatively well for that.  Once the given leaf block's inode range is full
then new inodes can be allocated from a new window as it was done for the
newly-created directory.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Andreas Dilger
On May 16, 2007  11:09 +1200, Jeff Zheng wrote:
> We are using two 3ware disk array controllers, each of them is connected
> 8 750GB harddrives. And we build a software raid0 on top of that. The
> total capacity is 5.5TB+5.5TB=11TB
> 
> We use jfs as the file-system, we have a test application that write
> data continuously to the disks. After writing 52 10GB files, jfs
> crashed. And we are not able to recover it, fsck doesn't recognise it
> anymore.
> We then tried xfs, same application, lasted a little longer, but gives
> kernel crash later.

Check if your kernel has CONFIG_LBD enabled.

The kernel doesn't check if the block layer can actually write to
a block device > 2TB.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5][TAKE2] fallocate system call

2007-05-14 Thread Andreas Dilger
On May 14, 2007  18:59 +0530, Amit K. Arora wrote:
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
> 
> fd: The descriptor of the open file.
> 
> mode*: This specifies the behavior of the system call. Currently the
>   system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
>   FA_ALLOCATE: Applications can use this mode to preallocate blocks to
> a given file (specified by fd). This mode changes the file size if
> the preallocation is done beyond the EOF. It also updates the
> ctime/mtime in the inode of the corresponding file, marking a
> successfull allocation.
>   FA_DEALLOCATE: This mode can be used by applications to deallocate the
> previously preallocated blocks. This also may change the file size
> and the ctime/mtime.
> * New modes might get added in future. One such new mode which is
>   already under discussion is FA_PREALLOCATE, which when used will
>   preallocate space but will not change the filesize and [cm]time.
>   Since the semantics of this new mode is not clear and agreed upon yet,
>   this patchset does not implement it currently.
> 
> offset: This is the offset in bytes, from where the preallocation should
>   start.
> 
> len: This is the number of bytes requested for preallocation (from
>   offset).

What is the return value?  I'd hope it is the number of bytes preallocated,
in case of interrupted preallocation for whatever reason (interrupt, out of
space, etc) like a regular write(2) call.  In this case the return type needs
to also be an loff_t to match @len.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 13/14] ext3 whiteout support

2007-05-14 Thread Andreas Dilger
On May 14, 2007  15:14 +0530, Bharata B Rao wrote:
>  #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV0x0008 /* Journal device */
>  #define EXT3_FEATURE_INCOMPAT_META_BG0x0010
> +#define EXT3_FEATURE_INCOMPAT_WHITEOUT   0x0020

Is this flag reserved with Ted?  It isn't listed in the e2fsprogs repo.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-10 Thread Andreas Dilger
On May 08, 2007  16:49 -0500, Serge E. Hallyn wrote:
> Quoting Andreas Dilger ([EMAIL PROTECTED]):
> > One of the important use cases I can see today is the ability to
> > split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
> > grained attributes.
> 
> Sounds plausible, though it suffers from both making capabilities far
> more cumbersome (i.e. finding the right capability for what you wanted
> to do) and backward compatibility.  Perhaps at that point we should
> introduce security.capabilityv2 xattrs.  A binary can then carry
> security.capability=CAP_SYS_ADMIN=p, and
> security.capabilityv2=cap_may_clone_mntns=p.

Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
12 bytes worth of data, so it is probably just better to keep extending
the original capability fields as was in the proposal.

> > What we definitely do NOT want to happen is an application that needs
> > priviledged access (e.g. e2fsck, mount) to stop running because the
> > new capabilities _would_ have been granted by the new kernel and are
> > not by the old kernel and STRICTXATTR is used.
> > 
> > To me it would seem that having extra capabilities on an old kernel
> > is relatively harmless if the old kernel doesn't know what they are.
> > It's like having a key to a door that you don't know where it is.
> 
> If we ditch the STRICTXATTR option do the semantics seem sane to you?

Seems reasonable.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger
On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
> 2) For FA_UNALLOCATE mode, should the file system allow unallocation
>of normal (non-preallocated) blocks (blocks allocated via
>regular write/truncate operations) also (i.e. work as punch()) ?
>- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
>  we need to finalize on the convention here as a general guideline
>  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

> 3) If above is true, the file size will need to be changed
>for "unallocation" when block holding the EOF gets unallocated.
>- If we do not "unallocate" normal (non-preallocated) blocks and we
>  do not change the file size on preallocation, then this is a
>  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE.

> 4) Should we update mtime & ctime on a successfull allocation/
>unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the "this file was modified"
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] file capabilities: accomodate >32 bit capabilities

2007-05-08 Thread Andreas Dilger
On May 08, 2007  14:17 -0500, Serge E. Hallyn wrote:
> As the capability set changes and distributions start tagging
> binaries with capabilities, we would like for running an older
> kernel to not necessarily make those binaries unusable.
> 
>   (0. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
>  when CONFIG_SECURITY=n.)
>   (1. Rename CONFIG_SECURITY_FS_CAPABILITIES to
>  CONFIG_SECURITY_FILE_CAPABILITIES)
>   2. Introduce CONFIG_SECURITY_FILE_CAPABILITIES_STRICTXATTR
>  which, when set, prevents loading binaries with capabilities
>  set which the kernel doesn't know about.  When not set,
>  such capabilities run, ignoring the unknown caps.
>   3. To accomodate 64-bit caps, specify that capabilities are
>  stored as
>   u32 version; u32 eff0; u32 perm0; u32 inh0;
>   u32 eff1; u32 perm1; u32 inh1; (etc)

Have you considered how such capabilities will be used in the future?
One of the important use cases I can see today is the ability to
split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
grained attributes.

What we definitely do NOT want to happen is an application that needs
priviledged access (e.g. e2fsck, mount) to stop running because the
new capabilities _would_ have been granted by the new kernel and are
not by the old kernel and STRICTXATTR is used.

To me it would seem that having extra capabilities on an old kernel
is relatively harmless if the old kernel doesn't know what they are.
It's like having a key to a door that you don't know where it is.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Andreas Dilger
On May 07, 2007  21:43 -0400, Theodore Tso wrote:
> On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
> > Userspace could presumably repair the mess in most situations by truncating
> > the file back again.  The kernel cannot do that because there might be live
> > data in amongst there.
> 
> Actually, the kernel could do it, in that could simply release all
> unitialized extents back to the system.  The problem is distinguishing
> between the unitialized extents that had just been newly added, versus
> the ones that had there from before.  (On the other hand, if the
> filesystem was completely full, releasing unitialized blocks wouldn't
> be the worse thing in the world to do, although releasing previously
> fallocated blocks probably does violate the princple of least
> surprise, even if it's what the user would have wanted.)

I tend to agree with this.  Having fallocate() fill up the filesystem
is exactly what the caller asked.  Doing a write() hit ENOSPC doesn't
trucate off the whole write either, nor does "dd" delete the whole file
when the filesystem is full.

Even checking the statfs() space before doing the fallocate() may be
counter intuitive, since it will return ENOSPC but the filesystem will
not actually be full.  Some applications (e.g. database) may WANT to
fill the filesystem and then get the actual file size back to avoid
trusting statfs() because of metadata overhead (e.g. indirect blocks).

One of the design goals for sys_fallocate() was to allow FA_DELALLOC
to deallocate unwritten extents in a safe manner.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
> Andreas Dilger wrote:
> >Actually, this is a non-issue.  The reason that it is handled for 
> >extent-only is that this is the only way to allocate space in the
> >filesystem without doing the explicit zeroing.
> 
> Precisely /how/ do you avoid the zeroing issue, for extents?
> 
> If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
> otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 07, 2007  13:58 -0700, Andrew Morton wrote:
> Final point: it's fairly disappointing that the present implementation is
> ext4-only, and extent-only.  I do think we should be aiming at an ext4
> bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger
On May 03, 2007  21:31 -0700, Andrew Morton wrote:
> On Thu, 26 Apr 2007 23:43:32 +0530 "Amit K. Arora" <[EMAIL PROTECTED]> wrote:
> > + * ext4_fallocate:
> > + * preallocate space for a file
> > + * mode is for future use, e.g. for unallocating preallocated blocks etc.
> > + */
> 
> This description is rather thin.  What is the filesystem's actual behaviour
> here?  If the file is using extents then the implementation will do
> .  If the file is using bitmaps then we will do .
> 
> But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

> > +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t 
> > len)
> > +{
> > +   handle_t *handle;
> > +   ext4_fsblk_t block, max_blocks;
> > +   int ret, ret2, nblocks = 0, retries = 0;
> > +   struct buffer_head map_bh;
> > +   unsigned int credits, blkbits = inode->i_blkbits;
> > +
> > +   /* Currently supporting (pre)allocate mode _only_ */
> > +   if (mode != FA_ALLOCATE)
> > +   return -EOPNOTSUPP;
> > +
> > +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
> > +   return -ENOTTY;
> 
> So we don't implement fallocate on bitmap-based files!  Well that's huge
> news.  The changelog would be an appropriate place to communicate this,
> along with reasons why, or a description of the plan to fix it.
> 
> Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

> > +   block = offset >> blkbits;
> > +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
> > +- block;
> > +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
> > +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
> > +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
> 
> Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
> space, and that this disk space will require an arbitrary amount of
> metadata, how can we work out how much journal space we'll be needing
> without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).

Also, since @path is not being given then truncate_mutex is not needed.

> > +   ret = ext4_ext_get_blocks(handle, inode, block,
> > + max_blocks, &map_bh,
> > + EXT4_CREATE_UNINITIALIZED_EXT, 0);
> > +   BUG_ON(!ret);
> 
> BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
> ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] fallocate system call

2007-05-03 Thread Andreas Dilger
On May 02, 2007  18:23 +0530, Amit K. Arora wrote:
> On Sun, Apr 29, 2007 at 10:25:59PM -0700, Chris Wedgwood wrote:
> > On Mon, Apr 30, 2007 at 10:47:02AM +1000, David Chinner wrote:
> > 
> > > For FA_ALLOCATE, it's supposed to change the file size if we
> > > allocate past EOF, right?
> > 
> > I would argue no.  Use truncate for that.
> 
> The patch I posted for ext4 *does* change the filesize after
> preallocation, if required (i.e. when preallocation is after EOF).
> I may have to change that, if we decide on not doing this.

I think I'd agree - it may be useful to allow preallocation beyond EOF
for some kinds of applications (e.g. PVR preallocating live TV in 10
minute segments or something, but not knowing in advance how long the
show will actually be recorded or the final encoded size).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-03 Thread Andreas Dilger
On May 02, 2007  20:57 +1000, David Chinner wrote:
> On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
> > HSM_READ is definitely _NOT_ required because all  
> > it means is "if the file is OFFLINE, bring it ONLINE and then return  
> > the extent map".
> 
> You've got the definition of HSM_READ wrong. If the flag is *not*
> set, then we bring everything back online and return the full extent
> map.
> 
> Specifying the flag indicates that we do *not* want the offline
> extents brought back online.  i.e. it is a HSM or a datamover
> (e.g. backup program) that is querying the extents and we want to
> known *exactly* what the current state of the file is right now.
> 
> So, if the HSM_READ flag is set, then the application is
> expecting the filesytem to be part of a HSM. Hence if it's not,
> it should return an error because somebody has done something wrong.

In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour as the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag.  Data is retrieved from HSM only if the
HSM_READ flag is set.  That's why the flag is called "HSM_READ" instead
of "HSM_NO_READ".

The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag.  It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise).  It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.



For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface?  I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like "where does the HSM
file live" and "how many copies are there".

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.

I don't think it makes the operation significantly more efficient than
say "ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP)" if an application actually
needs the data to be present instead of just returning mapping info that
includes "UNMAPPED.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger
On May 02, 2007  00:20 +1000, David Chinner wrote:
> My point was that there is a difference between specification and
> implementation - if the specification says something is compulsory,
> then they must be implemented in the filesystem. This is easy
> enough to ensure by code review - we don't need additional interface
> complexity for this

What you seem to be missing about my proposal is that the FLAG_INCOMPAT
is for future use by that part of the specification we haven't thought
of yet...  Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4,
and is much better than having version numbers for the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger
On May 01, 2007  14:22 +1000, David Chinner wrote:
> On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
> > Hmm, I'd thought "offline" would migrate to EXTENT_UNKNOWN, but I didn't
> 
> I disagree - why would you want to indicate the state is unknown when we know
> very well that it is offline?

If you don't like "UNKNOWN", what about "UNMAPPED"?  I just want a
catch-all flag that indicates "this extent contains data but there is
nothing sensible to be returned for the extent mapping."

> Effectively, when your extent is offline in the HSM, it is inaccessable, and
> you have to bring it back from tape so it becomes accessible again. i.e. some
> action is necessary on behalf of the user to make it accessible. So I think
> that OFFLINE is a good name for this state because it really is inaccessible.

What you are calling OFFLINE I would prefer to call UNMAPPED, since that
can be used by applications as a catch-all for "no mapping".  There can
be further flags that give refinements to UNMAPPED that some applications
might care about them (e.g. HSM_RESIDENT), but many users/apps will not
if they just want the number of fragments in a given file.

> Also, I don't think "secondary" is a good term because most large systems
> have more than one tier of storage. One possibility is "HSM_RESIDENT"
> which indicates the extent is current and resident with a HSM's archive

Sure.

> > Can you propose reasonable flag names for these (I can't think of anything
> > very good) and a clear explanation of what they mean.  I suspect it will
> > only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
> > the concept of stripe unit and stripe width, but as yet they are not
> > communicated between the two very well.  I'd be much happier if this info
> > could be queried in a standard way from the block layer instead of the
> > user having to specify it and the filesystem having to track it.
> 
> My preference is definitely for a separate ioctl to grab the
> filesystem geometry so this stuff can be calculated in userspace.
> i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
> bother trying to define names until we decide which appraoch we take
> to implement this.

Hmm, previously you wrote "This information could be easily passed up in the
flags fields if the filesystem has geometry information".  So, I _think_
what you are saying is that you want 4 flags to convey this start/end
alignment information, but the exact semantics of what a "stripe unit" and
a "stripe width" is filesystem specific?

I definitely do NOT want to get into any issues of querying the block
device geometry here.  I was just making a passing comment that ext4+mballoc
can already do RAID-specific allocation alignment, but it depends on the
admin to specify this information and it would be nice if there was some
easy way to get this from userspace/kernel interfaces.

Having an API that can request "tell me the number of blocks from this
offset until the next physical disk boundary" or similar would be useful
to any allocator, and the block layer already needs to know this when
submitting IO.

> In XFS, mkfs.xfs does the work of getting this information
> to see in the filesystem superblock. Here's the code for getting
> sunit/swidth from the underlying block device:
> 
> http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
> 
> Not much in common there ;)

It looks like this might be just what e2fsprogs needs also.

> > It does make sense to specify zero for the fm_extent_count array and a
> > new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
> > extent data itself, for the non-verbose mode of filefrag, and for
> > pre-allocating a buffer large enough to hold the file if that is important.
> 
> Rather than rely on implicit behaviour of "pass in extent count of
> zero and a don't try to return any extents" to return the number of
> extents on the file, why not just explicitly define this as a valid
> input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS

That's what I said, isn't it?  FIEMAP_FLAG_NO_EXTENTS.  I wonder if my
clever-clever for "return no extents" and "return number of extents"
is wasted :-/.

> > - does XFS return an extent for the metadata parts of the file (e.g. btree)?
> 
> No, but we can return the extent map for the attribute fork (i.e.
> extended attrs) if asked for (XFS_IOC_GETBMAPA).

This seems like it would be a useful addition to the interface also, having
FIEMAP_FLAG_METADATA request the return of metadata allocations too.

> > - does XFS return preallocated extents beyond EOF

Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger
On May 01, 2007  11:28 -0400, Theodore Tso wrote:
> On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
> > Except one other issue with online shrinking is that we need to move
> > inodes on occasion and this poses a bunch of other problems over just
> > remapping the data blocks.
> 
> Well, I did say "necessary", and not "sufficient".  But yes, moving
> inodes, especially if the inode is currently open gets interesting.  I
> don't think there are that many user space applications that would
> notice or care if the st_ino of an open file changed out from under
> them, but there are obviously userspace applications, such as tar,
> that would most definitely care.

I think "rm -r" does a LOT of this kind of operation, like:

stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)

I think "find" does the same to avoid security problems with malicious
path manipulation.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   >