Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-06-14 Thread David Chinner
On Thu, Jun 14, 2007 at 03:14:58AM -0600, Andreas Dilger wrote:
 On Jun 14, 2007  09:52 +1000, David Chinner wrote:
  .B FA_PREALLOCATE
  provides the same functionality as
  .B FA_ALLOCATE
  except it does not ever change the file size. This allows allocation
  of zeroed blocks beyond the end of file and is useful for optimising
  append workloads.
  .TP
  .B FA_DEALLOCATE
  removes the underlying disk space within the given range. The disk space
  shall be removed regardless of its contents, so space allocated by
  .B FA_ALLOCATE
  and
  .B FA_PREALLOCATE
  as well as by
  .B write(3)
  will be removed.
  .B FA_DEALLOCATE
  shall never remove disk blocks outside the range specified.
 
 So this is essentially the same as punch.

Depends on your definition of punch.

 There doesn't seem to be
 a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
 end.

ftruncate()

  .B FA_DEALLOCATE
  shall never change the file size. If a change to the file size
  is required when deallocating blocks from an offset to (or beyond)
  end of file,
  .B ftruncate64(3)
  should be used.
 
 This also seems to be a bit of a wart, since it isn't a natural converse
 of either of the above functions.  How about having two modes,
 similar to FA_ALLOCATE and FA_PREALLOCATE?

shrug

whatever.

 Say, FA_PUNCH (which
 would be as you describe here - deletes all data in the specified
 range changing the file size if it overlaps EOF,

Punch means different things to different people. To me (and probably
most XFS-aware people) punch implies no change to the file size.

i.e. anyone currently using XFS_IOC_UNRESVSP will expect punching
holes to leave the file size unchanged. This is the behaviour I
described for FA_DEALLOCATE.

 and FA_DEALLOCATE,
 which only deallocates unused FA_{PRE,}ALLOCATE space?

That's an unwritten-to-hole extent conversion. Is that really
useful for anything? That's easily implemented with FIEMAP
and FA_DEALLOCATE.

Anyway, because we can't agree on a single pair of flags:

FA_ALLOCATE     == posix_fallocate()
FA_DEALLOCATE   == unwritten-to-hole ???
FA_RESV_SPACE   == XFS_IOC_RESVSP64
FA_UNRESV_SPACE == XFS_IOC_UNRESVSP64

 We might also consider making @mode be a mask instead of an enumeration:
 
 FA_FL_DEALLOC     0x01 (default allocate)
 FA_FL_KEEP_SIZE   0x02 (default extend/shrink size)
 FA_FL_DEL_DATA    0x04 (default keep written data on DEALLOC)

i.e:

#define FA_ALLOCATE     0
#define FA_DEALLOCATE   FA_FL_DEALLOC
#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)

 I suppose it might be a bit late in the game to add a goal
 parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
 the API more suitable for XFS?

It would suffice for the simpler operations, I think, but we'll
rapidly run out of flags and we'll still need another interface
for doing complex stuff.

 The goal could be a single __u64, or
 a struct with e.g. __u64 byte offset (possibly also __u32 lun like
 in FIEMAP).  I guess the one potential limitation here is the
 number of function parameters on some architectures.

To be useful it needs to be a __u64.

  .B ENOSPC
  There is not enough space left on the device containing the file
  referred to by
  .IR fd .
 
 Should probably say whether space is removed on failure or not.  In

Right. I'd say on error you need to FA_DEALLOCATE to ensure any space
allocated was freed back up. That way the error handling in the allocate
functions is much simpler (i.e. no need to undo there).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-14 Thread Chuck Lever

Hi Chris-

John Stoffel wrote:

As a user of Netapps, having quotas (if only for reporting purposes)
and some way to migrate non-used files to slower/cheaper storage would
be great.

Ie. being able to setup two pools, one being RAID6, the other being
RAID1, where all currently accessed files are in the RAID1 setup, but
if un-used get migrated to the RAID6 area.  


And of course some way for efficient backups and more importantly
RESTORES of data which is segregated like this.  


I like the way dump and restore was handled in AFS (and now ZFS and 
NetApp).  There is a simple command to flatten a file system and send it 
to another system, which can receive it and re-expand it.  The 
dump/restore process uses snapshots and can easily send incremental 
backups which are significantly smaller than a level-0 dump.  This is somewhat 
better than rsync, because you don't need checksums to discover what 
data has changed -- you already have the new data segregated into 
copied-on-write blocks.


NetApp happens to use the standard NDMP protocol for sending the 
flattened file system.  NetApp uses it for synchronous replication, 
volume migration, and back up to nearline storage and tape.  AFS used 
vol dump and vol restore for migration, replication, and back-up. 
ZFS has the zfs send and zfs receive commands that do basically the 
same (Eric Kustarz recently published a blog entry that described how 
these work).  And of course, all file system objects are able to be sent 
this way:  streams, xattrs, ACLs, and so on are all supported.


Note also that NFSv4 supports the idea of migrated or replicated file 
objects.  All that is needed to support it is a mechanism on the servers 
to actually move the data.



Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-14 Thread Florian D.
Chris Mason wrote:
 The basic list of features looks like this:
[amazing stuff snipped]

 The current status is a very early alpha state, and the kernel code
 weighs in at a sparsely commented 10,547 lines.  I'm releasing now in
 hopes of finding people interested in testing, benchmarking,
 documenting, and contributing to the code.
ok, what kind of benchmarks would help you most? bonnie? compilebench?
something else?

is it possible to test it on top of LVM2 on RAID at this stage?

thanks,
florian


Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-14 Thread Chris Mason
On Thu, Jun 14, 2007 at 08:29:10PM +0200, Florian D. wrote:
 Chris Mason wrote:
  The basic list of features looks like this:
 [amazing stuff snipped]
 
  The current status is a very early alpha state, and the kernel code
  weighs in at a sparsely commented 10,547 lines.  I'm releasing now in
  hopes of finding people interested in testing, benchmarking,
  documenting, and contributing to the code.
 ok, what kind of benchmarks would help you most? bonnie? compilebench?
 something else?

Thanks! Let's start with a list of the things I know will go badly:

O_SYNC (not implemented)
O_DIRECT (not implemented)
aio (not implemented)
multi-threaded (brain dead tree locking)
things that fill the drive (will oops)
mmap() writes (not supported, mmap reads are ok)

Also, overlapping writes are not that well supported.  For example, tar
by default will write in 10k chunks, and btrfs_file_write currently cows
on every single write.  So, if your tar file has a bunch of 16k files,
it'll go much faster if you tell tar to use 16k (or 8k) buffers.

In general, I was hoping for a generic delayed allocation facility to
magically appear in the kernel, and so I haven't spent a lot of time
tuning btrfs_file_write for this yet.

Any other workload is fair game, and I'm especially interested in seeing
how badly the COW hurts.  For example, on a big file, I'd like to see
how much slower big sequential reads are after small random writes (fio
is good for this).  Or, writing to every file on the FS in random order
and then seeing how much slower we are at reading.

Benchmarks that stress the directory structure are interesting too, huge
numbers of files + directories etc.  Ric Wheeler's fs_mark has a lot of
options and output.

But, that's just my list, you can pick anything that you find
interesting ;)  Please try btrfsck after the run to see how well it
keeps up.  If you use blktrace to generate io traces, graphs can be
generated:

http://oss.oracle.com/~mason/seekwatcher/

Not that well documented, but drop me a line if you need help running
it.  btt is a good alternative to the graphs too, and easier to run.

 
 is it possible to test it on top of LVM2 on RAID at this stage?

Yes, I haven't done much multi-spindle testing yet, so I'm definitely
interested in these numbers.

-chris



Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching

2007-06-14 Thread Jack Stone
[EMAIL PROTECTED] wrote:
 On Sun, 10 Jun 2007, Pavel Machek wrote:
 But you have that regex in _user_ space, in a place where policy
 is loaded into kernel.
 
 then the kernel is going to have to call out to userspace every time a
 file is created or renamed, and the policy is going to be enforced
 incorrectly until userspace finishes labeling/relabeling whatever is
 moved. building this sort of race condition for security into the kernel
 is highly questionable at best.
 
 AA has a regex parser in _kernel_ space, which is very wrong.
 
 see Linus' rants about why it's not automatically the best thing to move
 functionality into userspace.
 
 remember that the files covered by an AA policy can change as files are
 renamed. this isn't the case with SELinux so it doesn't have this sort
 of problem.

How about using the inotify interface on / to watch for file changes and
updating the SELinux policies on the fly? This could be done from a
userspace daemon and should require minimal SELinux changes.

The only possible problems I can see are the (hopefully) small gap
between the file change and updating the policy and the performance
problems of watching the whole system for changes.

Just my $0.02.

Jack


Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching

2007-06-14 Thread david

On Thu, 14 Jun 2007, Jack Stone wrote:


[EMAIL PROTECTED] wrote:

On Sun, 10 Jun 2007, Pavel Machek wrote:

But you have that regex in _user_ space, in a place where policy
is loaded into kernel.


then the kernel is going to have to call out to userspace every time a
file is created or renamed, and the policy is going to be enforced
incorrectly until userspace finishes labeling/relabeling whatever is
moved. building this sort of race condition for security into the kernel
is highly questionable at best.


AA has a regex parser in _kernel_ space, which is very wrong.


see Linus' rants about why it's not automatically the best thing to move
functionality into userspace.

remember that the files covered by an AA policy can change as files are
renamed. this isn't the case with SELinux so it doesn't have this sort
of problem.


How about using the inotify interface on / to watch for file changes and
updating the SELinux policies on the fly. This could be done from a
userspace daemon and should require minimal SELinux changes.

The only possible problems I can see are the (hopefully) small gap
between the file change and updating the policy and the performance
problems of watching the whole system for changes.


as was mentioned by someone else, if you rename a directory this can 
result in millions of files that need to be relabeled (or otherwise have 
the policy changed for them).


that can take a significant amount of time to do.

David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


NILFS version 2 now available

2007-06-14 Thread amagai
Hi,

NILFS (a New Implementation of a Log-structured Filesystem) version 2 is
now available at the project website:

http://www.nilfs.org/

If you are interested, please visit our website.

NILFS version 2 is equipped with a garbage collector that can keep
numerous snapshots of a NILFS filesystem.  You can preserve consistent
states (checkpoints) of a NILFS filesystem as snapshots ``after'' the
states are established.  For example, if you have accidentally removed
some files, you can retrieve the snapshot that represents the past
state which still holds the vanished files.  Checkpoints are reclaimed
by the garbage collector unless they are marked as ``snapshot''.

We welcome any comments or contributions.
Thanks for your attention,

AMAGAI Yoshiji

Nippon Telegraph and Telephone Corporation
NTT Cyber Space Laboratories
Open Source Software Computing Project



[PATCH 0/2] [RFC] configfs: Pin config_items when in use by other subsystems

2007-06-14 Thread Joel Becker
Many folks know that I've been pretty stubborn on the subject of
configfs item removal.  configfs_rmdir() cannot currently be aborted by
a client driver.  This is to ensure that userspace has control - if
userspace wants to remove an item, it should have that ability.  The
client driver is left to handle the event.
However, there are dependencies in the kernel.  One kernel
subsystem may depend on a configfs item and be unable to handle that
item disappearing.  So we need a mechanism to describe this dependency.
After lots of beating me over the head, I've been convinced to give it a
shot.
The canonical example is ocfs2 and its heartbeat.  Today, you
can stop o2cb heartbeat (o2hb) by rmdir(2) of the heartbeat object in
configfs.  o2hb handles this just fine - all of o2hb's clients get
node-down notifications.  However, ocfs2 can't handle this.  When the
node stops speaking to the cluster it can't just continue, and it can't
force an unmount of itself.  It only has one solution - crash.  This is
ugly any way you look at it.
With the configfs_depend_item() API,
heartbeat_register_callback() can then pin the heartbeat item.  Any
rmdir(2) of the heartbeat item will return -EBUSY until
heartbeat_unregister_callback() removes the dependency.  A similar API
can be created for any other configfs client driver.
The first patch is the configfs mechanism.  The second patch is
the heartbeat use thereof.
Comments and curses welcome.

Joel

-- 

You must remember this:
 A kiss is just a kiss,
 A sigh is just a sigh.
 The fundamental rules apply
 As time goes by.

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127



[PATCH 2/2] ocfs2: Depend on configfs heartbeat items.

2007-06-14 Thread Joel Becker
ocfs2 mounts require a heartbeat region.  Use the new configfs_depend_item()
facility to actually depend on them so they can't go away from under us.

First, teach cluster/nodemanager.c to depend an item on the o2cb subsystem.
Then teach o2hb_register_callbacks to take a UUID and depend on the
appropriate region.  Finally, teach all users of o2hb to pass a UUID or
NULL if they don't require a pin.

Signed-off-by: Joel Becker [EMAIL PROTECTED]
---
 fs/ocfs2/cluster/heartbeat.c   |   64 +++-
 fs/ocfs2/cluster/heartbeat.h   |6 +++-
 fs/ocfs2/cluster/nodemanager.c |   10 ++
 fs/ocfs2/cluster/nodemanager.h |3 ++
 fs/ocfs2/cluster/tcp.c |8 +++--
 fs/ocfs2/dlm/dlmdomain.c   |8 +++--
 fs/ocfs2/heartbeat.c   |   10 +++---
 7 files changed, 92 insertions(+), 17 deletions(-)

diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 9791134..e331f4c 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1665,7 +1665,56 @@ void o2hb_setup_callback(struct o2hb_cal
 }
 EXPORT_SYMBOL_GPL(o2hb_setup_callback);
 
-int o2hb_register_callback(struct o2hb_callback_func *hc)
+static struct o2hb_region *o2hb_find_region(const char *region_uuid)
+{
+   struct o2hb_region *p, *reg = NULL;
+
+   assert_spin_locked(&o2hb_live_lock);
+
+   list_for_each_entry(p, &o2hb_all_regions, hr_all_item) {
+   if (!strcmp(region_uuid, config_item_name(&p->hr_item))) {
+   reg = p;
+   break;
+   }
+   }
+
+   return reg;
+}
+
+static int o2hb_region_get(const char *region_uuid)
+{
+   int ret = 0;
+   struct o2hb_region *reg;
+
+   spin_lock(&o2hb_live_lock);
+
+   reg = o2hb_find_region(region_uuid);
+   if (!reg)
+   ret = -ENOENT;
+   spin_unlock(&o2hb_live_lock);
+
+   if (!ret)
+   ret = o2nm_depend_item(&reg->hr_item);
+
+   return ret;
+}
+
+static void o2hb_region_put(const char *region_uuid)
+{
+   struct o2hb_region *reg;
+
+   spin_lock(&o2hb_live_lock);
+
+   reg = o2hb_find_region(region_uuid);
+
+   spin_unlock(&o2hb_live_lock);
+
+   if (reg)
+   o2nm_undepend_item(&reg->hr_item);
+}
+
+int o2hb_register_callback(const char *region_uuid,
+  struct o2hb_callback_func *hc)
 {
struct o2hb_callback_func *tmp;
struct list_head *iter;
@@ -1681,6 +1730,12 @@ int o2hb_register_callback(struct o2hb_c
goto out;
}
 
+   if (region_uuid) {
+   ret = o2hb_region_get(region_uuid);
+   if (ret)
+   goto out;
+   }
+
down_write(&o2hb_callback_sem);

list_for_each(iter, &hbcall->list) {
@@ -1702,16 +1757,21 @@ out:
 }
 EXPORT_SYMBOL_GPL(o2hb_register_callback);
 
-void o2hb_unregister_callback(struct o2hb_callback_func *hc)
+void o2hb_unregister_callback(const char *region_uuid,
+ struct o2hb_callback_func *hc)
 {
BUG_ON(hc->hc_magic != O2HB_CB_MAGIC);

mlog(ML_HEARTBEAT, "on behalf of %p for funcs %p\n",
 __builtin_return_address(0), hc);

+   /* XXX Can this happen _with_ a region reference? */
if (list_empty(&hc->hc_item))
return;

+   if (region_uuid)
+   o2hb_region_put(region_uuid);
+
down_write(&o2hb_callback_sem);

list_del_init(&hc->hc_item);
diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h
index cc6d40b..35397dd 100644
--- a/fs/ocfs2/cluster/heartbeat.h
+++ b/fs/ocfs2/cluster/heartbeat.h
@@ -69,8 +69,10 @@ void o2hb_setup_callback(struct o2hb_cal
 o2hb_cb_func *func,
 void *data,
 int priority);
-int o2hb_register_callback(struct o2hb_callback_func *hc);
-void o2hb_unregister_callback(struct o2hb_callback_func *hc);
+int o2hb_register_callback(const char *region_uuid,
+  struct o2hb_callback_func *hc);
+void o2hb_unregister_callback(const char *region_uuid,
+ struct o2hb_callback_func *hc);
 void o2hb_fill_node_map(unsigned long *map,
unsigned bytes);
 void o2hb_init(void);
diff --git a/fs/ocfs2/cluster/nodemanager.c b/fs/ocfs2/cluster/nodemanager.c
index 9f5ad0f..50a0524 100644
--- a/fs/ocfs2/cluster/nodemanager.c
+++ b/fs/ocfs2/cluster/nodemanager.c
@@ -900,6 +900,16 @@ static struct o2nm_cluster_group o2nm_cl
},
 };
 
+int o2nm_depend_item(struct config_item *item)
+{
+   return configfs_depend_item(&o2nm_cluster_group.cs_subsys, item);
+}
+
+void o2nm_undepend_item(struct config_item *item)
+{
+   configfs_undepend_item(&o2nm_cluster_group.cs_subsys, item);
+}
+
 static void __exit exit_o2nm(void)
 {
if (ocfs2_table_header)
diff --git a/fs/ocfs2/cluster/nodemanager.h b/fs/ocfs2/cluster/nodemanager.h
index 0705221..55ae1a0